# spaCy NLP Service Prototype: Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Build `nlp-service/`, a FastAPI service that parses free-text search queries into structured extractions (person names, role, dates, keywords) using spaCy, as a drop-in replacement for the current Ollama service.

**Architecture:** Five-step pipeline (NER → role detection → date parsing → keyword extraction → assembly) in `extractor.py`. `main.py` exposes `/parse` and `/health` via FastAPI. Models are baked into the Docker image at build time, so no volume is needed.

**Tech Stack:** Python 3.11, FastAPI 0.115, spaCy 3.8 (`de_core_news_sm` / `en_core_web_sm` / `es_core_news_sm`), dateparser 1.2, pytest

---

## File Map

| File | Responsibility |
|---|---|
| `nlp-service/models.py` | Pydantic request/response types (the extraction contract) |
| `nlp-service/extractor.py` | NLP pipeline: model loading + 5 extraction steps |
| `nlp-service/main.py` | FastAPI app: `/parse`, `/health`, lifespan model loading |
| `nlp-service/requirements.txt` | Python dependencies |
| `nlp-service/Dockerfile` | Image: python:3.11-slim, models baked in, non-root user |
| `nlp-service/CLAUDE.md` | Service-level docs |
| `nlp-service/test_extractor.py` | Unit + integration tests for the pipeline |
| `nlp-service/test_main.py` | HTTP contract tests for the FastAPI endpoints |

---

## Task 1: Scaffold (requirements.txt, CLAUDE.md, models.py)

**Files:**
- Create: `nlp-service/requirements.txt`
- Create: `nlp-service/CLAUDE.md`
- Create: `nlp-service/models.py`
- Create: `nlp-service/test_extractor.py` (skeleton only)

- [ ] **Step 1: Create `nlp-service/requirements.txt`**

```
fastapi[standard]==0.115.6
uvicorn[standard]==0.34.0
spacy>=3.8,<4.0
dateparser>=1.2,<2.0
pytest>=8.0,<9.0
httpx>=0.28,<1.0
```

- [ ] **Step 2: Create `nlp-service/CLAUDE.md`**

```markdown
# NLP Service

Lightweight FastAPI service that parses free-text search queries into structured extractions, replacing Ollama for the Familienarchiv NL search feature.

## Stack

- Python 3.11, FastAPI 0.115, spaCy 3.8, dateparser 1.2

## Endpoints

- `POST /parse`: parse a free-text query, return an extraction matching the `OllamaExtraction` contract
- `GET /health`: returns `{"status": "ok"}` when all models are loaded

## Running locally

\`\`\`bash
pip install -r requirements.txt
# spacy download takes one model name per call
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
uvicorn main:app --reload --port 8001

# In a second terminal:
curl -X POST http://localhost:8001/parse \
  -H "Content-Type: application/json" \
  -d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}'
\`\`\`

## Testing

\`\`\`bash
pytest -v
\`\`\`

## Design spec

See `docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md`.

## Notes

This is a **prototype** for extraction quality evaluation. No docker-compose integration or Java-side changes in this iteration. The extraction contract matches `OllamaExtraction` in `backend/src/main/java/org/raddatz/familienarchiv/search/`.
```

- [ ] **Step 3: Write the failing test for Pydantic models**

Create `nlp-service/test_extractor.py`:

```python
import pytest
from pydantic import ValidationError


# ── Models ──────────────────────────────────────────────────────────────────

def test_parse_request_valid():
    from models import ParseRequest
    req = ParseRequest(query="Briefe von Opa", lang="de")
    assert req.query == "Briefe von Opa"
    assert req.lang == "de"


def test_parse_request_rejects_unknown_lang():
    from models import ParseRequest
    with pytest.raises(ValidationError):
        ParseRequest(query="Letters from grandpa", lang="fr")


def test_parse_response_serializes_nulls():
    from models import ParseResponse
    resp = ParseResponse(
        personNames=["Opa"],
        personRole="sender",
        dateFrom=None,
        dateTo="1920-12-31",
        keywords=["brief"],
        rawQuery="Briefe von Opa",
    )
    data = resp.model_dump()
    assert data["dateFrom"] is None
    assert data["dateTo"] == "1920-12-31"
    assert data["personRole"] == "sender"
```

- [ ] **Step 4: Run to confirm failure**

```bash
cd nlp-service
pip install -r requirements.txt
pytest test_extractor.py::test_parse_request_valid -v
```

Expected: `ModuleNotFoundError: No module named 'models'`

- [ ] **Step 5: Create `nlp-service/models.py`**

```python
from __future__ import annotations

from typing import Literal

from pydantic import BaseModel


class ParseRequest(BaseModel):
    query: str
    lang: Literal["de", "en", "es"]


class ParseResponse(BaseModel):
    personNames: list[str]
    personRole: Literal["sender", "receiver", "any"]
    dateFrom: str | None
    dateTo: str | None
    keywords: list[str]
    rawQuery: str
```

- [ ] **Step 6: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_parse_request_valid \
       test_extractor.py::test_parse_request_rejects_unknown_lang \
       test_extractor.py::test_parse_response_serializes_nulls -v
```

Expected: `3 passed`

- [ ] **Step 7: Commit**

```bash
git add nlp-service/
git commit -m "feat(nlp-service): scaffold models, requirements, CLAUDE.md"
```

---

## Task 2: spaCy model loading

**Files:**
- Create: `nlp-service/extractor.py`
- Modify: `nlp-service/test_extractor.py`

Before running these tests, the three spaCy models must be installed. `spacy download` takes one model name per call, so each model gets its own command:

```bash
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
```

- [ ] **Step 1: Write the failing tests**

Append to `nlp-service/test_extractor.py`:

```python
# ── Model loading ────────────────────────────────────────────────────────────

@pytest.fixture(scope="session")
def nlp_de():
    from extractor import get_nlp
    return get_nlp("de")


@pytest.fixture(scope="session")
def nlp_en():
    from extractor import get_nlp
    return get_nlp("en")


@pytest.fixture(scope="session")
def nlp_es():
    from extractor import get_nlp
    return get_nlp("es")


def test_get_nlp_de_loads(nlp_de):
    doc = nlp_de("Test")
    assert doc is not None


def test_get_nlp_en_loads(nlp_en):
    doc = nlp_en("Test")
    assert doc is not None


def test_get_nlp_es_loads(nlp_es):
    doc = nlp_es("Prueba")
    assert doc is not None


def test_get_nlp_unknown_lang_raises():
    from extractor import get_nlp
    with pytest.raises(ValueError, match="Unsupported language"):
        get_nlp("fr")
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_extractor.py::test_get_nlp_de_loads -v
```

Expected: `ModuleNotFoundError: No module named 'extractor'`

- [ ] **Step 3: Create `nlp-service/extractor.py` with model loading**

```python
from __future__ import annotations

import re
from datetime import date

import dateparser
import spacy
from spacy.language import Language

from models import ParseResponse

# ── Language model registry ──────────────────────────────────────────────────

_MODEL_NAMES: dict[str, str] = {
    "de": "de_core_news_sm",
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
}

_nlp_cache: dict[str, Language] = {}


def get_nlp(lang: str) -> Language:
    """Load the pipeline for `lang` once and reuse it on later calls."""
    if lang not in _MODEL_NAMES:
        raise ValueError(f"Unsupported language: {lang!r}. Valid: {list(_MODEL_NAMES)}")
    if lang not in _nlp_cache:
        _nlp_cache[lang] = spacy.load(_MODEL_NAMES[lang])
    return _nlp_cache[lang]


def load_all_models() -> None:
    """Warm the cache for every supported language."""
    for lang in _MODEL_NAMES:
        get_nlp(lang)
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_get_nlp_de_loads \
       test_extractor.py::test_get_nlp_en_loads \
       test_extractor.py::test_get_nlp_es_loads \
       test_extractor.py::test_get_nlp_unknown_lang_raises -v
```

Expected: `4 passed`

- [ ] **Step 5: Commit**

```bash
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): spaCy model loading with get_nlp/load_all_models"
```

---

## Task 3: Person name extraction (NER)

**Files:**
- Modify: `nlp-service/extractor.py`
- Modify: `nlp-service/test_extractor.py`

- [ ] **Step 1: Write the failing tests**

Append to `nlp-service/test_extractor.py`:

```python
# ── Person name extraction ───────────────────────────────────────────────────

def _make_doc_with_ents(nlp, text: str, char_ents: list[tuple[int, int, str]]):
    """Create a Doc with manually injected entity spans (no NER model needed)."""
    doc = nlp.make_doc(text)
    spans = [doc.char_span(s, e, label=lbl) for s, e, lbl in char_ents]
    doc.ents = [sp for sp in spans if sp is not None]
    return doc


def test_extract_person_names_two_persons(nlp_de):
    from extractor import extract_person_names
    # "Briefe von Opa Hermann an Marie"
    #  0123456789012345678901234567890
    #            1111111111222222222233
    # "Opa Hermann" = 11..22, "Marie" = 26..31
    doc = _make_doc_with_ents(nlp_de, "Briefe von Opa Hermann an Marie", [
        (11, 22, "PER"),
        (26, 31, "PER"),
    ])
    assert extract_person_names(doc) == ["Opa Hermann", "Marie"]


def test_extract_person_names_preserves_order(nlp_de):
    from extractor import extract_person_names
    # Reversed: "Marie von Opa", so Marie comes first in text
    # "Marie" = 0..5, "Opa" = 10..13
    doc = _make_doc_with_ents(nlp_de, "Marie von Opa", [
        (0, 5, "PER"),
        (10, 13, "PER"),
    ])
    assert extract_person_names(doc) == ["Marie", "Opa"]


def test_extract_person_names_empty(nlp_de):
    from extractor import extract_person_names
    doc = _make_doc_with_ents(nlp_de, "Briefe aus dem Krieg", [])
    assert extract_person_names(doc) == []


def test_extract_person_names_ignores_non_per(nlp_de):
    from extractor import extract_person_names
    # DATE entity should not appear in personNames
    doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [(7, 11, "DATE")])
    assert extract_person_names(doc) == []
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_extractor.py::test_extract_person_names_two_persons -v
```

Expected: `ImportError: cannot import name 'extract_person_names' from 'extractor'`

- [ ] **Step 3: Add `extract_person_names` to `extractor.py`**

Add after the model loading section:

```python
# ── Step 1: Person name extraction ──────────────────────────────────────────

# The German and Spanish models label people "PER"; en_core_web_sm uses "PERSON".
_PERSON_LABELS = frozenset({"PER", "PERSON"})


def extract_person_names(doc) -> list[str]:
    """Return person entity texts in left-to-right span order."""
    return [ent.text for ent in doc.ents if ent.label_ in _PERSON_LABELS]
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_extract_person_names_two_persons \
       test_extractor.py::test_extract_person_names_preserves_order \
       test_extractor.py::test_extract_person_names_empty \
       test_extractor.py::test_extract_person_names_ignores_non_per -v
```

Expected: `4 passed`

- [ ] **Step 5: Commit**

```bash
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): NER person name extraction"
```

---

## Task 4: Role detection

**Files:**
- Modify: `nlp-service/extractor.py`
- Modify: `nlp-service/test_extractor.py`

Role is only meaningful when exactly one PER entity is found. The function checks:

1. Dependency-tree children of the PER span's root with `dep_` in `("case", "prep", "mo")`
2. Fallback: the token immediately before the span

- [ ] **Step 1: Write the failing tests**

Append to `nlp-service/test_extractor.py`:

```python
# ── Role detection ───────────────────────────────────────────────────────────

def test_role_sender_von(nlp_de):
    from extractor import detect_person_role
    # "Briefe von Marie": "von" immediately before "Marie"
    # B=0..6, ' '=6, v=7..10, ' '=10, M=11..16
    doc = _make_doc_with_ents(nlp_de, "Briefe von Marie", [(11, 16, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "de") == "sender"


def test_role_receiver_an(nlp_de):
    from extractor import detect_person_role
    # "Briefe an Marie": "an" immediately before "Marie"
    # B=0..6, ' '=6, a=7..9, ' '=9, M=10..15
    doc = _make_doc_with_ents(nlp_de, "Briefe an Marie", [(10, 15, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "de") == "receiver"


def test_role_two_persons_returns_any(nlp_de):
    from extractor import detect_person_role
    # "von Opa an Marie": two PER spans → always "any"
    # v=0..3, ' '=3, O=4..7, ' '=7, a=8..10, ' '=10, M=11..16
    doc = _make_doc_with_ents(nlp_de, "von Opa an Marie", [
        (4, 7, "PER"),
        (11, 16, "PER"),
    ])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "de") == "any"


def test_role_no_prep_returns_any(nlp_de):
    from extractor import detect_person_role
    # "Briefe Marie": no preposition
    # B=0..6, ' '=6, M=7..12
    doc = _make_doc_with_ents(nlp_de, "Briefe Marie", [(7, 12, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "de") == "any"


def test_role_empty_returns_any(nlp_de):
    from extractor import detect_person_role
    doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [])
    assert detect_person_role(doc, [], "de") == "any"


def test_role_sender_from_english(nlp_en):
    from extractor import detect_person_role
    # "letters from Marie": "from" before "Marie"
    # l=0..7, ' '=7, f=8..12, ' '=12, M=13..18
    doc = _make_doc_with_ents(nlp_en, "letters from Marie", [(13, 18, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "en") == "sender"


def test_role_receiver_to_english(nlp_en):
    from extractor import detect_person_role
    # "letters to Marie"
    # l=0..7, ' '=7, t=8..10, ' '=10, M=11..16
    doc = _make_doc_with_ents(nlp_en, "letters to Marie", [(11, 16, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "en") == "receiver"
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_extractor.py::test_role_sender_von -v
```

Expected: `ImportError: cannot import name 'detect_person_role' from 'extractor'`

- [ ] **Step 3: Add role detection constants and function to `extractor.py`**

Add after `extract_person_names`:

```python
# ── Step 2: Role detection ───────────────────────────────────────────────────

_SENDER_PREPS: dict[str, frozenset[str]] = {
    "de": frozenset({"von", "vom"}),
    "en": frozenset({"from", "by"}),
    "es": frozenset({"de", "por"}),
}

_RECEIVER_PREPS: dict[str, frozenset[str]] = {
    "de": frozenset({"an", "nach", "für"}),
    "en": frozenset({"to", "for"}),
    "es": frozenset({"para", "a"}),
}


def detect_person_role(doc, per_spans: list, lang: str) -> str:
    """Return 'sender', 'receiver', or 'any'.

    Only meaningful for single-PER queries: two-person queries always return
    'any' because Java derives direction from list position.
    """
    if len(per_spans) != 1:
        return "any"

    span = per_spans[0]
    root = span.root
    sender = _SENDER_PREPS[lang]
    receiver = _RECEIVER_PREPS[lang]

    # Primary: dependency-tree children of the PER root
    for child in root.children:
        if child.dep_ in ("case", "prep", "mo"):
            if child.lower_ in sender:
                return "sender"
            if child.lower_ in receiver:
                return "receiver"

    # Fallback: token immediately before the span start
    if span.start > 0:
        prev = doc[span.start - 1]
        if prev.lower_ in sender:
            return "sender"
        if prev.lower_ in receiver:
            return "receiver"

    return "any"
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_role_sender_von \
       test_extractor.py::test_role_receiver_an \
       test_extractor.py::test_role_two_persons_returns_any \
       test_extractor.py::test_role_no_prep_returns_any \
       test_extractor.py::test_role_empty_returns_any \
       test_extractor.py::test_role_sender_from_english \
       test_extractor.py::test_role_receiver_to_english -v
```

Expected: `7 passed`

- [ ] **Step 5: Commit**

```bash
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): role detection (sender/receiver/any)"
```

---

## Task 5: Date parsing

**Files:**
- Modify: `nlp-service/extractor.py`
- Modify: `nlp-service/test_extractor.py`

Direction is detected from the token immediately before each DATE span. For "zwischen/between/entre", both DATE spans form the range (sorted so earlier = `dateFrom`). A bare year with no direction token produces a closed year-range (`dateFrom` = Jan 1, `dateTo` = Dec 31).

Note: "nach" appears in both `_RECEIVER_PREPS["de"]` and the date-after set. This is safe: role detection only examines tokens before PER spans, and date parsing only examines tokens before DATE spans. They operate on different span types.
- [ ] **Step 1: Write the failing tests**

Append to `nlp-service/test_extractor.py`:

```python
# ── Date parsing ─────────────────────────────────────────────────────────────

def test_date_vor_1920(nlp_de):
    from extractor import extract_dates
    # "Briefe vor 1920": "1920" at chars 11..15
    doc = _make_doc_with_ents(nlp_de, "Briefe vor 1920", [(11, 15, "DATE")])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from is None
    assert date_to == "1920-12-31"


def test_date_nach_1900(nlp_de):
    from extractor import extract_dates
    # "Briefe nach 1900": "1900" at chars 12..16
    doc = _make_doc_with_ents(nlp_de, "Briefe nach 1900", [(12, 16, "DATE")])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from == "1900-01-01"
    assert date_to is None


def test_date_zwischen_1900_und_1920(nlp_de):
    from extractor import extract_dates
    # "zwischen 1900 und 1920"
    # z=0..8, ' '=8, 1900=9..13, ' '=13, u=14..17, ' '=17, 1920=18..22
    doc = _make_doc_with_ents(nlp_de, "zwischen 1900 und 1920", [
        (9, 13, "DATE"),
        (18, 22, "DATE"),
    ])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from == "1900-01-01"
    assert date_to == "1920-12-31"


def test_date_bare_year_makes_range(nlp_de):
    from extractor import extract_dates
    # "Briefe 1920": no direction token → year-range
    # B=0..6, ' '=6, 1920=7..11
    doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [(7, 11, "DATE")])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from == "1920-01-01"
    assert date_to == "1920-12-31"


def test_date_no_date_entity(nlp_de):
    from extractor import extract_dates
    doc = _make_doc_with_ents(nlp_de, "Briefe von Opa", [])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from is None
    assert date_to is None


def test_date_before_english(nlp_en):
    from extractor import extract_dates
    # "letters before 1920": "1920" at chars 15..19
    doc = _make_doc_with_ents(nlp_en, "letters before 1920", [(15, 19, "DATE")])
    date_from, date_to = extract_dates(doc, "en")
    assert date_from is None
    assert date_to == "1920-12-31"


def test_date_after_english(nlp_en):
    from extractor import extract_dates
    # "letters after 1900": "1900" at chars 14..18
    doc = _make_doc_with_ents(nlp_en, "letters after 1900", [(14, 18, "DATE")])
    date_from, date_to = extract_dates(doc, "en")
    assert date_from == "1900-01-01"
    assert date_to is None


def test_date_year_token_without_date_entity(nlp_de):
    from extractor import extract_dates
    # The German model never emits DATE: a bare year token must still be found
    doc = _make_doc_with_ents(nlp_de, "Briefe vor 1920", [])
    assert extract_dates(doc, "de") == (None, "1920-12-31")


def test_date_antes_de_spanish(nlp_es):
    from extractor import extract_dates
    # "cartas antes de 1920": the direction word sits two tokens before the year
    doc = _make_doc_with_ents(nlp_es, "cartas antes de 1920", [])
    assert extract_dates(doc, "es") == (None, "1920-12-31")
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_extractor.py::test_date_vor_1920 -v
```

Expected: `ImportError: cannot import name 'extract_dates' from 'extractor'`

- [ ] **Step 3: Add date parsing to `extractor.py`**

Add after `detect_person_role`:

```python
# ── Step 3: Date parsing ─────────────────────────────────────────────────────

_YEAR_RE = re.compile(r"^\d{4}$")

_DATE_BEFORE: dict[str, frozenset[str]] = {
    "de": frozenset({"vor"}),
    "en": frozenset({"before"}),
    "es": frozenset({"antes"}),
}

_DATE_AFTER: dict[str, frozenset[str]] = {
    "de": frozenset({"nach"}),
    "en": frozenset({"after"}),
    "es": frozenset({"después", "despues"}),
}

_DATE_BETWEEN: dict[str, frozenset[str]] = {
    "de": frozenset({"zwischen"}),
    "en": frozenset({"between"}),
    "es": frozenset({"entre"}),
}


def _parse_date_text(text: str, lang: str) -> date | None:
    text = text.strip()
    if _YEAR_RE.match(text):
        year = int(text)
        if 1000 < year < 3000:
            return date(year, 1, 1)
    parsed = dateparser.parse(
        text,
        languages=[lang],
        settings={"PREFER_DAY_OF_MONTH": "first", "RETURN_AS_TIMEZONE_AWARE": False},
    )
    return parsed.date() if parsed else None


def _year_end(d: date) -> date:
    """If d is Jan 1, return Dec 31 of the same year (year-only boundary)."""
    if d.month == 1 and d.day == 1:
        return date(d.year, 12, 31)
    return d


def _date_spans(doc) -> list:
    """Return DATE entities, or four-digit year tokens when there are none.

    de_core_news_sm and es_core_news_sm only label LOC, MISC, ORG and PER,
    so real German and Spanish pipelines never emit DATE. The year-token
    fallback keeps "vor 1920" and "antes de 1920" working.
    """
    spans = [ent for ent in doc.ents if ent.label_ == "DATE"]
    if spans:
        return spans
    return [doc[tok.i : tok.i + 1] for tok in doc if _YEAR_RE.match(tok.text)]


def _direction_token(doc, span, lang: str) -> str:
    """Lowercased token before the span, skipping the Spanish linker "de"."""
    i = span.start - 1
    if lang == "es" and i > 0 and doc[i].lower_ == "de":
        i -= 1  # "antes de 1920" → look at "antes"
    return doc[i].lower_ if i >= 0 else ""


def extract_dates(doc, lang: str) -> tuple[str | None, str | None]:
    """Return (date_from, date_to) as ISO strings or None."""
    date_spans = _date_spans(doc)
    if not date_spans:
        return None, None

    between_tokens = _DATE_BETWEEN[lang]
    before_tokens = _DATE_BEFORE[lang]
    after_tokens = _DATE_AFTER[lang]

    # "zwischen X und Y" / "between X and Y": two DATE spans form a range
    has_between = any(tok.lower_ in between_tokens for tok in doc)
    if has_between and len(date_spans) >= 2:
        parsed = []
        for span in date_spans[:2]:
            d = _parse_date_text(span.text, lang)
            if d:
                parsed.append(d)
        if len(parsed) == 2:
            parsed.sort()
            return parsed[0].isoformat(), _year_end(parsed[1]).isoformat()

    # Single DATE span: use the direction token
    span = date_spans[0]
    d = _parse_date_text(span.text, lang)
    if not d:
        return None, None

    prev_lower = _direction_token(doc, span, lang)
    if prev_lower in before_tokens:
        return None, _year_end(d).isoformat()
    if prev_lower in after_tokens:
        return d.isoformat(), None

    # Bare year/date: closed year-range
    return d.isoformat(), _year_end(d).isoformat()
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_date_vor_1920 \
       test_extractor.py::test_date_nach_1900 \
       test_extractor.py::test_date_zwischen_1900_und_1920 \
       test_extractor.py::test_date_bare_year_makes_range \
       test_extractor.py::test_date_no_date_entity \
       test_extractor.py::test_date_before_english \
       test_extractor.py::test_date_after_english \
       test_extractor.py::test_date_year_token_without_date_entity \
       test_extractor.py::test_date_antes_de_spanish -v
```

Expected: `9 passed`

- [ ] **Step 5: Commit**

```bash
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): date range extraction with direction detection"
```

---

## Task 6: Keyword extraction

**Files:**
- Modify: `nlp-service/extractor.py`
- Modify: `nlp-service/test_extractor.py`

Keywords are POS-filtered content words (NOUN or PROPN, non-stop, length ≥ 3, not inside any NER span). They are passed to Java's `resolveTags()`, which fuzzy-matches them against the tag table, so Python performs no tag lookup.
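
As a worked example of the four filters, the sketch below applies them to hand-written token records instead of a spaCy `Doc`, so the rules can be checked without loading a model. The `Tok` record and `filter_keywords` are hypothetical stand-ins for illustration, and the POS tags, lemmas and stop-word flags are written by hand rather than produced by a model.

```python
from __future__ import annotations

from typing import NamedTuple


class Tok(NamedTuple):
    """Hand-written stand-in for the spaCy token attributes the filter reads."""
    lemma: str
    pos: str
    is_stop: bool = False
    in_entity: bool = False


def filter_keywords(tokens: list[Tok]) -> list[str]:
    """Keep NOUN/PROPN lemmas that are non-stop, length >= 3, outside entities."""
    keywords: list[str] = []
    for tok in tokens:
        lemma = tok.lemma.lower()
        if tok.in_entity or tok.is_stop:
            continue
        if tok.pos not in ("NOUN", "PROPN") or len(lemma) < 3:
            continue
        if lemma not in keywords:  # de-duplicate, keep first-seen order
            keywords.append(lemma)
    return keywords


# "Briefe von Hermann aus dem Krieg", with "Hermann" inside a PER span
query = [
    Tok("Brief", "NOUN"),
    Tok("von", "ADP", is_stop=True),
    Tok("Hermann", "PROPN", in_entity=True),
    Tok("aus", "ADP", is_stop=True),
    Tok("der", "DET", is_stop=True),
    Tok("Krieg", "NOUN"),
]
print(filter_keywords(query))  # ['brief', 'krieg']
```

"Hermann" is dropped even though it is a PROPN, because person names travel in `personNames` and should not be matched against tags a second time.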
- [ ] **Step 1: Write the failing tests**

Append to `nlp-service/test_extractor.py`:

```python
# ── Keyword extraction ───────────────────────────────────────────────────────

def test_keywords_extracts_nouns(nlp_de):
    from extractor import extract_keywords
    # Use real NLP for POS tags; disable NER to control entities manually
    doc = nlp_de("Briefe aus dem Krieg", disable=["ner"])
    keywords = extract_keywords(doc, [])
    # "Briefe" (NOUN, lemma "Brief") and "Krieg" (NOUN) should appear
    assert "brief" in keywords
    assert "krieg" in keywords


def test_keywords_excludes_stopwords(nlp_de):
    from extractor import extract_keywords
    doc = nlp_de("Briefe aus dem Krieg", disable=["ner"])
    keywords = extract_keywords(doc, [])
    # "dem" is a stopword article (DET) and must not appear
    assert "dem" not in keywords


def test_keywords_excludes_per_ner_spans(nlp_de):
    from extractor import extract_keywords
    # Run full NLP so the POS tagger fires, then inject a PER span over "Hermann"
    doc = nlp_de("Briefe von Hermann")
    per_span = doc.char_span(11, 18, label="PER")  # "Hermann" = 11..18
    if per_span:
        doc.ents = [per_span]
    keywords = extract_keywords(doc, list(doc.ents))
    assert "hermann" not in keywords


def test_keywords_excludes_short_lemmas(nlp_de):
    from spacy.tokens import Doc
    from extractor import extract_keywords
    # Hand-built Doc so the length rule is tested in isolation:
    # "Ei" is a NOUN and not a stop word, but only two characters long
    doc = Doc(
        nlp_de.vocab,
        words=["Ei", "Krieg"],
        pos=["NOUN", "NOUN"],
        lemmas=["Ei", "Krieg"],
    )
    assert extract_keywords(doc, []) == ["krieg"]


def test_keywords_deduplicates(nlp_de):
    from extractor import extract_keywords
    doc = nlp_de("Brief Brief Krieg", disable=["ner"])
    keywords = extract_keywords(doc, [])
    assert keywords.count("brief") == 1
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_extractor.py::test_keywords_extracts_nouns -v
```

Expected: `ImportError: cannot import name 'extract_keywords' from 'extractor'`

- [ ] **Step 3: Add keyword extraction to `extractor.py`**

Add after `extract_dates`:

```python
# ── Step 4: Keyword extraction ───────────────────────────────────────────────

def extract_keywords(doc, excluded_spans: list) -> list[str]:
    """Return lowercased lemmas of content words not inside any NER span."""
    excluded_indices: set[int] = set()
    for span in excluded_spans:
        excluded_indices.update(range(span.start, span.end))

    seen: set[str] = set()
    keywords: list[str] = []
    for token in doc:
        if token.i in excluded_indices:
            continue
        if token.pos_ not in ("NOUN", "PROPN"):
            continue
        if token.is_stop:
            continue
        lemma = token.lemma_.lower()
        if len(lemma) < 3:
            continue
        if lemma not in seen:
            seen.add(lemma)
            keywords.append(lemma)
    return keywords
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_keywords_extracts_nouns \
       test_extractor.py::test_keywords_excludes_stopwords \
       test_extractor.py::test_keywords_excludes_per_ner_spans \
       test_extractor.py::test_keywords_excludes_short_lemmas \
       test_extractor.py::test_keywords_deduplicates -v
```

Expected: `5 passed`

- [ ] **Step 5: Commit**

```bash
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): keyword extraction (POS-filtered, deduped lemmas)"
```

---

## Task 7: Full `extract()` function

**Files:**
- Modify: `nlp-service/extractor.py`
- Modify: `nlp-service/test_extractor.py`

This assembles all steps. Tests here use **real NLP** (no synthetic docs) to validate actual extraction quality.
- [ ] **Step 1: Write the failing tests**

Append to `nlp-service/test_extractor.py`:

```python
# ── Full extract() pipeline ──────────────────────────────────────────────────

def test_extract_dates_de():
    from extractor import extract
    result = extract("Briefe vor 1920", "de")
    assert result.dateFrom is None
    assert result.dateTo == "1920-12-31"
    assert result.rawQuery == "Briefe vor 1920"
    assert result.personNames == []
    assert result.personRole == "any"


def test_extract_keywords_from_topic_de():
    from extractor import extract
    result = extract("Briefe aus dem Krieg", "de")
    assert "krieg" in result.keywords
    assert result.dateFrom is None
    assert result.dateTo is None


def test_extract_dates_en():
    from extractor import extract
    result = extract("letters before 1920", "en")
    assert result.dateTo == "1920-12-31"
    assert result.dateFrom is None


def test_extract_dates_es():
    from extractor import extract
    result = extract("cartas antes de 1920", "es")
    assert result.dateTo == "1920-12-31"
    assert result.dateFrom is None


def test_extract_rawquery_echoed():
    from extractor import extract
    q = "Texte über Weihnachten"
    result = extract(q, "de")
    assert result.rawQuery == q


def test_extract_response_fields_are_complete():
    from extractor import extract
    result = extract("Briefe 1900", "de")
    assert isinstance(result.personNames, list)
    assert result.personRole in ("sender", "receiver", "any")
    assert isinstance(result.keywords, list)
    assert result.rawQuery == "Briefe 1900"
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_extractor.py::test_extract_dates_de -v
```

Expected: `ImportError: cannot import name 'extract' from 'extractor'`

- [ ] **Step 3: Add `extract()` to `extractor.py`**

Add at the bottom of `extractor.py`:

```python
# ── Step 5: Assembly ─────────────────────────────────────────────────────────

def extract(query: str, lang: str) -> ParseResponse:
    """Run the full NLP pipeline and return a ParseResponse."""
    nlp = get_nlp(lang)
    doc = nlp(query)

    # The German and Spanish models label people "PER"; en_core_web_sm uses "PERSON"
    per_spans = [ent for ent in doc.ents if ent.label_ in ("PER", "PERSON")]
    person_names = extract_person_names(doc)
    person_role = detect_person_role(doc, per_spans, lang)
    date_from, date_to = extract_dates(doc, lang)
    keywords = extract_keywords(doc, list(doc.ents))

    return ParseResponse(
        personNames=person_names,
        personRole=person_role,
        dateFrom=date_from,
        dateTo=date_to,
        keywords=keywords,
        rawQuery=query,
    )
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_extract_dates_de \
       test_extractor.py::test_extract_keywords_from_topic_de \
       test_extractor.py::test_extract_dates_en \
       test_extractor.py::test_extract_dates_es \
       test_extractor.py::test_extract_rawquery_echoed \
       test_extractor.py::test_extract_response_fields_are_complete -v
```

Expected: `6 passed`

- [ ] **Step 5: Run the full test suite to confirm no regressions**

```bash
pytest test_extractor.py -v
```

Expected: all tests pass

- [ ] **Step 6: Commit**

```bash
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): full extract() pipeline assembling all steps"
```

---

## Task 8: FastAPI app

**Files:**
- Create: `nlp-service/main.py`
- Create: `nlp-service/test_main.py`

- [ ] **Step 1: Write the failing tests**

Create `nlp-service/test_main.py`:

```python
import pytest
from fastapi.testclient import TestClient


@pytest.fixture(scope="session")
def client():
    from main import app
    # The context manager runs the lifespan hook, so models load before the tests
    with TestClient(app) as c:
        yield c


def test_health(client):
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "ok"}


def test_parse_returns_200_with_all_fields(client):
    response = client.post("/parse", json={"query": "Briefe vor 1920", "lang": "de"})
    assert response.status_code == 200
    data = response.json()
    assert "personNames" in data
    assert "personRole" in data
    assert data["personRole"] in ("sender", "receiver", "any")
    assert "dateFrom" in data
    assert "dateTo" in data
    assert "keywords" in data
    assert "rawQuery" in data
    assert data["rawQuery"] == "Briefe vor 1920"
    assert data["dateTo"] == "1920-12-31"


def test_parse_unknown_lang_returns_422(client):
    response = client.post("/parse", json={"query": "test", "lang": "fr"})
    assert response.status_code == 422


def test_parse_missing_query_returns_422(client):
    response = client.post("/parse", json={"lang": "de"})
    assert response.status_code == 422


def test_parse_all_languages(client):
    cases = [
        ("de", "Briefe vor 1920"),
        ("en", "letters before 1920"),
        ("es", "cartas antes de 1920"),
    ]
    for lang, query in cases:
        response = client.post("/parse", json={"query": query, "lang": lang})
        assert response.status_code == 200, f"Failed for lang={lang}"
        assert response.json()["dateTo"] == "1920-12-31", f"Wrong dateTo for lang={lang}"
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_main.py::test_health -v
```

Expected: `ModuleNotFoundError: No module named 'main'`

- [ ] **Step 3: Create `nlp-service/main.py`**

```python
import logging
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException

from extractor import extract, load_all_models
from models import ParseRequest, ParseResponse

logger = logging.getLogger(__name__)


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load all models once at startup so the first request is not slow
    logger.info("Loading spaCy models...")
    load_all_models()
    logger.info("All models ready.")
    yield


app = FastAPI(lifespan=lifespan)


@app.get("/health")
def health() -> dict:
    return {"status": "ok"}


@app.post("/parse", response_model=ParseResponse)
def parse(request: ParseRequest) -> ParseResponse:
    try:
        return extract(request.query, request.lang)
    except Exception as exc:
        logger.exception("Extraction pipeline failed")
        raise HTTPException(status_code=500, detail=str(exc)) from exc
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_main.py -v
```

Expected: `5 passed`

- [ ] **Step 5: Run the full test suite**

```bash
pytest -v
```

Expected: all tests pass

- [ ] **Step 6: Smoke-test the running service**

```bash
uvicorn main:app --reload --port 8001 &
sleep 2

curl -s http://localhost:8001/health
# Expected: {"status":"ok"}

curl -s -X POST http://localhost:8001/parse \
  -H "Content-Type: application/json" \
  -d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}' | python3 -m json.tool
# Expected (spaCy may or may not catch "Opa Hermann"/"Marie" as PER):
# {
#   "personNames": [...],
#   "personRole": "any",
#   "dateFrom": null,
#   "dateTo": "1920-12-31",
#   "keywords": ["brief"],
#   "rawQuery": "Briefe von Opa Hermann an Marie vor 1920"
# }

kill %1
```

- [ ] **Step 7: Commit**

```bash
git add nlp-service/main.py nlp-service/test_main.py
git commit -m "feat(nlp-service): FastAPI app with /parse and /health endpoints"
```

---

## Task 9: Dockerfile

**Files:**
- Create: `nlp-service/Dockerfile`

- [ ] **Step 1: Create `nlp-service/Dockerfile`**

```dockerfile
FROM python:3.11-slim

WORKDIR /app

RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Bake models into the image: no volume needed, ~350 MB total
RUN python -m spacy download de_core_news_sm \
    && python -m spacy download en_core_web_sm \
    && python -m spacy download es_core_news_sm

COPY . .

RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1001 nlp \
    && chown -R nlp:nlp /app
USER nlp

EXPOSE 8001

HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
    CMD curl -f http://localhost:8001/health || exit 1

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
```

- [ ] **Step 2: Build the image**

```bash
cd nlp-service
docker build -t nlp-service:prototype .
```

Expected: build completes, image ~350 MB

- [ ] **Step 3: Run and smoke-test the container**

```bash
docker run --rm -d -p 8001:8001 --name nlp-test nlp-service:prototype
sleep 5

curl -s http://localhost:8001/health
# Expected: {"status":"ok"}

curl -s -X POST http://localhost:8001/parse \
  -H "Content-Type: application/json" \
  -d '{"query": "Briefe aus dem Krieg", "lang": "de"}' | python3 -m json.tool

docker stop nlp-test
```

- [ ] **Step 4: Commit**

```bash
git add nlp-service/Dockerfile
git commit -m "feat(nlp-service): Dockerfile (python:3.11-slim, models baked in)"
```
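
The curl checks from the container smoke test could also be scripted. The sketch below is one possible stdlib-only version, assuming the container is reachable at `http://localhost:8001`. `check_contract` and `parse_query` are hypothetical helper names that are not part of the service, the field list is copied from `ParseResponse` in `models.py`, and the sample payload is written by hand rather than captured from a real run.

```python
from __future__ import annotations

import json
import urllib.request

# Field names copied from ParseResponse in models.py
CONTRACT_FIELDS = ("personNames", "personRole", "dateFrom", "dateTo", "keywords", "rawQuery")


def check_contract(data: dict) -> list[str]:
    """Return a list of contract violations (an empty list means the payload is OK)."""
    problems = [f"missing field: {name}" for name in CONTRACT_FIELDS if name not in data]
    if data.get("personRole") not in ("sender", "receiver", "any"):
        problems.append(f"unexpected personRole: {data.get('personRole')!r}")
    return problems


def parse_query(query: str, lang: str, base_url: str = "http://localhost:8001") -> dict:
    """POST a query to /parse and return the decoded JSON body."""
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


# Offline check against a hand-written payload (not real service output)
sample = {
    "personNames": [],
    "personRole": "any",
    "dateFrom": None,
    "dateTo": None,
    "keywords": ["krieg"],
    "rawQuery": "Briefe aus dem Krieg",
}
print(check_contract(sample))  # []

# With the container running, the same check applies to a live response:
#   result = parse_query("Briefe aus dem Krieg", "de")
#   print(check_contract(result) or "contract OK")
```

Keeping the validation separate from the HTTP call means the contract check can run in CI without a container, while the live call stays a manual step.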