diff --git a/docs/superpowers/plans/2026-06-07-spacy-nlp-service.md b/docs/superpowers/plans/2026-06-07-spacy-nlp-service.md new file mode 100644 index 00000000..fde4ac7e --- /dev/null +++ b/docs/superpowers/plans/2026-06-07-spacy-nlp-service.md @@ -0,0 +1,1257 @@ +# spaCy NLP Service Prototype — Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Build `nlp-service/` — a FastAPI service that parses free-text search queries into structured extractions (person names, role, dates, keywords) using spaCy, as a drop-in replacement for the current Ollama service. + +**Architecture:** Five-step pipeline (NER → role detection → date parsing → keyword extraction → assembly) in `extractor.py`. `main.py` exposes `/parse` and `/health` via FastAPI. Models baked into the Docker image at build time — no volume needed. 
+ +**Tech Stack:** Python 3.11, FastAPI 0.115, spaCy 3.8 (`de_core_news_sm` / `en_core_web_sm` / `es_core_news_sm`), dateparser 1.2, pytest + +--- + +## File Map + +| File | Responsibility | +|---|---| +| `nlp-service/models.py` | Pydantic request/response types — the extraction contract | +| `nlp-service/extractor.py` | NLP pipeline: model loading + 5 extraction steps | +| `nlp-service/main.py` | FastAPI app — `/parse`, `/health`, lifespan model loading | +| `nlp-service/requirements.txt` | Python dependencies | +| `nlp-service/Dockerfile` | Image — python:3.11-slim, models baked in, non-root user | +| `nlp-service/CLAUDE.md` | Service-level docs | +| `nlp-service/test_extractor.py` | Unit + integration tests for the pipeline | +| `nlp-service/test_main.py` | HTTP contract tests for the FastAPI endpoints | + +--- + +## Task 1: Scaffold — requirements.txt, CLAUDE.md, models.py + +**Files:** +- Create: `nlp-service/requirements.txt` +- Create: `nlp-service/CLAUDE.md` +- Create: `nlp-service/models.py` +- Create: `nlp-service/test_extractor.py` (skeleton only) + +- [ ] **Step 1: Create `nlp-service/requirements.txt`** + +``` +fastapi[standard]==0.115.6 +uvicorn[standard]==0.34.0 +spacy>=3.8,<4.0 +dateparser>=1.2,<2.0 +pytest>=8.0,<9.0 +httpx>=0.28,<1.0 +``` + +- [ ] **Step 2: Create `nlp-service/CLAUDE.md`** + +```markdown +# NLP Service + +Lightweight FastAPI service that parses free-text search queries into structured extractions, +replacing Ollama for the Familienarchiv NL search feature. 
+ +## Stack + +- Python 3.11, FastAPI 0.115, spaCy 3.8, dateparser 1.2 + +## Endpoints + +- `POST /parse` — parse a free-text query, return extraction matching `OllamaExtraction` contract +- `GET /health` — returns `{"status": "ok"}` when all models are loaded + +## Running locally + +\`\`\`bash +pip install -r requirements.txt +python -m spacy download de_core_news_sm +python -m spacy download en_core_web_sm +python -m spacy download es_core_news_sm +uvicorn main:app --reload --port 8001 + +curl -X POST http://localhost:8001/parse \ + -H "Content-Type: application/json" \ + -d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}' +\`\`\` + +## Testing + +\`\`\`bash +pytest -v +\`\`\` + +## Design spec + +See `docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md`. + +## Notes + +This is a **prototype** for extraction quality evaluation. No docker-compose integration or +Java-side changes in this iteration. The extraction contract matches `OllamaExtraction` in +`backend/src/main/java/org/raddatz/familienarchiv/search/`. 
+``` + +- [ ] **Step 3: Write the failing test for Pydantic models** + +Create `nlp-service/test_extractor.py`: + +```python +import pytest +from pydantic import ValidationError + + +# ── Models ────────────────────────────────────────────────────────────────── + +def test_parse_request_valid(): + from models import ParseRequest + req = ParseRequest(query="Briefe von Opa", lang="de") + assert req.query == "Briefe von Opa" + assert req.lang == "de" + + +def test_parse_request_rejects_unknown_lang(): + from models import ParseRequest + with pytest.raises(ValidationError): + ParseRequest(query="Letters from grandpa", lang="fr") + + +def test_parse_response_serializes_nulls(): + from models import ParseResponse + resp = ParseResponse( + personNames=["Opa"], + personRole="sender", + dateFrom=None, + dateTo="1920-12-31", + keywords=["brief"], + rawQuery="Briefe von Opa", + ) + data = resp.model_dump() + assert data["dateFrom"] is None + assert data["dateTo"] == "1920-12-31" + assert data["personRole"] == "sender" +``` + +- [ ] **Step 4: Run to confirm failure** + +```bash +cd nlp-service +pip install -r requirements.txt +pytest test_extractor.py::test_parse_request_valid -v +``` + +Expected: `ModuleNotFoundError: No module named 'models'` + +- [ ] **Step 5: Create `nlp-service/models.py`** + +```python +from __future__ import annotations +from typing import Literal +from pydantic import BaseModel + + +class ParseRequest(BaseModel): + query: str + lang: Literal["de", "en", "es"] + + +class ParseResponse(BaseModel): + personNames: list[str] + personRole: Literal["sender", "receiver", "any"] + dateFrom: str | None + dateTo: str | None + keywords: list[str] + rawQuery: str +``` + +- [ ] **Step 6: Run tests to confirm they pass** + +```bash +pytest test_extractor.py::test_parse_request_valid \ + test_extractor.py::test_parse_request_rejects_unknown_lang \ + test_extractor.py::test_parse_response_serializes_nulls -v +``` + +Expected: `3 passed` + +- [ ] **Step 7: Commit** + 
+```bash +git add nlp-service/ +git commit -m "feat(nlp-service): scaffold — models, requirements, CLAUDE.md" +``` + +--- + +## Task 2: spaCy model loading + +**Files:** +- Create: `nlp-service/extractor.py` +- Modify: `nlp-service/test_extractor.py` + +Before running these tests, install the three spaCy models, one `spacy download` call per model: + +```bash +python -m spacy download de_core_news_sm +python -m spacy download en_core_web_sm +python -m spacy download es_core_news_sm +``` + +- [ ] **Step 1: Write the failing tests** + +Append to `nlp-service/test_extractor.py`: + +```python +# ── Model loading ──────────────────────────────────────────────────────────── + +import pytest + + +@pytest.fixture(scope="session") +def nlp_de(): + from extractor import get_nlp + return get_nlp("de") + + +@pytest.fixture(scope="session") +def nlp_en(): + from extractor import get_nlp + return get_nlp("en") + + +@pytest.fixture(scope="session") +def nlp_es(): + from extractor import get_nlp + return get_nlp("es") + + +def test_get_nlp_de_loads(nlp_de): + doc = nlp_de("Test") + assert doc is not None + + +def test_get_nlp_en_loads(nlp_en): + doc = nlp_en("Test") + assert doc is not None + + +def test_get_nlp_es_loads(nlp_es): + doc = nlp_es("Prueba") + assert doc is not None + + +def test_get_nlp_unknown_lang_raises(): + from extractor import get_nlp + with pytest.raises(ValueError, match="Unsupported language"): + get_nlp("fr") +``` + +- [ ] **Step 2: Run to confirm failure** + +```bash +pytest test_extractor.py::test_get_nlp_de_loads -v +``` + +Expected: `ModuleNotFoundError: No module named 'extractor'` + +- [ ] **Step 3: Create `nlp-service/extractor.py` with model loading** + +```python +from __future__ import annotations + +import re +from datetime import date + +import dateparser +import spacy +from spacy.language import Language + +from models import ParseResponse + +# ── Language model registry ────────────────────────────────────────────────── + +_MODEL_NAMES: dict[str, str] = { + "de": "de_core_news_sm", + "en": "en_core_web_sm", + "es": 
"es_core_news_sm", +} + +_nlp_cache: dict[str, Language] = {} + + +def get_nlp(lang: str) -> Language: + if lang not in _MODEL_NAMES: + raise ValueError(f"Unsupported language: {lang!r}. Valid: {list(_MODEL_NAMES)}") + if lang not in _nlp_cache: + _nlp_cache[lang] = spacy.load(_MODEL_NAMES[lang]) + return _nlp_cache[lang] + + +def load_all_models() -> None: + for lang in _MODEL_NAMES: + get_nlp(lang) +``` + +- [ ] **Step 4: Run tests to confirm they pass** + +```bash +pytest test_extractor.py::test_get_nlp_de_loads \ + test_extractor.py::test_get_nlp_en_loads \ + test_extractor.py::test_get_nlp_es_loads \ + test_extractor.py::test_get_nlp_unknown_lang_raises -v +``` + +Expected: `4 passed` + +- [ ] **Step 5: Commit** + +```bash +git add nlp-service/extractor.py nlp-service/test_extractor.py +git commit -m "feat(nlp-service): spaCy model loading with get_nlp/load_all_models" +``` + +--- + +## Task 3: Person name extraction (NER) + +**Files:** +- Modify: `nlp-service/extractor.py` +- Modify: `nlp-service/test_extractor.py` + +- [ ] **Step 1: Write the failing tests** + +Append to `nlp-service/test_extractor.py`: + +```python +# ── Person name extraction ─────────────────────────────────────────────────── + +def _make_doc_with_ents(nlp, text: str, char_ents: list[tuple[int, int, str]]): + """Create a Doc with manually injected entity spans (no NER model needed).""" + doc = nlp.make_doc(text) + spans = [doc.char_span(s, e, label=lbl) for s, e, lbl in char_ents] + doc.ents = [sp for sp in spans if sp is not None] + return doc + + +def test_extract_person_names_two_persons(nlp_de): + from extractor import extract_person_names + # "Briefe von Opa Hermann an Marie" + # 0123456789012345678901234567890 + # 1111111111222222222233 + # "Opa Hermann" = 11..22, "Marie" = 26..31 + doc = _make_doc_with_ents(nlp_de, "Briefe von Opa Hermann an Marie", [ + (11, 22, "PER"), + (26, 31, "PER"), + ]) + assert extract_person_names(doc) == ["Opa Hermann", "Marie"] + + +def 
test_extract_person_names_preserves_order(nlp_de): + from extractor import extract_person_names + # Reversed: "Marie von Opa" — Marie comes first in text + # "Marie" = 0..5, "Opa" = 10..13 + doc = _make_doc_with_ents(nlp_de, "Marie von Opa", [ + (0, 5, "PER"), + (10, 13, "PER"), + ]) + assert extract_person_names(doc) == ["Marie", "Opa"] + + +def test_extract_person_names_empty(nlp_de): + from extractor import extract_person_names + doc = _make_doc_with_ents(nlp_de, "Briefe aus dem Krieg", []) + assert extract_person_names(doc) == [] + + +def test_extract_person_names_ignores_non_per(nlp_de): + from extractor import extract_person_names + # DATE entity should not appear in personNames + doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [(7, 11, "DATE")]) + assert extract_person_names(doc) == [] +``` + +- [ ] **Step 2: Run to confirm failure** + +```bash +pytest test_extractor.py::test_extract_person_names_two_persons -v +``` + +Expected: `ImportError: cannot import name 'extract_person_names' from 'extractor'` + +- [ ] **Step 3: Add `extract_person_names` to `extractor.py`** + +Add after the model loading section: + +```python +# ── Step 1: Person name extraction ────────────────────────────────────────── + +# The German and Spanish models label people "PER"; the English model uses "PERSON". +_PERSON_LABELS = frozenset({"PER", "PERSON"}) + + +def extract_person_names(doc) -> list[str]: + """Return person entity texts in left-to-right span order.""" + return [ent.text for ent in doc.ents if ent.label_ in _PERSON_LABELS] +``` + +- [ ] **Step 4: Run tests to confirm they pass** + +```bash +pytest test_extractor.py::test_extract_person_names_two_persons \ + test_extractor.py::test_extract_person_names_preserves_order \ + test_extractor.py::test_extract_person_names_empty \ + test_extractor.py::test_extract_person_names_ignores_non_per -v +``` + +Expected: `4 passed` + +- [ ] **Step 5: Commit** + +```bash +git add nlp-service/extractor.py nlp-service/test_extractor.py +git commit -m "feat(nlp-service): NER person name extraction" +``` + +--- + +## Task 4: Role detection + +**Files:** +- Modify: 
`nlp-service/extractor.py` +- Modify: `nlp-service/test_extractor.py` + +Role is only meaningful when exactly one PER entity is found. The function checks: +1. Dependency-tree children of the PER span's root with `dep_` in `("case", "prep", "mo")` +2. Fallback: the token immediately before the span + +- [ ] **Step 1: Write the failing tests** + +Append to `nlp-service/test_extractor.py`: + +```python +# ── Role detection ─────────────────────────────────────────────────────────── + +def test_role_sender_von(nlp_de): + from extractor import detect_person_role + # "Briefe von Marie" — "von" immediately before "Marie" + # B=0..6, ' '=6, v=7..10, ' '=10, M=11..16 + doc = _make_doc_with_ents(nlp_de, "Briefe von Marie", [(11, 16, "PER")]) + per_spans = list(doc.ents) + assert detect_person_role(doc, per_spans, "de") == "sender" + + +def test_role_receiver_an(nlp_de): + from extractor import detect_person_role + # "Briefe an Marie" — "an" immediately before "Marie" + # B=0..6, ' '=6, a=7..9, ' '=9, M=10..15 + doc = _make_doc_with_ents(nlp_de, "Briefe an Marie", [(10, 15, "PER")]) + per_spans = list(doc.ents) + assert detect_person_role(doc, per_spans, "de") == "receiver" + + +def test_role_two_persons_returns_any(nlp_de): + from extractor import detect_person_role + # "von Opa an Marie" — two PER spans → always "any" + # v=0..3, ' '=3, O=4..7, ' '=7, a=8..10, ' '=10, M=11..16 + doc = _make_doc_with_ents(nlp_de, "von Opa an Marie", [ + (4, 7, "PER"), + (11, 16, "PER"), + ]) + per_spans = list(doc.ents) + assert detect_person_role(doc, per_spans, "de") == "any" + + +def test_role_no_prep_returns_any(nlp_de): + from extractor import detect_person_role + # "Briefe Marie" — no preposition + # B=0..6, ' '=6, M=7..12 + doc = _make_doc_with_ents(nlp_de, "Briefe Marie", [(7, 12, "PER")]) + per_spans = list(doc.ents) + assert detect_person_role(doc, per_spans, "de") == "any" + + +def test_role_empty_returns_any(nlp_de): + from extractor import detect_person_role + doc = 
_make_doc_with_ents(nlp_de, "Briefe 1920", []) + assert detect_person_role(doc, [], "de") == "any" + + +def test_role_sender_from_english(nlp_en): + from extractor import detect_person_role + # "letters from Marie" — "from" before "Marie" + # l=0..7, ' '=7, f=8..12, ' '=12, M=13..18 + doc = _make_doc_with_ents(nlp_en, "letters from Marie", [(13, 18, "PER")]) + per_spans = list(doc.ents) + assert detect_person_role(doc, per_spans, "en") == "sender" + + +def test_role_receiver_to_english(nlp_en): + from extractor import detect_person_role + # "letters to Marie" + # l=0..7, ' '=7, t=8..10, ' '=10, M=11..16 + doc = _make_doc_with_ents(nlp_en, "letters to Marie", [(11, 16, "PER")]) + per_spans = list(doc.ents) + assert detect_person_role(doc, per_spans, "en") == "receiver" +``` + +- [ ] **Step 2: Run to confirm failure** + +```bash +pytest test_extractor.py::test_role_sender_von -v +``` + +Expected: `ImportError: cannot import name 'detect_person_role' from 'extractor'` + +- [ ] **Step 3: Add role detection constants and function to `extractor.py`** + +Add after `extract_person_names`: + +```python +# ── Step 2: Role detection ─────────────────────────────────────────────────── + +_SENDER_PREPS: dict[str, frozenset[str]] = { + "de": frozenset({"von", "vom"}), + "en": frozenset({"from", "by"}), + "es": frozenset({"de", "por"}), +} + +_RECEIVER_PREPS: dict[str, frozenset[str]] = { + "de": frozenset({"an", "nach", "für"}), + "en": frozenset({"to", "for"}), + "es": frozenset({"para", "a"}), +} + + +def detect_person_role(doc, per_spans: list, lang: str) -> str: + """Return 'sender', 'receiver', or 'any'. + + Only meaningful for single-PER queries — two-person queries always return + 'any' because Java derives direction from list position. 
+ """ + if len(per_spans) != 1: + return "any" + + span = per_spans[0] + root = span.root + sender = _SENDER_PREPS[lang] + receiver = _RECEIVER_PREPS[lang] + + # Primary: dependency-tree children of the PER root + for child in root.children: + if child.dep_ in ("case", "prep", "mo"): + if child.lower_ in sender: + return "sender" + if child.lower_ in receiver: + return "receiver" + + # Fallback: token immediately before the span start + if span.start > 0: + prev = doc[span.start - 1] + if prev.lower_ in sender: + return "sender" + if prev.lower_ in receiver: + return "receiver" + + return "any" +``` + +- [ ] **Step 4: Run tests to confirm they pass** + +```bash +pytest test_extractor.py::test_role_sender_von \ + test_extractor.py::test_role_receiver_an \ + test_extractor.py::test_role_two_persons_returns_any \ + test_extractor.py::test_role_no_prep_returns_any \ + test_extractor.py::test_role_empty_returns_any \ + test_extractor.py::test_role_sender_from_english \ + test_extractor.py::test_role_receiver_to_english -v +``` + +Expected: `7 passed` + +- [ ] **Step 5: Commit** + +```bash +git add nlp-service/extractor.py nlp-service/test_extractor.py +git commit -m "feat(nlp-service): role detection (sender/receiver/any)" +``` + +--- + +## Task 5: Date parsing + +**Files:** +- Modify: `nlp-service/extractor.py` +- Modify: `nlp-service/test_extractor.py` + +Direction is detected from the token immediately before each DATE span. For "zwischen/between/entre", both DATE spans form the range (sorted so earlier = `dateFrom`). A bare year with no direction token produces a closed year-range (`dateFrom` = Jan 1, `dateTo` = Dec 31). + +Note: "nach" appears in both `_RECEIVER_PREPS["de"]` and the date-after set. This is safe — role detection only examines tokens before PER spans; date parsing only examines tokens before DATE spans. They operate on different span types. 
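
The direction rule described above can be tried without spaCy. The sketch below is a simplified, hypothetical stand-in: it uses whitespace tokens and a four-digit-year regex instead of DATE spans, it omits the between-range case, and `year_bounds`, `BEFORE` and `AFTER` are illustrative names that are not part of `extractor.py`.

```python
from __future__ import annotations

import re
from datetime import date

# Hypothetical names for illustration only; extractor.py defines its own tables.
YEAR_RE = re.compile(r"^\d{4}$")
BEFORE = {"de": {"vor"}, "en": {"before"}, "es": {"antes"}}
AFTER = {"de": {"nach"}, "en": {"after"}, "es": {"después", "despues"}}


def year_bounds(query: str, lang: str) -> tuple[str | None, str | None]:
    """Map the first four-digit year in the query to (dateFrom, dateTo)."""
    tokens = query.lower().split()
    for i, tok in enumerate(tokens):
        if not YEAR_RE.match(tok):
            continue
        year = int(tok)
        start, end = date(year, 1, 1), date(year, 12, 31)
        # Only the token directly before the year decides the direction.
        prev = tokens[i - 1] if i > 0 else ""
        if prev in BEFORE[lang]:
            return None, end.isoformat()
        if prev in AFTER[lang]:
            return start.isoformat(), None
        # Bare year: closed range covering the whole year.
        return start.isoformat(), end.isoformat()
    return None, None
```

In this sketch, `year_bounds("Briefe vor 1920", "de")` yields `(None, "1920-12-31")`. For Spanish input such as "cartas antes de 1920", the token directly before the year is "de", so a strict previous-token rule sees no direction word and returns the closed range for 1920; the direction lookup may need to skip a linking "de" for that language.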
+ +- [ ] **Step 1: Write the failing tests** + +Append to `nlp-service/test_extractor.py`: + +```python +# ── Date parsing ───────────────────────────────────────────────────────────── + +def test_date_vor_1920(nlp_de): + from extractor import extract_dates + # "Briefe vor 1920" — "1920" at chars 11..15 + doc = _make_doc_with_ents(nlp_de, "Briefe vor 1920", [(11, 15, "DATE")]) + date_from, date_to = extract_dates(doc, "de") + assert date_from is None + assert date_to == "1920-12-31" + + +def test_date_nach_1900(nlp_de): + from extractor import extract_dates + # "Briefe nach 1900" — "1900" at chars 12..16 + doc = _make_doc_with_ents(nlp_de, "Briefe nach 1900", [(12, 16, "DATE")]) + date_from, date_to = extract_dates(doc, "de") + assert date_from == "1900-01-01" + assert date_to is None + + +def test_date_zwischen_1900_und_1920(nlp_de): + from extractor import extract_dates + # "zwischen 1900 und 1920" + # z=0..8, ' '=8, 1900=9..13, ' '=13, u=14..17, ' '=17, 1920=18..22 + doc = _make_doc_with_ents(nlp_de, "zwischen 1900 und 1920", [ + (9, 13, "DATE"), + (18, 22, "DATE"), + ]) + date_from, date_to = extract_dates(doc, "de") + assert date_from == "1900-01-01" + assert date_to == "1920-12-31" + + +def test_date_bare_year_makes_range(nlp_de): + from extractor import extract_dates + # "Briefe 1920" — no direction token → year-range + # B=0..6, ' '=6, 1920=7..11 + doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [(7, 11, "DATE")]) + date_from, date_to = extract_dates(doc, "de") + assert date_from == "1920-01-01" + assert date_to == "1920-12-31" + + +def test_date_no_date_entity(nlp_de): + from extractor import extract_dates + doc = _make_doc_with_ents(nlp_de, "Briefe von Opa", []) + date_from, date_to = extract_dates(doc, "de") + assert date_from is None + assert date_to is None + + +def test_date_before_english(nlp_en): + from extractor import extract_dates + # "letters before 1920" — "1920" at chars 15..19 + doc = _make_doc_with_ents(nlp_en, "letters before 1920", [(15, 
19, "DATE")]) + date_from, date_to = extract_dates(doc, "en") + assert date_from is None + assert date_to == "1920-12-31" + + +def test_date_after_english(nlp_en): + from extractor import extract_dates + # "letters after 1900" — "1900" at chars 14..18 + doc = _make_doc_with_ents(nlp_en, "letters after 1900", [(14, 18, "DATE")]) + date_from, date_to = extract_dates(doc, "en") + assert date_from == "1900-01-01" + assert date_to is None +``` + +- [ ] **Step 2: Run to confirm failure** + +```bash +pytest test_extractor.py::test_date_vor_1920 -v +``` + +Expected: `ImportError: cannot import name 'extract_dates' from 'extractor'` + +- [ ] **Step 3: Add date parsing to `extractor.py`** + +Add after `detect_person_role`: + +```python +# ── Step 3: Date parsing ───────────────────────────────────────────────────── + +_YEAR_RE = re.compile(r"^\d{4}$") + +_DATE_BEFORE: dict[str, frozenset[str]] = { + "de": frozenset({"vor"}), + "en": frozenset({"before"}), + "es": frozenset({"antes"}), +} + +_DATE_AFTER: dict[str, frozenset[str]] = { + "de": frozenset({"nach"}), + "en": frozenset({"after"}), + "es": frozenset({"después", "despues"}), +} + +_DATE_BETWEEN: dict[str, frozenset[str]] = { + "de": frozenset({"zwischen"}), + "en": frozenset({"between"}), + "es": frozenset({"entre"}), +} + + +def _parse_date_text(text: str, lang: str) -> date | None: + text = text.strip() + if _YEAR_RE.match(text): + year = int(text) + if 1000 < year < 3000: + return date(year, 1, 1) + parsed = dateparser.parse( + text, + languages=[lang], + settings={"PREFER_DAY_OF_MONTH": "first", "RETURN_AS_TIMEZONE_AWARE": False}, + ) + return parsed.date() if parsed else None + + +def _year_end(d: date) -> date: + """If d is Jan 1, return Dec 31 of the same year (year-only boundary).""" + if d.month == 1 and d.day == 1: + return date(d.year, 12, 31) + return d + + +def extract_dates(doc, lang: str) -> tuple[str | None, str | None]: + """Return (date_from, date_to) as ISO strings or None.""" + date_spans = [ent 
for ent in doc.ents if ent.label_ == "DATE"] + if not date_spans: + # The stock de/es models ship without a DATE label, so fall back to + # bare four-digit-year tokens wrapped as one-token spans. + date_spans = [doc[tok.i : tok.i + 1] for tok in doc if _YEAR_RE.match(tok.text)] + if not date_spans: + return None, None + + between_tokens = _DATE_BETWEEN[lang] + before_tokens = _DATE_BEFORE[lang] + after_tokens = _DATE_AFTER[lang] + + # "zwischen X und Y" / "between X and Y" — two DATE spans form a range + has_between = any(tok.lower_ in between_tokens for tok in doc) + if has_between and len(date_spans) >= 2: + parsed = [] + for span in date_spans[:2]: + d = _parse_date_text(span.text, lang) + if d: + parsed.append(d) + if len(parsed) == 2: + parsed.sort() + return parsed[0].isoformat(), _year_end(parsed[1]).isoformat() + + # Single DATE span — use direction token + span = date_spans[0] + d = _parse_date_text(span.text, lang) + if not d: + return None, None + + # Spanish "antes de 1920": skip the linking "de" to reach the direction word + prev_i = span.start - 1 + if lang == "es" and prev_i > 0 and doc[prev_i].lower_ == "de": + prev_i -= 1 + prev_lower = doc[prev_i].lower_ if prev_i >= 0 else "" + + if prev_lower in before_tokens: + return None, _year_end(d).isoformat() + if prev_lower in after_tokens: + return d.isoformat(), None + # Bare year/date — closed year-range + return d.isoformat(), _year_end(d).isoformat() +``` + +- [ ] **Step 4: Run tests to confirm they pass** + +```bash +pytest test_extractor.py::test_date_vor_1920 \ + test_extractor.py::test_date_nach_1900 \ + test_extractor.py::test_date_zwischen_1900_und_1920 \ + test_extractor.py::test_date_bare_year_makes_range \ + test_extractor.py::test_date_no_date_entity \ + test_extractor.py::test_date_before_english \ + test_extractor.py::test_date_after_english -v +``` + +Expected: `7 passed` + +- [ ] **Step 5: Commit** + +```bash +git add nlp-service/extractor.py nlp-service/test_extractor.py +git commit -m "feat(nlp-service): date range extraction with direction detection" +``` + +--- + +## Task 6: Keyword extraction + +**Files:** +- Modify: `nlp-service/extractor.py` +- Modify: `nlp-service/test_extractor.py` + +Keywords are POS-filtered content words (NOUN or PROPN, non-stop, length ≥ 3, not inside any NER span). 
These are passed to Java's `resolveTags()`, which fuzzy-matches them against the tag table — no tag lookup in Python. + +- [ ] **Step 1: Write the failing tests** + +Append to `nlp-service/test_extractor.py`: + +```python +# ── Keyword extraction ─────────────────────────────────────────────────────── + +def test_keywords_extracts_nouns(nlp_de): + from extractor import extract_keywords + # Use real NLP for POS tags; disable NER to control entities manually + doc = nlp_de("Briefe aus dem Krieg", disable=["ner"]) + keywords = extract_keywords(doc, []) + # "Brief" (NOUN, lemma "Brief") and "Krieg" (NOUN) should appear + assert "brief" in keywords + assert "krieg" in keywords + + +def test_keywords_excludes_stopwords(nlp_de): + from extractor import extract_keywords + doc = nlp_de("Briefe aus dem Krieg", disable=["ner"]) + keywords = extract_keywords(doc, []) + # "dem" is a stopword article (DET) — must not appear + assert "dem" not in keywords + + +def test_keywords_excludes_per_ner_spans(nlp_de): + from extractor import extract_keywords + # Run full NLP so POS tagger fires, then inject PER span over "Hermann" + doc = nlp_de("Briefe von Hermann") + per_span = doc.char_span(11, 18, label="PER") # "Hermann" = 11..18 + if per_span is not None: + doc.ents = [per_span] + keywords = extract_keywords(doc, list(doc.ents)) + assert "hermann" not in keywords + + +def test_keywords_excludes_short_lemmas(nlp_de): + from extractor import extract_keywords + # "an" is two letters (below the length-3 cutoff); "ihn" has three letters + # but is a pronoun and stop word. Neither may appear. + doc = nlp_de("Briefe an ihn", disable=["ner"]) + keywords = extract_keywords(doc, []) + assert "an" not in keywords + assert "ihn" not in keywords + + +def test_keywords_deduplicates(nlp_de): + from extractor import extract_keywords + doc = nlp_de("Brief Brief Krieg", disable=["ner"]) + keywords = extract_keywords(doc, []) + assert keywords.count("brief") == 1 +``` + +- [ ] **Step 2: Run to confirm failure** + +```bash +pytest test_extractor.py::test_keywords_extracts_nouns -v +``` + +Expected: 
`ImportError: cannot import name 'extract_keywords' from 'extractor'` + +- [ ] **Step 3: Add keyword extraction to `extractor.py`** + +Add after `extract_dates`: + +```python +# ── Step 4: Keyword extraction ─────────────────────────────────────────────── + +def extract_keywords(doc, excluded_spans: list) -> list[str]: + """Return lowercased lemmas of content words not inside any NER span.""" + excluded_indices: set[int] = set() + for span in excluded_spans: + excluded_indices.update(range(span.start, span.end)) + + seen: set[str] = set() + keywords: list[str] = [] + for token in doc: + if token.i in excluded_indices: + continue + if token.pos_ not in ("NOUN", "PROPN"): + continue + if token.is_stop: + continue + lemma = token.lemma_.lower() + if len(lemma) < 3: + continue + if lemma not in seen: + seen.add(lemma) + keywords.append(lemma) + + return keywords +``` + +- [ ] **Step 4: Run tests to confirm they pass** + +```bash +pytest test_extractor.py::test_keywords_extracts_nouns \ + test_extractor.py::test_keywords_excludes_stopwords \ + test_extractor.py::test_keywords_excludes_per_ner_spans \ + test_extractor.py::test_keywords_excludes_short_lemmas \ + test_extractor.py::test_keywords_deduplicates -v +``` + +Expected: `5 passed` + +- [ ] **Step 5: Commit** + +```bash +git add nlp-service/extractor.py nlp-service/test_extractor.py +git commit -m "feat(nlp-service): keyword extraction (POS-filtered, deduped lemmas)" +``` + +--- + +## Task 7: Full `extract()` function + +**Files:** +- Modify: `nlp-service/extractor.py` +- Modify: `nlp-service/test_extractor.py` + +This assembles all steps. Tests here use **real NLP** (no synthetic docs) to validate actual extraction quality. 
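
Because these tests run the real models, it may be worth confirming first which entity labels each pipeline can emit. As far as the published model descriptions indicate, the stock `de_core_news_sm` and `es_core_news_sm` pipelines expose LOC, MISC, ORG and PER, while `en_core_web_sm` uses `PERSON` and also has `DATE`; treat that as an assumption to verify locally. The helper below is a hypothetical sketch (`label_coverage` and `PERSON_LABELS` are illustrative names) that summarizes a label tuple:

```python
from __future__ import annotations

# Hypothetical helper for illustration; not part of extractor.py.
PERSON_LABELS = frozenset({"PER", "PERSON"})


def label_coverage(ner_labels: tuple[str, ...]) -> dict[str, bool]:
    """Report whether a label set can yield person and date entity spans."""
    labels = set(ner_labels)
    return {
        "person": bool(labels & PERSON_LABELS),
        "date": "DATE" in labels,
    }


# With the models installed, the labels could be read from the NER component:
#   from extractor import get_nlp
#   label_coverage(get_nlp("de").get_pipe("ner").labels)
# The tuple below is an assumed example of a label set without DATE.
print(label_coverage(("LOC", "MISC", "ORG", "PER")))
```

A model that reports `"date": False` cannot produce DATE spans on its own, so date assertions for that language would depend on a fallback outside the NER component.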
+ +- [ ] **Step 1: Write the failing tests** + +Append to `nlp-service/test_extractor.py`: + +```python +# ── Full extract() pipeline ────────────────────────────────────────────────── + +def test_extract_dates_de(): + from extractor import extract + result = extract("Briefe vor 1920", "de") + assert result.dateFrom is None + assert result.dateTo == "1920-12-31" + assert result.rawQuery == "Briefe vor 1920" + assert result.personNames == [] + assert result.personRole == "any" + + +def test_extract_keywords_from_topic_de(): + from extractor import extract + result = extract("Briefe aus dem Krieg", "de") + assert "krieg" in result.keywords + assert result.dateFrom is None + assert result.dateTo is None + + +def test_extract_dates_en(): + from extractor import extract + result = extract("letters before 1920", "en") + assert result.dateTo == "1920-12-31" + assert result.dateFrom is None + + +def test_extract_dates_es(): + from extractor import extract + result = extract("cartas antes de 1920", "es") + assert result.dateTo == "1920-12-31" + assert result.dateFrom is None + + +def test_extract_rawquery_echoed(): + from extractor import extract + q = "Texte über Weihnachten" + result = extract(q, "de") + assert result.rawQuery == q + + +def test_extract_response_fields_are_complete(): + from extractor import extract + result = extract("Briefe 1900", "de") + assert isinstance(result.personNames, list) + assert result.personRole in ("sender", "receiver", "any") + assert isinstance(result.keywords, list) + assert result.rawQuery == "Briefe 1900" +``` + +- [ ] **Step 2: Run to confirm failure** + +```bash +pytest test_extractor.py::test_extract_dates_de -v +``` + +Expected: `ImportError: cannot import name 'extract' from 'extractor'` + +- [ ] **Step 3: Add `extract()` to `extractor.py`** + +Add at the bottom of `extractor.py`: + +```python +# ── Step 5: Assembly ───────────────────────────────────────────────────────── + +def extract(query: str, lang: str) -> ParseResponse: + """Run the full NLP pipeline and return a ParseResponse.""" + nlp = get_nlp(lang) + doc = nlp(query) + + # German/Spanish models emit "PER"; the English model emits "PERSON". + per_spans = [ent for ent in doc.ents if ent.label_ in ("PER", "PERSON")] + + person_names = extract_person_names(doc) + person_role = detect_person_role(doc, per_spans, lang) + date_from, date_to = extract_dates(doc, lang) + keywords = extract_keywords(doc, list(doc.ents)) + + return ParseResponse( + personNames=person_names, + personRole=person_role, + dateFrom=date_from, + dateTo=date_to, + keywords=keywords, + rawQuery=query, + ) +``` + +- [ ] **Step 4: Run tests to confirm they pass** + +```bash +pytest test_extractor.py::test_extract_dates_de \ + test_extractor.py::test_extract_keywords_from_topic_de \ + test_extractor.py::test_extract_dates_en \ + test_extractor.py::test_extract_dates_es \ + test_extractor.py::test_extract_rawquery_echoed \ + test_extractor.py::test_extract_response_fields_are_complete -v +``` + +Expected: `6 passed` + +- [ ] **Step 5: Run the full test suite to confirm no regressions** + +```bash +pytest test_extractor.py -v +``` + +Expected: all tests pass + +- [ ] **Step 6: Commit** + +```bash +git add nlp-service/extractor.py nlp-service/test_extractor.py +git commit -m "feat(nlp-service): full extract() pipeline — assembles all steps" +``` + +--- + +## Task 8: FastAPI app + +**Files:** +- Create: `nlp-service/main.py` +- Create: `nlp-service/test_main.py` + +- [ ] **Step 1: Write the failing tests** + +Create `nlp-service/test_main.py`: + +```python +import pytest +from fastapi.testclient import TestClient + + +@pytest.fixture(scope="session") +def client(): + from main import app + with TestClient(app) as c: + yield c + + +def test_health(client): + response = client.get("/health") + assert response.status_code == 200 + assert response.json() == {"status": "ok"} + + +def test_parse_returns_200_with_all_fields(client): + response = client.post("/parse", json={"query": "Briefe vor 1920", "lang": "de"}) + assert response.status_code == 200 + data = 
response.json() + assert "personNames" in data + assert "personRole" in data + assert data["personRole"] in ("sender", "receiver", "any") + assert "dateFrom" in data + assert "dateTo" in data + assert "keywords" in data + assert "rawQuery" in data + assert data["rawQuery"] == "Briefe vor 1920" + assert data["dateTo"] == "1920-12-31" + + +def test_parse_unknown_lang_returns_422(client): + response = client.post("/parse", json={"query": "test", "lang": "fr"}) + assert response.status_code == 422 + + +def test_parse_missing_query_returns_422(client): + response = client.post("/parse", json={"lang": "de"}) + assert response.status_code == 422 + + +def test_parse_all_languages(client): + cases = [ + ("de", "Briefe vor 1920"), + ("en", "letters before 1920"), + ("es", "cartas antes de 1920"), + ] + for lang, query in cases: + response = client.post("/parse", json={"query": query, "lang": lang}) + assert response.status_code == 200, f"Failed for lang={lang}" + assert response.json()["dateTo"] == "1920-12-31", f"Wrong dateTo for lang={lang}" +``` + +- [ ] **Step 2: Run to confirm failure** + +```bash +pytest test_main.py::test_health -v +``` + +Expected: `ModuleNotFoundError: No module named 'main'` + +- [ ] **Step 3: Create `nlp-service/main.py`** + +```python +import logging +from contextlib import asynccontextmanager + +from fastapi import FastAPI, HTTPException + +from extractor import extract, load_all_models +from models import ParseRequest, ParseResponse + +logger = logging.getLogger(__name__) + + +@asynccontextmanager +async def lifespan(app: FastAPI): + logger.info("Loading spaCy models...") + load_all_models() + logger.info("All models ready.") + yield + + +app = FastAPI(lifespan=lifespan) + + +@app.get("/health") +def health() -> dict: + return {"status": "ok"} + + +@app.post("/parse", response_model=ParseResponse) +def parse(request: ParseRequest) -> ParseResponse: + try: + return extract(request.query, request.lang) + except Exception as exc: + 
logger.exception("Extraction pipeline failed") + raise HTTPException(status_code=500, detail=str(exc)) from exc +``` + +- [ ] **Step 4: Run tests to confirm they pass** + +```bash +pytest test_main.py -v +``` + +Expected: `5 passed` + +- [ ] **Step 5: Run the full test suite** + +```bash +pytest -v +``` + +Expected: all tests pass + +- [ ] **Step 6: Smoke-test the running service** + +```bash +uvicorn main:app --reload --port 8001 & +sleep 2 + +curl -s http://localhost:8001/health +# Expected: {"status":"ok"} + +curl -s -X POST http://localhost:8001/parse \ + -H "Content-Type: application/json" \ + -d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}' | python3 -m json.tool + +# Expected (spaCy may or may not catch "Opa Hermann"/"Marie" as PER): +# { +# "personNames": [...], +# "personRole": "any", +# "dateFrom": null, +# "dateTo": "1920-12-31", +# "keywords": ["brief"], +# "rawQuery": "Briefe von Opa Hermann an Marie vor 1920" +# } + +kill %1 +``` + +- [ ] **Step 7: Commit** + +```bash +git add nlp-service/main.py nlp-service/test_main.py +git commit -m "feat(nlp-service): FastAPI app with /parse and /health endpoints" +``` + +--- + +## Task 9: Dockerfile + +**Files:** +- Create: `nlp-service/Dockerfile` + +- [ ] **Step 1: Create `nlp-service/Dockerfile`** + +```dockerfile +FROM python:3.11-slim + +WORKDIR /app + +RUN apt-get update && apt-get install -y --no-install-recommends \ + curl \ + && rm -rf /var/lib/apt/lists/* + +COPY requirements.txt . +RUN pip install --no-cache-dir -r requirements.txt + +# Bake models into the image — no volume needed, ~350 MB total +RUN python -m spacy download de_core_news_sm \ + && python -m spacy download en_core_web_sm \ + && python -m spacy download es_core_news_sm + +COPY . . 
+ +RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1001 nlp \ + && chown -R nlp:nlp /app + +USER nlp + +EXPOSE 8001 + +HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \ + CMD curl -f http://localhost:8001/health || exit 1 + +CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"] +``` + +- [ ] **Step 2: Build the image** + +```bash +cd nlp-service +docker build -t nlp-service:prototype . +``` + +Expected: build completes, image ~350 MB + +- [ ] **Step 3: Run and smoke-test the container** + +```bash +docker run --rm -d -p 8001:8001 --name nlp-test nlp-service:prototype +sleep 5 + +curl -s http://localhost:8001/health +# Expected: {"status":"ok"} + +curl -s -X POST http://localhost:8001/parse \ + -H "Content-Type: application/json" \ + -d '{"query": "Briefe aus dem Krieg", "lang": "de"}' | python3 -m json.tool + +docker stop nlp-test +``` + +- [ ] **Step 4: Commit** + +```bash +git add nlp-service/Dockerfile +git commit -m "feat(nlp-service): Dockerfile — python:3.11-slim, models baked in" +```
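
The container smoke test in Task 9, Step 3 could also be scripted from Python instead of curl. This is a minimal standard-library sketch; `build_parse_request`, `parse_query` and `BASE_URL` are hypothetical names, and the URL assumes the port mapping from the `docker run` command in that step.

```python
from __future__ import annotations

import json
import urllib.request

# Assumed base URL: the port published by `docker run -p 8001:8001`.
BASE_URL = "http://localhost:8001"


def build_parse_request(
    query: str, lang: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the same POST /parse request that the curl smoke test sends."""
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_query(query: str, lang: str, base_url: str = BASE_URL) -> dict:
    """Send the request and decode the JSON body (needs a running service)."""
    request = build_parse_request(query, lang, base_url)
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)
```

With the container running, `parse_query("Briefe aus dem Krieg", "de")` should return a dict with the six `ParseResponse` fields; the request builder is kept separate so it can be checked without a live service.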