# spaCy NLP Service Prototype — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Build `nlp-service/` — a FastAPI service that parses free-text search queries into structured extractions (person names, role, dates, keywords) using spaCy, as a drop-in replacement for the current Ollama service.

**Architecture:** Five-step pipeline (NER → role detection → date parsing → keyword extraction → assembly) in `extractor.py`. `main.py` exposes `/parse` and `/health` via FastAPI. Models baked into the Docker image at build time — no volume needed.

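
To make the HTTP contract concrete, here is a minimal client sketch that uses only the standard library. It assumes the service listens on `http://localhost:8001`, the port used in the local-run instructions; `build_parse_request` and `parse_query` are hypothetical helper names, not part of the plan.

```python
import json
import urllib.request

# Assumption: local development address; adjust for other deployments.
PARSE_URL = "http://localhost:8001/parse"


def build_parse_request(query: str, lang: str, url: str = PARSE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /parse request with a JSON body."""
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_query(query: str, lang: str) -> dict:
    """Send the request and decode the JSON extraction (needs a running service)."""
    with urllib.request.urlopen(build_parse_request(query, lang), timeout=5) as resp:
        return json.load(resp)
```

Building the request separately from sending it keeps the payload shape checkable without a running server.
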
**Tech Stack:** Python 3.11, FastAPI 0.115, spaCy 3.8 (`de_core_news_sm` / `en_core_web_sm` / `es_core_news_sm`), dateparser 1.2, pytest

---

## File Map

| File | Responsibility |
|---|---|
| `nlp-service/models.py` | Pydantic request/response types — the extraction contract |
| `nlp-service/extractor.py` | NLP pipeline: model loading + 5 extraction steps |
| `nlp-service/main.py` | FastAPI app — `/parse`, `/health`, lifespan model loading |
| `nlp-service/requirements.txt` | Python dependencies |
| `nlp-service/Dockerfile` | Image — python:3.11-slim, models baked in, non-root user |
| `nlp-service/CLAUDE.md` | Service-level docs |
| `nlp-service/test_extractor.py` | Unit + integration tests for the pipeline |
| `nlp-service/test_main.py` | HTTP contract tests for the FastAPI endpoints |

---

## Task 1: Scaffold — requirements.txt, CLAUDE.md, models.py

**Files:**
- Create: `nlp-service/requirements.txt`
- Create: `nlp-service/CLAUDE.md`
- Create: `nlp-service/models.py`
- Create: `nlp-service/test_extractor.py` (skeleton only)

- [ ] **Step 1: Create `nlp-service/requirements.txt`**

```
fastapi[standard]==0.115.6
uvicorn[standard]==0.34.0
spacy>=3.8,<4.0
dateparser>=1.2,<2.0
pytest>=8.0,<9.0
httpx>=0.28,<1.0
```

- [ ] **Step 2: Create `nlp-service/CLAUDE.md`**

```markdown
# NLP Service

Lightweight FastAPI service that parses free-text search queries into structured extractions,
replacing Ollama for the Familienarchiv NL search feature.

## Stack

- Python 3.11, FastAPI 0.115, spaCy 3.8, dateparser 1.2

## Endpoints

- `POST /parse` — parse a free-text query, return extraction matching `OllamaExtraction` contract
- `GET /health` — returns `{"status": "ok"}` when all models are loaded

## Running locally

\`\`\`bash
pip install -r requirements.txt
# `spacy download` takes one model per invocation
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
uvicorn main:app --reload --port 8001

curl -X POST http://localhost:8001/parse \
  -H "Content-Type: application/json" \
  -d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}'
\`\`\`

## Testing

\`\`\`bash
pytest -v
\`\`\`

## Design spec

See `docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md`.

## Notes

This is a **prototype** for extraction quality evaluation. No docker-compose integration or
Java-side changes in this iteration. The extraction contract matches `OllamaExtraction` in
`backend/src/main/java/org/raddatz/familienarchiv/search/`.
```

- [ ] **Step 3: Write the failing test for Pydantic models**

Create `nlp-service/test_extractor.py`:

```python
import pytest
from pydantic import ValidationError


# ── Models ──────────────────────────────────────────────────────────────────

def test_parse_request_valid():
    from models import ParseRequest
    req = ParseRequest(query="Briefe von Opa", lang="de")
    assert req.query == "Briefe von Opa"
    assert req.lang == "de"


def test_parse_request_rejects_unknown_lang():
    from models import ParseRequest
    with pytest.raises(ValidationError):
        ParseRequest(query="Letters from grandpa", lang="fr")


def test_parse_response_serializes_nulls():
    from models import ParseResponse
    resp = ParseResponse(
        personNames=["Opa"],
        personRole="sender",
        dateFrom=None,
        dateTo="1920-12-31",
        keywords=["brief"],
        rawQuery="Briefe von Opa",
    )
    data = resp.model_dump()
    assert data["dateFrom"] is None
    assert data["dateTo"] == "1920-12-31"
    assert data["personRole"] == "sender"
```

- [ ] **Step 4: Run to confirm failure**

```bash
cd nlp-service
pip install -r requirements.txt
pytest test_extractor.py::test_parse_request_valid -v
```

Expected: `ModuleNotFoundError: No module named 'models'`

- [ ] **Step 5: Create `nlp-service/models.py`**

```python
from __future__ import annotations
from typing import Literal
from pydantic import BaseModel


class ParseRequest(BaseModel):
    query: str
    lang: Literal["de", "en", "es"]


class ParseResponse(BaseModel):
    personNames: list[str]
    personRole: Literal["sender", "receiver", "any"]
    dateFrom: str | None
    dateTo: str | None
    keywords: list[str]
    rawQuery: str
```

- [ ] **Step 6: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_parse_request_valid \
       test_extractor.py::test_parse_request_rejects_unknown_lang \
       test_extractor.py::test_parse_response_serializes_nulls -v
```

Expected: `3 passed`

- [ ] **Step 7: Commit**

```bash
git add nlp-service/
git commit -m "feat(nlp-service): scaffold — models, requirements, CLAUDE.md"
```

---

## Task 2: spaCy model loading

**Files:**
- Create: `nlp-service/extractor.py`
- Modify: `nlp-service/test_extractor.py`

Before running these tests, the three spaCy models must be installed. `spacy download` takes one model per invocation, so run it once per model:

```bash
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
```

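
A quick way to see which models are still missing before running the tests is to check whether each model package is importable, since downloaded spaCy models install as ordinary Python packages. The sketch below uses only the standard library; `missing_models` is a hypothetical helper name, not part of the plan.

```python
from importlib.util import find_spec

# The three model packages named in this plan.
REQUIRED_MODELS = ("de_core_news_sm", "en_core_web_sm", "es_core_news_sm")


def missing_models(names=REQUIRED_MODELS) -> list[str]:
    """Return the names that cannot be found as top-level packages."""
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_models()` in the active virtual environment returns an empty list once all three downloads have succeeded.
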
- [ ] **Step 1: Write the failing tests**

Append to `nlp-service/test_extractor.py`:

```python
# ── Model loading ────────────────────────────────────────────────────────────

import pytest


@pytest.fixture(scope="session")
def nlp_de():
    from extractor import get_nlp
    return get_nlp("de")


@pytest.fixture(scope="session")
def nlp_en():
    from extractor import get_nlp
    return get_nlp("en")


@pytest.fixture(scope="session")
def nlp_es():
    from extractor import get_nlp
    return get_nlp("es")


def test_get_nlp_de_loads(nlp_de):
    doc = nlp_de("Test")
    assert doc is not None


def test_get_nlp_en_loads(nlp_en):
    doc = nlp_en("Test")
    assert doc is not None


def test_get_nlp_es_loads(nlp_es):
    doc = nlp_es("Prueba")
    assert doc is not None


def test_get_nlp_unknown_lang_raises():
    from extractor import get_nlp
    with pytest.raises(ValueError, match="Unsupported language"):
        get_nlp("fr")
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_extractor.py::test_get_nlp_de_loads -v
```

Expected: `ModuleNotFoundError: No module named 'extractor'`

- [ ] **Step 3: Create `nlp-service/extractor.py` with model loading**

```python
from __future__ import annotations

import re
from datetime import date

import dateparser
import spacy
from spacy.language import Language

from models import ParseResponse

# ── Language model registry ──────────────────────────────────────────────────

_MODEL_NAMES: dict[str, str] = {
    "de": "de_core_news_sm",
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
}

_nlp_cache: dict[str, Language] = {}


def get_nlp(lang: str) -> Language:
    if lang not in _MODEL_NAMES:
        raise ValueError(f"Unsupported language: {lang!r}. Valid: {list(_MODEL_NAMES)}")
    if lang not in _nlp_cache:
        _nlp_cache[lang] = spacy.load(_MODEL_NAMES[lang])
    return _nlp_cache[lang]


def load_all_models() -> None:
    for lang in _MODEL_NAMES:
        get_nlp(lang)
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_get_nlp_de_loads \
       test_extractor.py::test_get_nlp_en_loads \
       test_extractor.py::test_get_nlp_es_loads \
       test_extractor.py::test_get_nlp_unknown_lang_raises -v
```

Expected: `4 passed`

- [ ] **Step 5: Commit**

```bash
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): spaCy model loading with get_nlp/load_all_models"
```

---

## Task 3: Person name extraction (NER)

**Files:**
- Modify: `nlp-service/extractor.py`
- Modify: `nlp-service/test_extractor.py`

- [ ] **Step 1: Write the failing tests**

Append to `nlp-service/test_extractor.py`:

```python
# ── Person name extraction ───────────────────────────────────────────────────

def _make_doc_with_ents(nlp, text: str, char_ents: list[tuple[int, int, str]]):
    """Create a Doc with manually injected entity spans (no NER model needed)."""
    doc = nlp.make_doc(text)
    spans = [doc.char_span(s, e, label=lbl) for s, e, lbl in char_ents]
    doc.ents = [sp for sp in spans if sp is not None]
    return doc


def test_extract_person_names_two_persons(nlp_de):
    from extractor import extract_person_names
    # "Briefe von Opa Hermann an Marie"
    #  0123456789012345678901234567890
    #            1111111111222222222233
    # "Opa Hermann" = 11..22, "Marie" = 26..31
    doc = _make_doc_with_ents(nlp_de, "Briefe von Opa Hermann an Marie", [
        (11, 22, "PER"),
        (26, 31, "PER"),
    ])
    assert extract_person_names(doc) == ["Opa Hermann", "Marie"]


def test_extract_person_names_preserves_order(nlp_de):
    from extractor import extract_person_names
    # Reversed: "Marie von Opa" — Marie comes first in text
    # "Marie" = 0..5, "Opa" = 10..13
    doc = _make_doc_with_ents(nlp_de, "Marie von Opa", [
        (0, 5, "PER"),
        (10, 13, "PER"),
    ])
    assert extract_person_names(doc) == ["Marie", "Opa"]


def test_extract_person_names_empty(nlp_de):
    from extractor import extract_person_names
    doc = _make_doc_with_ents(nlp_de, "Briefe aus dem Krieg", [])
    assert extract_person_names(doc) == []


def test_extract_person_names_ignores_non_per(nlp_de):
    from extractor import extract_person_names
    # DATE entity should not appear in personNames
    doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [(7, 11, "DATE")])
    assert extract_person_names(doc) == []
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_extractor.py::test_extract_person_names_two_persons -v
```

Expected: `ImportError: cannot import name 'extract_person_names' from 'extractor'`

- [ ] **Step 3: Add `extract_person_names` to `extractor.py`**

Add after the model loading section:

```python
# ── Step 1: Person name extraction ──────────────────────────────────────────

def extract_person_names(doc) -> list[str]:
    """Return person entity texts in left-to-right span order."""
    # de/es pipelines label people "PER"; en_core_web_sm uses "PERSON".
    return [ent.text for ent in doc.ents if ent.label_ in ("PER", "PERSON")]
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_extract_person_names_two_persons \
       test_extractor.py::test_extract_person_names_preserves_order \
       test_extractor.py::test_extract_person_names_empty \
       test_extractor.py::test_extract_person_names_ignores_non_per -v
```

Expected: `4 passed`

- [ ] **Step 5: Commit**

```bash
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): NER person name extraction"
```

---

## Task 4: Role detection

**Files:**
- Modify: `nlp-service/extractor.py`
- Modify: `nlp-service/test_extractor.py`

Role is only meaningful when exactly one PER entity is found. The function checks:
1. Dependency-tree children of the PER span's root with `dep_` in `("case", "prep", "mo")`
2. Fallback: the token immediately before the span

The synthetic docs in the tests below are built with `nlp.make_doc`, which only tokenizes, so they exercise the fallback path.

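
The fallback rule can be illustrated without spaCy. The sketch below works on a plain list of words and the start index of each name; `role_from_tokens` and its arguments are hypothetical names for illustration, and the preposition sets mirror the ones this task adds to `extractor.py`.

```python
# Preposition sets per language, copied from this task's constants.
SENDER = {"de": {"von", "vom"}, "en": {"from", "by"}, "es": {"de", "por"}}
RECEIVER = {"de": {"an", "nach", "für"}, "en": {"to", "for"}, "es": {"para", "a"}}


def role_from_tokens(tokens: list[str], name_starts: list[int], lang: str) -> str:
    """Fallback rule only: classify by the word right before the single name."""
    if len(name_starts) != 1:
        return "any"  # zero or several names: no single role to assign
    start = name_starts[0]
    if start == 0:
        return "any"  # the name opens the query, so nothing precedes it
    prev = tokens[start - 1].lower()
    if prev in SENDER[lang]:
        return "sender"
    if prev in RECEIVER[lang]:
        return "receiver"
    return "any"
```

For example, `role_from_tokens("Briefe von Marie".split(), [2], "de")` looks at "von" and classifies Marie as the sender.
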
- [ ] **Step 1: Write the failing tests**

Append to `nlp-service/test_extractor.py`:

```python
# ── Role detection ───────────────────────────────────────────────────────────

def test_role_sender_von(nlp_de):
    from extractor import detect_person_role
    # "Briefe von Marie" — "von" immediately before "Marie"
    # B=0..6, ' '=6, v=7..10, ' '=10, M=11..16
    doc = _make_doc_with_ents(nlp_de, "Briefe von Marie", [(11, 16, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "de") == "sender"


def test_role_receiver_an(nlp_de):
    from extractor import detect_person_role
    # "Briefe an Marie" — "an" immediately before "Marie"
    # B=0..6, ' '=6, a=7..9, ' '=9, M=10..15
    doc = _make_doc_with_ents(nlp_de, "Briefe an Marie", [(10, 15, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "de") == "receiver"


def test_role_two_persons_returns_any(nlp_de):
    from extractor import detect_person_role
    # "von Opa an Marie" — two PER spans → always "any"
    # v=0..3, ' '=3, O=4..7, ' '=7, a=8..10, ' '=10, M=11..16
    doc = _make_doc_with_ents(nlp_de, "von Opa an Marie", [
        (4, 7, "PER"),
        (11, 16, "PER"),
    ])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "de") == "any"


def test_role_no_prep_returns_any(nlp_de):
    from extractor import detect_person_role
    # "Briefe Marie" — no preposition
    # B=0..6, ' '=6, M=7..12
    doc = _make_doc_with_ents(nlp_de, "Briefe Marie", [(7, 12, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "de") == "any"


def test_role_empty_returns_any(nlp_de):
    from extractor import detect_person_role
    doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [])
    assert detect_person_role(doc, [], "de") == "any"


def test_role_sender_from_english(nlp_en):
    from extractor import detect_person_role
    # "letters from Marie" — "from" before "Marie"
    # l=0..7, ' '=7, f=8..12, ' '=12, M=13..18
    doc = _make_doc_with_ents(nlp_en, "letters from Marie", [(13, 18, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "en") == "sender"


def test_role_receiver_to_english(nlp_en):
    from extractor import detect_person_role
    # "letters to Marie"
    # l=0..7, ' '=7, t=8..10, ' '=10, M=11..16
    doc = _make_doc_with_ents(nlp_en, "letters to Marie", [(11, 16, "PER")])
    per_spans = list(doc.ents)
    assert detect_person_role(doc, per_spans, "en") == "receiver"
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_extractor.py::test_role_sender_von -v
```

Expected: `ImportError: cannot import name 'detect_person_role' from 'extractor'`

- [ ] **Step 3: Add role detection constants and function to `extractor.py`**

Add after `extract_person_names`:

```python
# ── Step 2: Role detection ───────────────────────────────────────────────────

_SENDER_PREPS: dict[str, frozenset[str]] = {
    "de": frozenset({"von", "vom"}),
    "en": frozenset({"from", "by"}),
    "es": frozenset({"de", "por"}),
}

_RECEIVER_PREPS: dict[str, frozenset[str]] = {
    "de": frozenset({"an", "nach", "für"}),
    "en": frozenset({"to", "for"}),
    "es": frozenset({"para", "a"}),
}


def detect_person_role(doc, per_spans: list, lang: str) -> str:
    """Return 'sender', 'receiver', or 'any'.

    Only meaningful for single-PER queries — two-person queries always return
    'any' because Java derives direction from list position.
    """
    if len(per_spans) != 1:
        return "any"

    span = per_spans[0]
    root = span.root
    sender = _SENDER_PREPS[lang]
    receiver = _RECEIVER_PREPS[lang]

    # Primary: dependency-tree children of the PER root
    for child in root.children:
        if child.dep_ in ("case", "prep", "mo"):
            if child.lower_ in sender:
                return "sender"
            if child.lower_ in receiver:
                return "receiver"

    # Fallback: token immediately before the span start
    if span.start > 0:
        prev = doc[span.start - 1]
        if prev.lower_ in sender:
            return "sender"
        if prev.lower_ in receiver:
            return "receiver"

    return "any"
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_role_sender_von \
       test_extractor.py::test_role_receiver_an \
       test_extractor.py::test_role_two_persons_returns_any \
       test_extractor.py::test_role_no_prep_returns_any \
       test_extractor.py::test_role_empty_returns_any \
       test_extractor.py::test_role_sender_from_english \
       test_extractor.py::test_role_receiver_to_english -v
```

Expected: `7 passed`

- [ ] **Step 5: Commit**

```bash
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): role detection (sender/receiver/any)"
```

---

## Task 5: Date parsing

**Files:**
- Modify: `nlp-service/extractor.py`
- Modify: `nlp-service/test_extractor.py`

Direction is detected from the token immediately before each DATE span. For "zwischen/between/entre", both DATE spans form the range (sorted so earlier = `dateFrom`). A bare year with no direction token produces a closed year-range (`dateFrom` = Jan 1, `dateTo` = Dec 31).

Note: "nach" appears in both `_RECEIVER_PREPS["de"]` and the date-after set. This is safe — role detection only examines tokens before PER spans; date parsing only examines tokens before DATE spans. They operate on different span types.

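
The range rules can be worked through on plain whitespace tokens, independent of spaCy. This is a minimal sketch, not the service code: `year_range` is a hypothetical name, it handles four-digit years only, and it steps over the Spanish connector "de" in "antes de 1920". Working from raw tokens is also relevant because the German and Spanish small models label only LOC, MISC, ORG and PER, so they emit no DATE entities on their own.

```python
from __future__ import annotations

import re
from datetime import date

# Direction words taken from this task's German, English and Spanish sets.
BEFORE = {"vor", "before", "antes"}
AFTER = {"nach", "after", "después", "despues"}
BETWEEN = {"zwischen", "between", "entre"}
YEAR = re.compile(r"\d{4}")


def year_range(query: str) -> tuple[str | None, str | None]:
    """Map a query to (date_from, date_to) ISO strings using whitespace tokens."""
    words = query.lower().split()
    years = [
        (i, int(w)) for i, w in enumerate(words)
        if YEAR.fullmatch(w) and 1000 < int(w) < 3000
    ]
    if not years:
        return None, None

    # Two years plus a "between" word: sort so the earlier year is date_from.
    if len(years) >= 2 and BETWEEN.intersection(words):
        low, high = sorted(y for _, y in years[:2])
        return date(low, 1, 1).isoformat(), date(high, 12, 31).isoformat()

    pos, year = years[0]
    prev = words[pos - 1] if pos > 0 else ""
    # Spanish "antes de 1920": step over the connecting "de".
    if prev == "de" and pos > 1:
        prev = words[pos - 2]

    start = date(year, 1, 1).isoformat()
    end = date(year, 12, 31).isoformat()
    if prev in BEFORE:
        return None, end
    if prev in AFTER:
        return start, None
    return start, end  # bare year: closed year-range
```

As in the tests for this task, "before 1920" maps to an upper bound of `1920-12-31`, so the named year itself is included in the range.
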
- [ ] **Step 1: Write the failing tests**

Append to `nlp-service/test_extractor.py`:

```python
# ── Date parsing ─────────────────────────────────────────────────────────────

def test_date_vor_1920(nlp_de):
    from extractor import extract_dates
    # "Briefe vor 1920" — "1920" at chars 11..15
    doc = _make_doc_with_ents(nlp_de, "Briefe vor 1920", [(11, 15, "DATE")])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from is None
    assert date_to == "1920-12-31"


def test_date_nach_1900(nlp_de):
    from extractor import extract_dates
    # "Briefe nach 1900" — "1900" at chars 12..16
    doc = _make_doc_with_ents(nlp_de, "Briefe nach 1900", [(12, 16, "DATE")])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from == "1900-01-01"
    assert date_to is None


def test_date_zwischen_1900_und_1920(nlp_de):
    from extractor import extract_dates
    # "zwischen 1900 und 1920"
    # z=0..8, ' '=8, 1900=9..13, ' '=13, u=14..17, ' '=17, 1920=18..22
    doc = _make_doc_with_ents(nlp_de, "zwischen 1900 und 1920", [
        (9, 13, "DATE"),
        (18, 22, "DATE"),
    ])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from == "1900-01-01"
    assert date_to == "1920-12-31"


def test_date_bare_year_makes_range(nlp_de):
    from extractor import extract_dates
    # "Briefe 1920" — no direction token → year-range
    # B=0..6, ' '=6, 1920=7..11
    doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [(7, 11, "DATE")])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from == "1920-01-01"
    assert date_to == "1920-12-31"


def test_date_no_date_entity(nlp_de):
    from extractor import extract_dates
    doc = _make_doc_with_ents(nlp_de, "Briefe von Opa", [])
    date_from, date_to = extract_dates(doc, "de")
    assert date_from is None
    assert date_to is None


def test_date_before_english(nlp_en):
    from extractor import extract_dates
    # "letters before 1920" — "1920" at chars 15..19
    doc = _make_doc_with_ents(nlp_en, "letters before 1920", [(15, 19, "DATE")])
    date_from, date_to = extract_dates(doc, "en")
    assert date_from is None
    assert date_to == "1920-12-31"


def test_date_after_english(nlp_en):
    from extractor import extract_dates
    # "letters after 1900" — "1900" at chars 14..18
    doc = _make_doc_with_ents(nlp_en, "letters after 1900", [(14, 18, "DATE")])
    date_from, date_to = extract_dates(doc, "en")
    assert date_from == "1900-01-01"
    assert date_to is None
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_extractor.py::test_date_vor_1920 -v
```

Expected: `ImportError: cannot import name 'extract_dates' from 'extractor'`

- [ ] **Step 3: Add date parsing to `extractor.py`**

Add after `detect_person_role`:

```python
# ── Step 3: Date parsing ─────────────────────────────────────────────────────

_YEAR_RE = re.compile(r"^\d{4}$")

_DATE_BEFORE: dict[str, frozenset[str]] = {
    "de": frozenset({"vor"}),
    "en": frozenset({"before"}),
    "es": frozenset({"antes"}),
}

_DATE_AFTER: dict[str, frozenset[str]] = {
    "de": frozenset({"nach"}),
    "en": frozenset({"after"}),
    "es": frozenset({"después", "despues"}),
}

_DATE_BETWEEN: dict[str, frozenset[str]] = {
    "de": frozenset({"zwischen"}),
    "en": frozenset({"between"}),
    "es": frozenset({"entre"}),
}


def _parse_date_text(text: str, lang: str) -> date | None:
    text = text.strip()
    if _YEAR_RE.match(text):
        year = int(text)
        if 1000 < year < 3000:
            return date(year, 1, 1)
    parsed = dateparser.parse(
        text,
        languages=[lang],
        settings={"PREFER_DAY_OF_MONTH": "first", "RETURN_AS_TIMEZONE_AWARE": False},
    )
    return parsed.date() if parsed else None


def _year_end(d: date) -> date:
    """If d is Jan 1, return Dec 31 of the same year (year-only boundary)."""
    if d.month == 1 and d.day == 1:
        return date(d.year, 12, 31)
    return d


def _find_date_spans(doc) -> list:
    """Return DATE entities, else one-token spans over bare four-digit years.

    The fallback is needed because the German and Spanish small models label
    only LOC, MISC, ORG and PER, so their NER never emits DATE.
    """
    spans = [ent for ent in doc.ents if ent.label_ == "DATE"]
    if spans:
        return spans
    return [doc[tok.i : tok.i + 1] for tok in doc if _YEAR_RE.match(tok.text)]


def extract_dates(doc, lang: str) -> tuple[str | None, str | None]:
    """Return (date_from, date_to) as ISO strings or None."""
    date_spans = _find_date_spans(doc)
    if not date_spans:
        return None, None

    between_tokens = _DATE_BETWEEN[lang]
    before_tokens = _DATE_BEFORE[lang]
    after_tokens = _DATE_AFTER[lang]

    # "zwischen X und Y" / "between X and Y" — two DATE spans form a range
    has_between = any(tok.lower_ in between_tokens for tok in doc)
    if has_between and len(date_spans) >= 2:
        parsed = []
        for span in date_spans[:2]:
            d = _parse_date_text(span.text, lang)
            if d:
                parsed.append(d)
        if len(parsed) == 2:
            parsed.sort()
            return parsed[0].isoformat(), _year_end(parsed[1]).isoformat()

    # Single DATE span — use direction token
    span = date_spans[0]
    d = _parse_date_text(span.text, lang)
    if not d:
        return None, None

    prev_i = span.start - 1
    # Spanish puts "de" before the year ("antes de 1920"): look one token further back.
    if lang == "es" and prev_i > 0 and doc[prev_i].lower_ == "de":
        prev_i -= 1
    prev_lower = doc[prev_i].lower_ if prev_i >= 0 else ""

    if prev_lower in before_tokens:
        return None, _year_end(d).isoformat()
    if prev_lower in after_tokens:
        return d.isoformat(), None
    # Bare year/date — closed year-range
    return d.isoformat(), _year_end(d).isoformat()
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_date_vor_1920 \
       test_extractor.py::test_date_nach_1900 \
       test_extractor.py::test_date_zwischen_1900_und_1920 \
       test_extractor.py::test_date_bare_year_makes_range \
       test_extractor.py::test_date_no_date_entity \
       test_extractor.py::test_date_before_english \
       test_extractor.py::test_date_after_english -v
```

Expected: `7 passed`

- [ ] **Step 5: Commit**

```bash
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): date range extraction with direction detection"
```

---

## Task 6: Keyword extraction

**Files:**
- Modify: `nlp-service/extractor.py`
- Modify: `nlp-service/test_extractor.py`

Keywords are POS-filtered content words (NOUN or PROPN, non-stop, length ≥ 3, not inside any NER span). These are passed to Java's `resolveTags()` which fuzzy-matches them against the tag table — no tag lookup in Python.

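
The filter rules can be tried out without a spaCy model. In the sketch below, `Tok` is a hypothetical stand-in for the token attributes the filter reads, with POS tags and stop-word flags assigned by hand; `content_keywords` is likewise an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for the spaCy token attributes used by the filter."""
    i: int            # token index in the document
    lemma: str        # base form
    pos: str          # coarse part-of-speech tag
    is_stop: bool = False


def content_keywords(tokens: list[Tok], excluded: set[int]) -> list[str]:
    """Keep lowercased NOUN/PROPN lemmas: non-stop, length >= 3, first occurrence only."""
    seen: set[str] = set()
    out: list[str] = []
    for t in tokens:
        lemma = t.lemma.lower()
        keep = (
            t.i not in excluded          # not inside an entity span
            and t.pos in ("NOUN", "PROPN")
            and not t.is_stop
            and len(lemma) >= 3
            and lemma not in seen        # deduplicate, preserving order
        )
        if keep:
            seen.add(lemma)
            out.append(lemma)
    return out
```

Passing the indices covered by entity spans as `excluded` keeps person names out of the keyword list, which is the same role `excluded_spans` plays in the function this task adds.
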
- [ ] **Step 1: Write the failing tests**

Append to `nlp-service/test_extractor.py`:

```python
# ── Keyword extraction ───────────────────────────────────────────────────────

def test_keywords_extracts_nouns(nlp_de):
    from extractor import extract_keywords
    # Use real NLP for POS tags; disable NER to control entities manually
    doc = nlp_de("Briefe aus dem Krieg", disable=["ner"])
    keywords = extract_keywords(doc, [])
    # "Brief" (NOUN, lemma "Brief") and "Krieg" (NOUN) should appear
    assert "brief" in keywords
    assert "krieg" in keywords


def test_keywords_excludes_stopwords(nlp_de):
    from extractor import extract_keywords
    doc = nlp_de("Briefe aus dem Krieg", disable=["ner"])
    keywords = extract_keywords(doc, [])
    # "dem" is a stopword article (DET) — must not appear
    assert "dem" not in keywords


def test_keywords_excludes_per_ner_spans(nlp_de):
    from extractor import extract_keywords
    # Run full NLP so POS tagger fires, then inject PER span over "Hermann"
    doc = nlp_de("Briefe von Hermann")
    per_span = doc.char_span(11, 18, label="PER")  # "Hermann" = 11..18
    if per_span:
        doc.ents = [per_span]
    keywords = extract_keywords(doc, list(doc.ents))
    assert "hermann" not in keywords


def test_keywords_excludes_short_lemmas(nlp_de):
    from extractor import extract_keywords
    # Neither the two-letter preposition "an" nor the pronoun "ihn" may appear
    doc = nlp_de("Briefe an ihn", disable=["ner"])
    keywords = extract_keywords(doc, [])
    assert "an" not in keywords
    assert "ihn" not in keywords


def test_keywords_deduplicates(nlp_de):
    from extractor import extract_keywords
    doc = nlp_de("Brief Brief Krieg", disable=["ner"])
    keywords = extract_keywords(doc, [])
    assert keywords.count("brief") == 1
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_extractor.py::test_keywords_extracts_nouns -v
```

Expected: `ImportError: cannot import name 'extract_keywords' from 'extractor'`

- [ ] **Step 3: Add keyword extraction to `extractor.py`**

Add after `extract_dates`:

```python
# ── Step 4: Keyword extraction ───────────────────────────────────────────────

def extract_keywords(doc, excluded_spans: list) -> list[str]:
    """Return lowercased lemmas of content words not inside any NER span."""
    excluded_indices: set[int] = set()
    for span in excluded_spans:
        excluded_indices.update(range(span.start, span.end))

    seen: set[str] = set()
    keywords: list[str] = []
    for token in doc:
        if token.i in excluded_indices:
            continue
        if token.pos_ not in ("NOUN", "PROPN"):
            continue
        if token.is_stop:
            continue
        lemma = token.lemma_.lower()
        if len(lemma) < 3:
            continue
        if lemma not in seen:
            seen.add(lemma)
            keywords.append(lemma)

    return keywords
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_keywords_extracts_nouns \
       test_extractor.py::test_keywords_excludes_stopwords \
       test_extractor.py::test_keywords_excludes_per_ner_spans \
       test_extractor.py::test_keywords_excludes_short_lemmas \
       test_extractor.py::test_keywords_deduplicates -v
```

Expected: `5 passed`

- [ ] **Step 5: Commit**

```bash
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): keyword extraction (POS-filtered, deduped lemmas)"
```

---

## Task 7: Full `extract()` function

**Files:**
- Modify: `nlp-service/extractor.py`
- Modify: `nlp-service/test_extractor.py`

This assembles all steps. Tests here use **real NLP** (no synthetic docs) to validate actual extraction quality.

- [ ] **Step 1: Write the failing tests**

Append to `nlp-service/test_extractor.py`:

```python
# ── Full extract() pipeline ──────────────────────────────────────────────────

def test_extract_dates_de():
    from extractor import extract
    result = extract("Briefe vor 1920", "de")
    assert result.dateFrom is None
    assert result.dateTo == "1920-12-31"
    assert result.rawQuery == "Briefe vor 1920"
    assert result.personNames == []
    assert result.personRole == "any"


def test_extract_keywords_from_topic_de():
    from extractor import extract
    result = extract("Briefe aus dem Krieg", "de")
    assert "krieg" in result.keywords
    assert result.dateFrom is None
    assert result.dateTo is None


def test_extract_dates_en():
    from extractor import extract
    result = extract("letters before 1920", "en")
    assert result.dateTo == "1920-12-31"
    assert result.dateFrom is None


def test_extract_dates_es():
    from extractor import extract
    result = extract("cartas antes de 1920", "es")
    assert result.dateTo == "1920-12-31"
    assert result.dateFrom is None


def test_extract_rawquery_echoed():
    from extractor import extract
    q = "Texte über Weihnachten"
    result = extract(q, "de")
    assert result.rawQuery == q


def test_extract_response_fields_are_complete():
    from extractor import extract
    result = extract("Briefe 1900", "de")
    assert isinstance(result.personNames, list)
    assert result.personRole in ("sender", "receiver", "any")
    assert isinstance(result.keywords, list)
    assert result.rawQuery == "Briefe 1900"
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_extractor.py::test_extract_dates_de -v
```

Expected: `ImportError: cannot import name 'extract' from 'extractor'`

- [ ] **Step 3: Add `extract()` to `extractor.py`**

Add at the bottom of `extractor.py`:

```python
# ── Step 5: Assembly ─────────────────────────────────────────────────────────

def extract(query: str, lang: str) -> ParseResponse:
    """Run the full NLP pipeline and return a ParseResponse."""
    nlp = get_nlp(lang)
    doc = nlp(query)

    # de/es pipelines label people "PER"; en_core_web_sm uses "PERSON".
    per_spans = [ent for ent in doc.ents if ent.label_ in ("PER", "PERSON")]

    person_names = extract_person_names(doc)
    person_role = detect_person_role(doc, per_spans, lang)
    date_from, date_to = extract_dates(doc, lang)
    keywords = extract_keywords(doc, list(doc.ents))

    return ParseResponse(
        personNames=person_names,
        personRole=person_role,
        dateFrom=date_from,
        dateTo=date_to,
        keywords=keywords,
        rawQuery=query,
    )
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_extractor.py::test_extract_dates_de \
       test_extractor.py::test_extract_keywords_from_topic_de \
       test_extractor.py::test_extract_dates_en \
       test_extractor.py::test_extract_dates_es \
       test_extractor.py::test_extract_rawquery_echoed \
       test_extractor.py::test_extract_response_fields_are_complete -v
```

Expected: `6 passed`

- [ ] **Step 5: Run the full test suite to confirm no regressions**

```bash
pytest test_extractor.py -v
```

Expected: all tests pass

- [ ] **Step 6: Commit**

```bash
git add nlp-service/extractor.py nlp-service/test_extractor.py
git commit -m "feat(nlp-service): full extract() pipeline — assembles all steps"
```

---

## Task 8: FastAPI app
|
|
|
|
**Files:**
|
|
- Create: `nlp-service/main.py`
|
|
- Create: `nlp-service/test_main.py`

- [ ] **Step 1: Write the failing tests**

Create `nlp-service/test_main.py`:

```python
import pytest
from fastapi.testclient import TestClient


@pytest.fixture(scope="session")
def client():
    from main import app
    with TestClient(app) as c:
        yield c


def test_health(client):
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "ok"}


def test_parse_returns_200_with_all_fields(client):
    response = client.post("/parse", json={"query": "Briefe vor 1920", "lang": "de"})
    assert response.status_code == 200
    data = response.json()
    assert "personNames" in data
    assert "personRole" in data
    assert data["personRole"] in ("sender", "receiver", "any")
    assert "dateFrom" in data
    assert "dateTo" in data
    assert "keywords" in data
    assert "rawQuery" in data
    assert data["rawQuery"] == "Briefe vor 1920"
    assert data["dateTo"] == "1920-12-31"


def test_parse_unknown_lang_returns_422(client):
    response = client.post("/parse", json={"query": "test", "lang": "fr"})
    assert response.status_code == 422


def test_parse_missing_query_returns_422(client):
    response = client.post("/parse", json={"lang": "de"})
    assert response.status_code == 422


def test_parse_all_languages(client):
    cases = [
        ("de", "Briefe vor 1920"),
        ("en", "letters before 1920"),
        ("es", "cartas antes de 1920"),
    ]
    for lang, query in cases:
        response = client.post("/parse", json={"query": query, "lang": lang})
        assert response.status_code == 200, f"Failed for lang={lang}"
        assert response.json()["dateTo"] == "1920-12-31", f"Wrong dateTo for lang={lang}"
```

- [ ] **Step 2: Run to confirm failure**

```bash
pytest test_main.py::test_health -v
```

Expected: `ModuleNotFoundError: No module named 'main'`

- [ ] **Step 3: Create `nlp-service/main.py`**

```python
import logging
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException

from extractor import extract, load_all_models
from models import ParseRequest, ParseResponse

logger = logging.getLogger(__name__)


@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("Loading spaCy models...")
    load_all_models()
    logger.info("All models ready.")
    yield


app = FastAPI(lifespan=lifespan)


@app.get("/health")
def health() -> dict:
    return {"status": "ok"}


@app.post("/parse", response_model=ParseResponse)
def parse(request: ParseRequest) -> ParseResponse:
    try:
        return extract(request.query, request.lang)
    except Exception as exc:
        logger.exception("Extraction pipeline failed")
        raise HTTPException(status_code=500, detail=str(exc)) from exc
```

- [ ] **Step 4: Run tests to confirm they pass**

```bash
pytest test_main.py -v
```

Expected: `5 passed`

- [ ] **Step 5: Run the full test suite**

```bash
pytest -v
```

Expected: all tests pass

- [ ] **Step 6: Smoke-test the running service**

```bash
uvicorn main:app --reload --port 8001 &
sleep 2

curl -s http://localhost:8001/health
# Expected: {"status":"ok"}

curl -s -X POST http://localhost:8001/parse \
  -H "Content-Type: application/json" \
  -d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}' | python3 -m json.tool

# Expected (spaCy may or may not catch "Opa Hermann"/"Marie" as PER):
# {
#   "personNames": [...],
#   "personRole": "any",
#   "dateFrom": null,
#   "dateTo": "1920-12-31",
#   "keywords": ["brief"],
#   "rawQuery": "Briefe von Opa Hermann an Marie vor 1920"
# }

kill %1
```
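
A fixed `sleep` can race with model loading on a slow machine, since `/health` only answers once the lifespan hook has finished. As an alternative, the smoke test could poll the endpoint until it responds. The sketch below uses only the standard library; `wait_for_health` and its default timeout and interval are illustrative assumptions, not part of the service.

```python
import json
import time
import urllib.error
import urllib.request


def wait_for_health(url: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `url` until it returns 200 with {"status": "ok"} or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200 and json.load(resp) == {"status": "ok"}:
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            # Server not accepting connections yet, or the body was not valid JSON.
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_health("http://localhost:8001/health")` before the `curl` commands would replace the `sleep 2` line; a `False` return means the service never became ready within the timeout.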

- [ ] **Step 7: Commit**

```bash
git add nlp-service/main.py nlp-service/test_main.py
git commit -m "feat(nlp-service): FastAPI app with /parse and /health endpoints"
```

---

## Task 9: Dockerfile

**Files:**
- Create: `nlp-service/Dockerfile`

- [ ] **Step 1: Create `nlp-service/Dockerfile`**

```dockerfile
FROM python:3.11-slim

WORKDIR /app

RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Bake models into the image: no volume needed, ~350 MB total
RUN python -m spacy download de_core_news_sm \
    && python -m spacy download en_core_web_sm \
    && python -m spacy download es_core_news_sm

COPY . .

RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1001 nlp \
    && chown -R nlp:nlp /app

USER nlp

EXPOSE 8001

HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
    CMD curl -f http://localhost:8001/health || exit 1

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
```

- [ ] **Step 2: Build the image**

```bash
cd nlp-service
docker build -t nlp-service:prototype .
```

Expected: build completes, image ~350 MB

- [ ] **Step 3: Run and smoke-test the container**

```bash
docker run --rm -d -p 8001:8001 --name nlp-test nlp-service:prototype
sleep 5

curl -s http://localhost:8001/health
# Expected: {"status":"ok"}

curl -s -X POST http://localhost:8001/parse \
  -H "Content-Type: application/json" \
  -d '{"query": "Briefe aus dem Krieg", "lang": "de"}' | python3 -m json.tool

docker stop nlp-test
```
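
The container check prints the `/parse` body but does not assert anything about it. If a scripted check is wanted, the response shape could be validated along the following lines. This is a sketch only: the field names and role values are taken from the HTTP contract tests in this plan, while `check_parse_payload` and the sample body are hypothetical.

```python
import json

# Field names and role values as asserted in the /parse contract tests.
EXPECTED_FIELDS = ("personNames", "personRole", "dateFrom", "dateTo", "keywords", "rawQuery")
VALID_ROLES = ("sender", "receiver", "any")


def check_parse_payload(raw: str) -> list[str]:
    """Return a list of problems found in a /parse JSON body (empty if none)."""
    data = json.loads(raw)
    problems = [f"missing field: {name}" for name in EXPECTED_FIELDS if name not in data]
    if "personRole" in data and data["personRole"] not in VALID_ROLES:
        problems.append(f"unexpected personRole: {data['personRole']!r}")
    for name in ("personNames", "keywords"):
        if name in data and not isinstance(data[name], list):
            problems.append(f"{name} is not a list")
    return problems


# Hand-written sample body for illustration; not real service output.
sample = json.dumps({
    "personNames": [],
    "personRole": "any",
    "dateFrom": None,
    "dateTo": None,
    "keywords": [],
    "rawQuery": "Briefe aus dem Krieg",
})
print(check_parse_payload(sample))
```

An empty list means the body has every expected field; any entries describe what is missing or malformed.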

- [ ] **Step 4: Commit**

```bash
git add nlp-service/Dockerfile
git commit -m "feat(nlp-service): Dockerfile (python:3.11-slim, models baked in)"
```