feat(nlp-service): replace spaCy NER with DB-backed PersonMatcher

Rule-based pipeline: persons matched via rapidfuzz against all known
names loaded from DB at startup. Fixes first-name-only extraction
(Eugenie, Herbert), merged-span bug (Herbert + Eugenie de Gruyter),
false positives on compound nouns, and EN/ES model failures.
Date extraction unchanged (regex). No spaCy models required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Marcel
2026-06-07 11:00:03 +02:00
parent 9472d8c25e
commit 6c5cf8ec9b
8 changed files with 939 additions and 551 deletions
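
The fuzzy lookup described in the commit message can be illustrated with a minimal sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for rapidfuzz's `token_sort_ratio`, so the scores will not be identical to those of the real service. `KNOWN_NAMES`, `similarity` and `is_known_person` are hypothetical names chosen for illustration only.

```python
from difflib import SequenceMatcher

# Hypothetical name index; the real service builds its variants from DB rows.
KNOWN_NAMES = [
    "clara cram", "clara", "cram",
    "herbert cram", "herbert",
    "eugenie de gruyter", "eugenie", "de gruyter",
]


def similarity(a: str, b: str) -> float:
    """Score 0..100 after sorting tokens (rough analogue of token_sort_ratio)."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return 100.0 * SequenceMatcher(None, a_sorted, b_sorted).ratio()


def is_known_person(text: str, threshold: float = 80.0) -> bool:
    """True if text is close enough to any indexed name variant."""
    return any(similarity(text, name) >= threshold for name in KNOWN_NAMES)
```

Because first names are indexed as separate variants, a query fragment such as "Eugenie" matches without a surname, and a typo such as "Herrbert Cram" still clears the threshold.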


@@ -5,23 +5,30 @@ replacing Ollama for the Familienarchiv NL search feature.
## Stack ## Stack
- Python 3.11, FastAPI 0.115, spaCy 3.8, dateparser 1.2 - Python 3.11, FastAPI 0.115, rapidfuzz 3.x, dateparser 1.2, psycopg2-binary
No ML models — persons are matched against the live DB via fuzzy lookup.
## Endpoints ## Endpoints
- `POST /parse` — parse a free-text query, return extraction matching `OllamaExtraction` contract - `POST /parse` — parse a free-text query, return extraction matching `OllamaExtraction` contract
- `GET /health` — returns `{"status": "ok"}` when all models are loaded - `GET /health` — returns `{"status": "ok", "persons_loaded": N}`
## Running locally ## Running locally
```bash ```bash
pip install -r requirements.txt pip install -r requirements.txt
python -m spacy download de_core_news_sm en_core_web_sm es_core_news_sm
# Without DB (empty person matcher — dates and keywords still work):
uvicorn main:app --reload --port 8001 uvicorn main:app --reload --port 8001
# With DB (full person matching):
DATABASE_URL=postgresql://archive_user:secret@localhost:5432/family_archive_db \
uvicorn main:app --reload --port 8001
curl -X POST http://localhost:8001/parse \ curl -X POST http://localhost:8001/parse \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}' -d '{"query": "Briefe von Clara Cram an Walter de Gruyter vor 1920", "lang": "de"}'
``` ```
## Testing ## Testing
@@ -30,6 +37,14 @@ curl -X POST http://localhost:8001/parse \
pytest -v pytest -v
``` ```
No DB required for tests — fixture pre-seeds the PersonMatcher with a small test corpus.
## Architecture
- `person_matcher.py` — DB-backed name lookup: loads all persons at startup, fuzzy-matches query tokens after person prepositions
- `extractor.py` — pipeline: persons → role → dates (regex) → keywords (stopword filter)
- `main.py` — FastAPI app; reads `DATABASE_URL` env var at startup
## Design spec ## Design spec
See `docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md`. See `docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md`.
@@ -39,3 +54,5 @@ See `docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md`.
This is a **prototype** for extraction quality evaluation. No docker-compose integration or This is a **prototype** for extraction quality evaluation. No docker-compose integration or
Java-side changes in this iteration. The extraction contract matches `OllamaExtraction` in Java-side changes in this iteration. The extraction contract matches `OllamaExtraction` in
`backend/src/main/java/org/raddatz/familienarchiv/search/`. `backend/src/main/java/org/raddatz/familienarchiv/search/`.
Test sentences for manual evaluation are in `test_sentences.md`.
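
The curl call in the README can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the service runs on the local port shown in the README. `BASE_URL`, `build_parse_request` and `parse_query` are hypothetical helper names, not part of the service.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"  # assumption: local dev port from the README


def build_parse_request(query: str, lang: str) -> urllib.request.Request:
    """Build the POST /parse request without sending it."""
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_query(query: str, lang: str) -> dict:
    """Send the request and decode the JSON reply (needs a running service)."""
    with urllib.request.urlopen(build_parse_request(query, lang), timeout=5) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.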


@@ -1,46 +1,33 @@
"""Rule-based NLP pipeline: dates via regex, persons via DB-backed matcher."""
from __future__ import annotations from __future__ import annotations
import re import re
from datetime import date from datetime import date
from typing import TYPE_CHECKING
import dateparser import dateparser
import spacy
from spacy.language import Language
from models import ParseResponse from models import ParseResponse
from person_matcher import PersonMatcher
# ── Language model registry ────────────────────────────────────────────────── if TYPE_CHECKING:
pass
_MODEL_NAMES: dict[str, str] = { # ── Module-level PersonMatcher (set at startup) ───────────────────────────────
"de": "de_core_news_sm",
"en": "en_core_web_sm",
"es": "es_core_news_sm",
}
_nlp_cache: dict[str, Language] = {} _matcher: PersonMatcher | None = None
def get_nlp(lang: str) -> Language: def set_person_matcher(m: PersonMatcher) -> None:
if lang not in _MODEL_NAMES: global _matcher
raise ValueError(f"Unsupported language: {lang!r}. Valid: {list(_MODEL_NAMES)}") _matcher = m
if lang not in _nlp_cache:
_nlp_cache[lang] = spacy.load(_MODEL_NAMES[lang])
return _nlp_cache[lang]
def load_all_models() -> None: def get_person_matcher() -> PersonMatcher | None:
for lang in _MODEL_NAMES: return _matcher
get_nlp(lang)
# ── Step 1: Person name extraction ────────────────────────────────────────── # ── Preposition sets ──────────────────────────────────────────────────────────
def extract_person_names(doc) -> list[str]:
"""Return PER entity texts in left-to-right span order."""
return [ent.text for ent in doc.ents if ent.label_ == "PER"]
# ── Step 2: Role detection ───────────────────────────────────────────────────
_SENDER_PREPS: dict[str, frozenset[str]] = { _SENDER_PREPS: dict[str, frozenset[str]] = {
"de": frozenset({"von", "vom"}), "de": frozenset({"von", "vom"}),
@@ -54,43 +41,12 @@ _RECEIVER_PREPS: dict[str, frozenset[str]] = {
"es": frozenset({"para", "a"}), "es": frozenset({"para", "a"}),
} }
_ALL_PERSON_PREPS: dict[str, frozenset[str]] = {
lang: _SENDER_PREPS[lang] | _RECEIVER_PREPS[lang]
for lang in ("de", "en", "es")
}
def detect_person_role(doc, per_spans: list, lang: str) -> str: # ── Date direction tokens ─────────────────────────────────────────────────────
"""Return 'sender', 'receiver', or 'any'.
Only meaningful for single-PER queries — two-person queries always return
'any' because Java derives direction from list position.
"""
if len(per_spans) != 1:
return "any"
span = per_spans[0]
root = span.root
sender = _SENDER_PREPS[lang]
receiver = _RECEIVER_PREPS[lang]
# Primary: dependency-tree children of the PER root
for child in root.children:
if child.dep_ in ("case", "prep", "mo"):
if child.lower_ in sender:
return "sender"
if child.lower_ in receiver:
return "receiver"
# Fallback: token immediately before the span start
if span.start > 0:
prev = doc[span.start - 1]
if prev.lower_ in sender:
return "sender"
if prev.lower_ in receiver:
return "receiver"
return "any"
# ── Step 3: Date parsing ─────────────────────────────────────────────────────
_YEAR_RE = re.compile(r"^\d{4}$")
_DATE_BEFORE: dict[str, frozenset[str]] = { _DATE_BEFORE: dict[str, frozenset[str]] = {
"de": frozenset({"vor"}), "de": frozenset({"vor"}),
@@ -110,130 +66,219 @@ _DATE_BETWEEN: dict[str, frozenset[str]] = {
"es": frozenset({"entre"}), "es": frozenset({"entre"}),
} }
# ── Stopword lists ────────────────────────────────────────────────────────────
def _parse_date_text(text: str, lang: str) -> date | None: _STOPWORDS: dict[str, frozenset[str]] = {
text = text.strip() "de": frozenset({
if _YEAR_RE.match(text): "der", "die", "das", "des", "dem", "den",
year = int(text) "ein", "eine", "einem", "einen", "einer", "eines",
if 1000 < year < 3000: "er", "sie", "es", "wir", "ihr", "ich", "du",
return date(year, 1, 1) "und", "oder", "aber", "doch", "auch", "noch", "nur",
parsed = dateparser.parse( "in", "an", "auf", "aus", "bei", "mit", "nach", "von", "vom",
text, "vor", "zu", "zur", "zum", "durch", "für", "über", "unter",
languages=[lang], "zwischen", "gegen", "ohne", "um", "bis", "seit", "wegen",
settings={"PREFER_DAY_OF_MONTH": "first", "RETURN_AS_TIMEZONE_AWARE": False}, "ist", "sind", "war", "waren", "wird", "werden",
) "hat", "haben", "hatte", "hatten",
return parsed.date() if parsed else None "sein", "seine", "seinen", "seiner", "seines",
"ihre", "ihren", "ihrer", "ihrem", "ihres",
"nicht", "kein", "keine", "keinen", "keinem", "keines",
"so", "wie", "als", "da", "hier", "dort", "wo", "wer", "was",
"im", "am", "beim", "ins", "ans",
"ja", "nein", "denn", "wenn", "weil", "dass", "ob", "damit",
"alle", "alles", "mehr", "sehr", "viel", "wenig",
"diesem", "dieser", "dieses", "diese", "diesen",
"jetzt", "dann", "nun", "schon", "wohl", "wurde", "wurden",
"worden", "geschrieben", "seinen", "ihrer",
"beim", "nach", "zum", "zur", "dem", "den",
"seine", "ihrem", "Jahr", "Jahren", "jahre", "jahr",
}),
"en": frozenset({
"the", "a", "an", "and", "or", "but", "in", "on", "at", "to",
"for", "of", "with", "by", "from", "about", "as", "into",
"through", "is", "are", "was", "were", "be", "been", "being",
"have", "has", "had", "do", "does", "did", "will", "would",
"could", "should", "may", "might", "must", "shall", "can",
"i", "you", "he", "she", "it", "we", "they", "their", "our",
"his", "her", "its", "my", "your",
"this", "that", "these", "those", "all", "not", "no", "nor",
"very", "more", "most", "much", "many", "some", "any",
"before", "after", "between", "during", "since", "until",
"when", "where", "who", "which", "what", "how",
}),
"es": frozenset({
"el", "la", "los", "las", "un", "una", "unos", "unas",
"y", "o", "pero", "sin", "con", "en", "de", "del", "al",
"a", "ante", "bajo", "desde", "entre", "hacia", "hasta",
"para", "por", "sobre", "tras",
"es", "son", "era", "eran", "fue", "fueron", "ser", "estar",
"ha", "han", "he", "tener", "tiene",
"yo", "su", "sus", "mi", "tu",
"este", "esta", "estos", "estas", "ese", "esa",
"no", "muy", "todo", "todos", "toda",
"que", "cuando", "donde", "como",
"antes", "después", "durante", "desde", "hasta",
}),
}
# ── Year regex ────────────────────────────────────────────────────────────────
_YEAR_RE = re.compile(r"\b(\d{4})\b")
_WORD_RE = re.compile(r"\b[^\W\d_]{3,}\b", re.UNICODE)
def _year_end(d: date) -> date: # ── Step 1 + 2: Person extraction and role detection ─────────────────────────
"""If d is Jan 1, return Dec 31 of the same year (year-only boundary)."""
if d.month == 1 and d.day == 1: def _extract_persons_and_role(
return date(d.year, 12, 31) query: str,
return d lang: str,
) -> tuple[list[str], str]:
"""Return (person_names, role) using the DB-backed PersonMatcher."""
m = _matcher
if m is None or len(m) == 0:
return [], "any"
preps = _ALL_PERSON_PREPS[lang]
stops = preps | _DATE_BEFORE[lang] | _DATE_AFTER[lang] | _DATE_BETWEEN[lang]
matches = m.find_in_query(query, preps, stop_tokens=stops)
person_names = [text for text, _ in matches]
if len(matches) != 1:
return person_names, "any"
_, prep = matches[0]
if prep is None:
return person_names, "any"
if prep in _SENDER_PREPS[lang]:
return person_names, "sender"
if prep in _RECEIVER_PREPS[lang]:
return person_names, "receiver"
return person_names, "any"
def _find_year_spans(doc) -> list: # ── Step 3: Date extraction ───────────────────────────────────────────────────
"""Fallback: find tokens that look like 4-digit years (1000–2999) when NER
produces no DATE entities. Returns a list of single-token pseudo-spans
(spaCy Span objects) labelled 'DATE'."""
spans = []
for token in doc:
if _YEAR_RE.match(token.text):
year = int(token.text)
if 1000 < year < 3000:
span = doc[token.i : token.i + 1]
spans.append(span)
return spans
def _find_years(query: str) -> list[tuple[int, int, int]]:
def extract_dates(doc, lang: str) -> tuple[str | None, str | None]: """Return list of (start, end, year_int) for valid 4-digit year tokens."""
"""Return (date_from, date_to) as ISO strings or None.""" return [
date_spans = [ent for ent in doc.ents if ent.label_ == "DATE"] (m.start(), m.end(), int(m.group()))
for m in _YEAR_RE.finditer(query)
# Fallback: some spaCy small models (de, es) don't tag bare years as DATE if 1000 < int(m.group()) < 3000
if not date_spans:
date_spans = _find_year_spans(doc)
if not date_spans:
return None, None
between_tokens = _DATE_BETWEEN[lang]
before_tokens = _DATE_BEFORE[lang]
after_tokens = _DATE_AFTER[lang]
# "zwischen X und Y" / "between X and Y" — two DATE spans form a range
has_between = any(tok.lower_ in between_tokens for tok in doc)
if has_between and len(date_spans) >= 2:
parsed = []
for span in date_spans[:2]:
d = _parse_date_text(span.text, lang)
if d:
parsed.append(d)
if len(parsed) == 2:
parsed.sort()
return parsed[0].isoformat(), _year_end(parsed[1]).isoformat()
# Single DATE span — use direction token
span = date_spans[0]
d = _parse_date_text(span.text, lang)
if not d:
return None, None
# Check up to 2 tokens before the date span to handle multi-word prepositions
# like Spanish "antes de 1920" where the keyword is 2 tokens back.
prev_tokens = [
doc[span.start - i].lower_
for i in range(1, min(3, span.start + 1))
] ]
if any(t in before_tokens for t in prev_tokens):
return None, _year_end(d).isoformat() def _direction_before_year(
if any(t in after_tokens for t in prev_tokens): query: str,
return d.isoformat(), None year_start: int,
# Bare year/date — closed year-range lang: str,
return d.isoformat(), _year_end(d).isoformat() person_names: list[str],
) -> str:
"""Classify direction of the date span as 'before', 'after', or 'bare'.
Looks at the two tokens immediately preceding the year. If the closer
token is a matched person name part, the direction word belongs to that
person — not to the year — so we return 'bare'.
"""
prefix_words = query[:year_start].split()
if not prefix_words:
return "bare"
person_tokens = {w.lower() for name in person_names for w in name.split()}
recent = [w.lower() for w in prefix_words[-2:]]
before_set = _DATE_BEFORE[lang]
after_set = _DATE_AFTER[lang]
for direction_tok in reversed(recent): # closest first
if direction_tok in before_set:
# Only use this if the word immediately before the year is not a person
if recent[-1] in person_tokens:
return "bare"
return "before"
if direction_tok in after_set:
if recent[-1] in person_tokens:
return "bare"
return "after"
return "bare"
# ── Step 4: Keyword extraction ─────────────────────────────────────────────── def extract_dates(
query: str,
lang: str,
person_names: list[str] | None = None,
) -> tuple[str | None, str | None]:
"""Return (date_from, date_to) as ISO strings or None."""
if person_names is None:
person_names = []
def extract_keywords(doc, excluded_spans: list) -> list[str]: year_spans = _find_years(query)
"""Return lowercased lemmas of content words not inside any NER span.""" if not year_spans:
excluded_indices: set[int] = set() return None, None
for span in excluded_spans:
excluded_indices.update(range(span.start, span.end))
# "zwischen X und Y" / "between X and Y" — two years form a range
query_lower = query.lower()
if any(w in query_lower.split() for w in _DATE_BETWEEN[lang]) and len(year_spans) >= 2:
years = sorted([y for _, _, y in year_spans[:2]])
return date(years[0], 1, 1).isoformat(), date(years[1], 12, 31).isoformat()
start, end, year = year_spans[0]
direction = _direction_before_year(query, start, lang, person_names)
if direction == "before":
return None, date(year, 12, 31).isoformat()
if direction == "after":
return date(year, 1, 1).isoformat(), None
# bare year → closed year range
return date(year, 1, 1).isoformat(), date(year, 12, 31).isoformat()
# ── Step 4: Keyword extraction ────────────────────────────────────────────────
def extract_keywords(
query: str,
lang: str,
person_spans: list[str],
year_strings: list[str],
) -> list[str]:
"""Return lowercased content words after removing persons, years, stopwords."""
text = query
# Remove matched person spans (longest first to avoid partial replacements)
for span in sorted(person_spans, key=len, reverse=True):
text = re.sub(
r"(?<!\w)" + re.escape(span) + r"(?!\w)",
" ",
text,
flags=re.IGNORECASE,
)
# Remove year tokens
for yr in year_strings:
text = re.sub(r"\b" + re.escape(yr) + r"\b", " ", text)
stopwords = _STOPWORDS.get(lang, frozenset())
seen: set[str] = set() seen: set[str] = set()
keywords: list[str] = [] result: list[str] = []
for token in doc:
if token.i in excluded_indices:
continue
if token.pos_ not in ("NOUN", "PROPN"):
continue
if token.is_stop:
continue
lemma = token.lemma_.lower()
if len(lemma) < 3:
continue
if lemma not in seen:
seen.add(lemma)
keywords.append(lemma)
return keywords for tok in _WORD_RE.findall(text):
lower = tok.lower()
if lower in stopwords or lower in seen:
continue
seen.add(lower)
result.append(lower)
return result
# ── Step 5: Assembly ───────────────────────────────────────────────────────── # ── Step 5: Assembly ─────────────────────────────────────────────────────────
def extract(query: str, lang: str) -> ParseResponse: def extract(query: str, lang: str) -> ParseResponse:
"""Run the full NLP pipeline and return a ParseResponse.""" """Run the full rule-based pipeline and return a ParseResponse."""
nlp = get_nlp(lang) person_names, person_role = _extract_persons_and_role(query, lang)
doc = nlp(query) year_strings = [str(y) for _, _, y in _find_years(query)]
date_from, date_to = extract_dates(query, lang, person_names)
per_spans = [ent for ent in doc.ents if ent.label_ == "PER"] keywords = extract_keywords(query, lang, person_names, year_strings)
person_names = extract_person_names(doc)
person_role = detect_person_role(doc, per_spans, lang)
date_from, date_to = extract_dates(doc, lang)
keywords = extract_keywords(doc, list(doc.ents))
return ParseResponse( return ParseResponse(
personNames=person_names, personNames=person_names,
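
The new keyword step can be tried in isolation. The sketch below mirrors the approach in the diff (strip person spans and years, then keep letter-only words of three or more characters that are not stopwords). `STOPWORDS` is a deliberately tiny sample rather than the full list, and the function name `keywords` is hypothetical.

```python
import re

WORD_RE = re.compile(r"\b[^\W\d_]{3,}\b")  # letters only, at least 3 characters
STOPWORDS = frozenset({"von", "vor", "aus", "dem", "der", "die", "das"})  # sample only


def keywords(query: str, persons: list[str], years: list[str]) -> list[str]:
    """Return lowercased content words after removing persons, years, stopwords."""
    text = query
    # Remove longer person spans first so "Clara Cram" goes before "Clara".
    for name in sorted(persons, key=len, reverse=True):
        text = re.sub(
            r"(?<!\w)" + re.escape(name) + r"(?!\w)", " ", text, flags=re.IGNORECASE
        )
    for yr in years:
        text = re.sub(r"\b" + re.escape(yr) + r"\b", " ", text)

    seen: set[str] = set()
    result: list[str] = []
    for tok in WORD_RE.findall(text):
        low = tok.lower()
        if low in STOPWORDS or low in seen:
            continue
        seen.add(low)
        result.append(low)
    return result
```

Unlike the removed spaCy version, which emitted lemmas, this approach returns surface forms, so "Briefe" yields "briefe" rather than "brief".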


@@ -1,19 +1,38 @@
import logging """FastAPI app — /parse and /health endpoints."""
from __future__ import annotations
import os
from contextlib import asynccontextmanager from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException from fastapi import FastAPI, HTTPException
from extractor import extract, load_all_models from extractor import extract, get_person_matcher, set_person_matcher
from models import ParseRequest, ParseResponse from models import ParseRequest, ParseResponse
from person_matcher import PersonMatcher
logger = logging.getLogger(__name__)
def _load_persons_from_db(db_url: str) -> list[tuple[str | None, str | None]]:
import psycopg2 # deferred — not available in test environments without a DB
conn = psycopg2.connect(db_url)
try:
cur = conn.cursor()
cur.execute("SELECT first_name, last_name FROM persons")
return cur.fetchall()
finally:
conn.close()
@asynccontextmanager @asynccontextmanager
async def lifespan(app: FastAPI): async def lifespan(app: FastAPI):
logger.info("Loading spaCy models...") # Only initialise the matcher when nothing was pre-seeded (e.g., by tests).
load_all_models() if get_person_matcher() is None:
logger.info("All models ready.") m = PersonMatcher()
db_url = os.environ.get("DATABASE_URL")
if db_url:
rows = _load_persons_from_db(db_url)
m.load(rows)
set_person_matcher(m)
yield yield
@@ -22,7 +41,8 @@ app = FastAPI(lifespan=lifespan)
@app.get("/health") @app.get("/health")
def health() -> dict: def health() -> dict:
return {"status": "ok"} m = get_person_matcher()
return {"status": "ok", "persons_loaded": len(m) if m else 0}
@app.post("/parse", response_model=ParseResponse) @app.post("/parse", response_model=ParseResponse)
@@ -30,5 +50,4 @@ def parse(request: ParseRequest) -> ParseResponse:
try: try:
return extract(request.query, request.lang) return extract(request.query, request.lang)
except Exception as exc: except Exception as exc:
logger.exception("Extraction pipeline failed")
raise HTTPException(status_code=500, detail=str(exc)) from exc raise HTTPException(status_code=500, detail=str(exc)) from exc
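
The startup logic in `lifespan` reduces to three cases: a pre-seeded matcher is left alone, a missing `DATABASE_URL` yields an empty matcher, and otherwise rows are loaded from the DB. A minimal sketch of that decision follows, with `FakeMatcher`, `init_matcher`, `health_payload` and the `load_rows` callback as hypothetical stand-ins for `PersonMatcher`, the lifespan body, the `/health` handler and `_load_persons_from_db`.

```python
from typing import Callable, Mapping, Optional

Row = tuple[Optional[str], Optional[str]]


class FakeMatcher:
    """Stand-in for PersonMatcher; only stores rows and reports a length."""

    def __init__(self) -> None:
        self.rows: list[Row] = []

    def load(self, rows: list[Row]) -> None:
        self.rows.extend(rows)

    def __len__(self) -> int:
        return len(self.rows)


def init_matcher(
    existing: Optional[FakeMatcher],
    env: Mapping[str, str],
    load_rows: Callable[[str], list[Row]],
) -> FakeMatcher:
    """Return the matcher to use: keep a pre-seeded one, else build a new one."""
    if existing is not None:
        return existing  # e.g. seeded by a test fixture
    m = FakeMatcher()
    db_url = env.get("DATABASE_URL")
    if db_url:
        m.load(load_rows(db_url))
    return m


def health_payload(m: Optional[FakeMatcher]) -> dict:
    """Shape of the /health reply as shown in the diff."""
    return {"status": "ok", "persons_loaded": len(m) if m else 0}
```

Note that in the actual `PersonMatcher`, `len()` counts indexed lowercase name variants rather than DB rows, so `persons_loaded` will usually exceed the number of persons.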


@@ -0,0 +1,164 @@
"""DB-backed person name matcher with fuzzy search."""
from __future__ import annotations
import re
from rapidfuzz import fuzz, process
_PUNCT_RE = re.compile(r"[^\w\s\-]", re.UNICODE)
_YEAR_PAT = re.compile(r"^\d{4}$")
class PersonMatcher:
"""Match person name fragments from free-text queries against known persons.
Loaded once at startup from (first_name, last_name) DB rows. At query
time, scans for tokens following person-indicator prepositions and fuzzy-
matches them against the loaded name variants. Returns the original query
text (not the resolved DB name) so the Java resolveNames() mechanism can
do its own disambiguation.
"""
def __init__(self) -> None:
self._names: list[str] = [] # lowercase name variants
# ── Loading ───────────────────────────────────────────────────────────────
def load(self, rows: list[tuple[str | None, str | None]]) -> None:
"""Populate from DB rows of (first_name, last_name)."""
seen: set[str] = set()
for first, last in rows:
first = (first or "").strip()
last = (last or "").strip()
for variant in _name_variants(first, last):
key = variant.lower()
if key not in seen:
seen.add(key)
self._names.append(key)
def __len__(self) -> int:
return len(self._names)
# ── Query-time matching ───────────────────────────────────────────────────
def find_in_query(
self,
query: str,
prepositions: frozenset[str],
stop_tokens: frozenset[str] | None = None,
threshold: int = 80,
) -> list[tuple[str, str | None]]:
"""Find person name spans in *query*.
Returns a list of ``(original_query_text, anchoring_prep_or_None)``
in left-to-right order.
Parameters
----------
prepositions:
Person-indicator prepositions for the query language (triggers a
scan for the tokens that follow).
stop_tokens:
Tokens that terminate a name span (prepositions + date-direction
words). "de" is a special exception: when immediately followed by
a capitalised word it is treated as a name connector (e.g.
"de Gruyter") rather than a stop.
threshold:
Minimum rapidfuzz token_sort_ratio score to accept a match.
Strategy
--------
Pass 1 — prep-anchored: for each person-indicator preposition found in
the token list, collect up to 3 consecutive non-stop, non-year tokens
after it and fuzzy-match the resulting span against loaded names.
Longest match wins.
Pass 2 — full-name scan: scan positions not yet consumed for exact
multi-word full-name matches (no preposition anchor required).
"""
tokens = query.split()
clean = [_PUNCT_RE.sub("", t) for t in tokens]
lower = [t.lower() for t in clean]
# Prepositions always terminate a name span, even without explicit stop_tokens.
stops = (stop_tokens or frozenset()) | prepositions
consumed: set[int] = set()
hits: list[tuple[int, str, str | None]] = [] # (position, text, prep)
# Pass 1 — prep-anchored
for i, ltok in enumerate(lower):
if ltok not in prepositions or i + 1 >= len(tokens):
continue
# Build candidate span — stop at stop tokens or 4-digit years.
# Exception: "de" before a capitalised word is a name connector.
span_indices: list[int] = []
j = i + 1
while j < len(tokens) and len(span_indices) < 3:
if j in consumed:
break
t = lower[j]
if t in stops or _YEAR_PAT.match(clean[j]):
# Allow "de" when the *next* token starts with a capital —
# e.g. "Walter de Gruyter".
next_clean = clean[j + 1] if j + 1 < len(tokens) else ""
if t == "de" and next_clean[:1].isupper():
pass # connector — keep going
else:
break
span_indices.append(j)
j += 1
# Try longest match first, then shorter spans
for span_len in range(len(span_indices), 0, -1):
idx = span_indices[:span_len]
span_lower = " ".join(lower[k] for k in idx)
if self._is_match(span_lower, threshold):
hits.append((idx[0], " ".join(tokens[k] for k in idx), ltok))
consumed.update(idx)
break
# Pass 2 — full multi-word name scan (exact only, no preposition needed)
for span_len in (3, 2):
for i in range(len(tokens) - span_len + 1):
span_idx = range(i, i + span_len)
if any(j in consumed for j in span_idx):
continue
span_lower = " ".join(lower[i : i + span_len])
if span_lower in self._names:
hits.append((i, " ".join(tokens[i : i + span_len]), None))
consumed.update(span_idx)
hits.sort(key=lambda h: h[0])
return [(text, prep) for _, text, prep in hits]
# ── Internal helpers ──────────────────────────────────────────────────────
def _is_match(self, text: str, threshold: int) -> bool:
"""Return True if *text* fuzzy-matches any loaded name at >= threshold."""
if not self._names or len(text.strip()) < 3:
return False
text_lower = text.strip().lower()
if text_lower in self._names:
return True # exact match — fast path
result = process.extractOne(
text_lower,
self._names,
scorer=fuzz.token_sort_ratio,
score_cutoff=threshold,
)
return result is not None
# ── helpers ───────────────────────────────────────────────────────────────────
def _name_variants(first: str, last: str) -> list[str]:
"""Return the name variants to index for a single person."""
variants = []
if first and last:
variants.append(f"{first} {last}")
if first:
variants.append(first)
if last:
variants.append(last)
return variants
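
The span-building rule in pass 1, including the "de" connector exception, can be sketched on its own. This is a simplified illustration under stated assumptions: `candidate_span` is a hypothetical helper, the stop sets in the usage note are small samples, and the fuzzy-matching step that follows in the real class is omitted.

```python
import re

YEAR_PAT = re.compile(r"^\d{4}$")
PUNCT_RE = re.compile(r"[^\w\s\-]")


def candidate_span(
    tokens: list[str], start: int, stops: frozenset[str], max_len: int = 3
) -> list[str]:
    """Collect up to max_len name tokens beginning at index start."""
    clean = [PUNCT_RE.sub("", t) for t in tokens]
    span: list[str] = []
    j = start
    while j < len(tokens) and len(span) < max_len:
        low = clean[j].lower()
        if low in stops or YEAR_PAT.match(clean[j]):
            nxt = clean[j + 1] if j + 1 < len(tokens) else ""
            # "de" directly before a capitalised word is a name connector
            # ("Walter de Gruyter"); anything else ends the span.
            if not (low == "de" and nxt[:1].isupper()):
                break
        span.append(tokens[j])
        j += 1
    return span
```

With German sample stops, the token after "von" in "Briefe von Herbert an Eugenie de Gruyter nach 1914" yields only "Herbert", because "an" ends the span. This is how the merged-span case from the commit message is avoided. With Spanish sample stops that include "de", "Walter de Gruyter" is still collected whole, while "antes de 1900" yields nothing after "antes".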


@@ -1,6 +1,7 @@
fastapi[standard]==0.115.6 fastapi[standard]==0.115.6
uvicorn[standard]==0.34.0 uvicorn[standard]==0.34.0
spacy>=3.8,<4.0
dateparser>=1.2,<2.0 dateparser>=1.2,<2.0
rapidfuzz>=3.0,<4.0
psycopg2-binary>=2.9,<3.0
pytest>=8.0,<9.0 pytest>=8.0,<9.0
httpx>=0.28,<1.0 httpx>=0.28,<1.0


@@ -1,350 +1,337 @@
"""Tests for the rule-based extractor and PersonMatcher."""
import pytest import pytest
from pydantic import ValidationError
from extractor import extract, extract_dates, extract_keywords, set_person_matcher
# ── Models ────────────────────────────────────────────────────────────────── from person_matcher import PersonMatcher
def test_parse_request_valid(): # ── Shared test fixture ───────────────────────────────────────────────────────
from models import ParseRequest
req = ParseRequest(query="Briefe von Opa", lang="de") _TEST_PERSONS = [
assert req.query == "Briefe von Opa" ("Clara", "Cram"),
assert req.lang == "de" ("Herbert", "Cram"),
("Eugenie", "de Gruyter"),
("Walter", "de Gruyter"),
def test_parse_request_rejects_unknown_lang(): ("Marie", "Cram"),
from models import ParseRequest ("Juan", "Cram"),
with pytest.raises(ValidationError): ("Hilde", "de Gruyter"),
ParseRequest(query="Letters from grandpa", lang="fr") ("Hans", "de Gruyter"),
("Albert", "de Gruyter"),
("Anita", "Wöhler"),
def test_parse_response_serializes_nulls(): ("Else", "Bohrmann"),
from models import ParseResponse ("Lili", "Duvenbeck"),
resp = ParseResponse( ]
personNames=["Opa"],
personRole="sender",
dateFrom=None, @pytest.fixture(scope="session", autouse=True)
dateTo="1920-12-31", def seeded_matcher():
keywords=["brief"], """Load test persons into the global matcher before any test runs."""
rawQuery="Briefe von Opa", m = PersonMatcher()
) m.load(_TEST_PERSONS)
data = resp.model_dump() set_person_matcher(m)
assert data["dateFrom"] is None return m
assert data["dateTo"] == "1920-12-31"
assert data["personRole"] == "sender"
# ── PersonMatcher unit tests ──────────────────────────────────────────────────
# ── Model loading ──────────────────────────────────────────────────────────── class TestPersonMatcher:
DE_PREPS = frozenset({"von", "vom", "an", "nach", "für"})
@pytest.fixture(scope="session")
def nlp_de(): def test_load_populates_names(self, seeded_matcher):
from extractor import get_nlp assert len(seeded_matcher) > 0
return get_nlp("de")
def test_exact_full_name_match(self, seeded_matcher):
hits = seeded_matcher.find_in_query("Briefe von Clara Cram", self.DE_PREPS)
@pytest.fixture(scope="session") assert hits == [("Clara Cram", "von")]
def nlp_en():
from extractor import get_nlp def test_exact_first_name_only(self, seeded_matcher):
return get_nlp("en") hits = seeded_matcher.find_in_query("Briefe von Eugenie", self.DE_PREPS)
assert hits == [("Eugenie", "von")]
@pytest.fixture(scope="session") def test_exact_first_name_receiver(self, seeded_matcher):
def nlp_es(): hits = seeded_matcher.find_in_query("Briefe an Herbert", self.DE_PREPS)
from extractor import get_nlp assert hits == [("Herbert", "an")]
return get_nlp("es")
def test_fuzzy_typo(self, seeded_matcher):
hits = seeded_matcher.find_in_query("Briefe von Herrbert Cram", self.DE_PREPS)
def test_get_nlp_de_loads(nlp_de): assert len(hits) == 1
doc = nlp_de("Test") assert hits[0][1] == "von"
assert doc is not None
def test_two_persons_extracted(self, seeded_matcher):
hits = seeded_matcher.find_in_query(
def test_get_nlp_en_loads(nlp_en): "Briefe von Clara Cram an Herbert Cram", self.DE_PREPS
doc = nlp_en("Test") )
assert doc is not None assert len(hits) == 2
assert hits[0][0] == "Clara Cram"
assert hits[0][1] == "von"
def test_get_nlp_es_loads(nlp_es): assert hits[1][0] == "Herbert Cram"
doc = nlp_es("Prueba") assert hits[1][1] == "an"
assert doc is not None
def test_no_match_for_place_name(self, seeded_matcher):
hits = seeded_matcher.find_in_query("Reise nach Mexiko", self.DE_PREPS)
def test_get_nlp_unknown_lang_raises(): assert hits == []
from extractor import get_nlp
with pytest.raises(ValueError, match="Unsupported language"): def test_no_match_for_topic_word(self, seeded_matcher):
get_nlp("fr") hits = seeded_matcher.find_in_query("Briefe aus dem Krieg", self.DE_PREPS)
assert hits == []
# ── Person name extraction ─────────────────────────────────────────────────── def test_first_name_eugenie_regression(self, seeded_matcher):
# spaCy NER missed standalone first names
def _make_doc_with_ents(nlp, text: str, char_ents: list[tuple[int, int, str]]): hits = seeded_matcher.find_in_query("Briefe von Eugenie", self.DE_PREPS)
"""Create a Doc with manually injected entity spans (no NER model needed).""" assert len(hits) == 1
doc = nlp.make_doc(text)
spans = [doc.char_span(s, e, label=lbl) for s, e, lbl in char_ents] def test_merged_names_regression(self, seeded_matcher):
doc.ents = [sp for sp in spans if sp is not None] # spaCy NER merged "Herbert an Eugenie de Gruyter" into one PER span
return doc hits = seeded_matcher.find_in_query(
"Briefe von Herbert an Eugenie de Gruyter nach 1914", self.DE_PREPS
)
def test_extract_person_names_two_persons(nlp_de): assert len(hits) == 2
from extractor import extract_person_names names = [h[0] for h in hits]
# "Briefe von Opa Hermann an Marie" assert "Herbert" in names
# "Opa Hermann" = chars 11..22, "Marie" = chars 26..31 assert "Eugenie de Gruyter" in names
doc = _make_doc_with_ents(nlp_de, "Briefe von Opa Hermann an Marie", [
(11, 22, "PER"), def test_english_preps(self, seeded_matcher):
(26, 31, "PER"), en_preps = frozenset({"from", "by", "to", "for"})
]) hits = seeded_matcher.find_in_query(
assert extract_person_names(doc) == ["Opa Hermann", "Marie"] "Letters from Clara Cram to Walter de Gruyter in 1920", en_preps
)
assert len(hits) == 2
def test_extract_person_names_preserves_order(nlp_de): assert hits[0][0] == "Clara Cram"
from extractor import extract_person_names assert hits[1][0] == "Walter de Gruyter"
# "Marie von Opa" — Marie comes first in text
# "Marie" = 0..5, "Opa" = 10..13 def test_double_preposition_de(self, seeded_matcher):
doc = _make_doc_with_ents(nlp_de, "Marie von Opa", [ hits = seeded_matcher.find_in_query(
(0, 5, "PER"), "Briefe von Clara nach Herbert", self.DE_PREPS
(10, 13, "PER"), )
]) assert len(hits) == 2
assert extract_person_names(doc) == ["Marie", "Opa"] names = [h[0] for h in hits]
assert "Clara" in names
assert "Herbert" in names
def test_extract_person_names_empty(nlp_de):
from extractor import extract_person_names
doc = _make_doc_with_ents(nlp_de, "Briefe aus dem Krieg", []) # ── Date extraction tests ─────────────────────────────────────────────────────
assert extract_person_names(doc) == []
class TestExtractDates:
def test_bare_year_gives_range(self):
def test_extract_person_names_ignores_non_per(nlp_de): assert extract_dates("Briefe 1920", "de") == ("1920-01-01", "1920-12-31")
from extractor import extract_person_names
# DATE entity should not appear in personNames def test_im_jahr(self):
doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [(7, 11, "DATE")]) assert extract_dates("Schriften im Jahr 1905", "de") == (
assert extract_person_names(doc) == [] "1905-01-01", "1905-12-31"
)
-# ── Role detection ───────────────────────────────────────────────────────────
-
-def test_role_sender_von(nlp_de):
-    from extractor import detect_person_role
-    # "Briefe von Marie" — "von" immediately before "Marie"
-    # "Marie" = chars 11..16
-    doc = _make_doc_with_ents(nlp_de, "Briefe von Marie", [(11, 16, "PER")])
-    per_spans = list(doc.ents)
-    assert detect_person_role(doc, per_spans, "de") == "sender"
-
-
-def test_role_receiver_an(nlp_de):
-    from extractor import detect_person_role
-    # "Briefe an Marie" — "an" immediately before "Marie"
-    # "Marie" = chars 10..15
-    doc = _make_doc_with_ents(nlp_de, "Briefe an Marie", [(10, 15, "PER")])
-    per_spans = list(doc.ents)
-    assert detect_person_role(doc, per_spans, "de") == "receiver"
-
-
-def test_role_two_persons_returns_any(nlp_de):
-    from extractor import detect_person_role
-    # "von Opa an Marie" — two PER spans → always "any"
-    # "Opa" = chars 4..7, "Marie" = chars 11..16
-    doc = _make_doc_with_ents(nlp_de, "von Opa an Marie", [
-        (4, 7, "PER"),
-        (11, 16, "PER"),
-    ])
-    per_spans = list(doc.ents)
-    assert detect_person_role(doc, per_spans, "de") == "any"
-
-
-def test_role_no_prep_returns_any(nlp_de):
-    from extractor import detect_person_role
-    # "Briefe Marie" — no preposition
-    # "Marie" = chars 7..12
-    doc = _make_doc_with_ents(nlp_de, "Briefe Marie", [(7, 12, "PER")])
-    per_spans = list(doc.ents)
-    assert detect_person_role(doc, per_spans, "de") == "any"
-
-
-def test_role_empty_returns_any(nlp_de):
-    from extractor import detect_person_role
-    doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [])
-    assert detect_person_role(doc, [], "de") == "any"
-
-
-def test_role_sender_from_english(nlp_en):
-    from extractor import detect_person_role
-    # "letters from Marie" — "from" before "Marie"
-    # "Marie" = chars 13..18
-    doc = _make_doc_with_ents(nlp_en, "letters from Marie", [(13, 18, "PER")])
-    per_spans = list(doc.ents)
-    assert detect_person_role(doc, per_spans, "en") == "sender"
-
-
-def test_role_receiver_to_english(nlp_en):
-    from extractor import detect_person_role
-    # "letters to Marie" — "to" before "Marie"
-    # "letters" = 0..7, " " = 7, "to" = 8..10, " " = 10, "Marie" = 11..16
-    doc = _make_doc_with_ents(nlp_en, "letters to Marie", [(11, 16, "PER")])
-    per_spans = list(doc.ents)
-    assert detect_person_role(doc, per_spans, "en") == "receiver"
+    def test_vor_year(self):
+        assert extract_dates("Briefe vor 1920", "de") == (None, "1920-12-31")
+
+    def test_nach_year(self):
+        assert extract_dates("Schriften nach 1920", "de") == ("1920-01-01", None)
+
+    def test_zwischen(self):
+        assert extract_dates("Dokumente zwischen 1914 und 1918", "de") == (
+            "1914-01-01", "1918-12-31"
+        )
+
+    def test_before_en(self):
+        assert extract_dates("Letters before 1918", "en") == (None, "1918-12-31")
+
+    def test_after_en(self):
+        assert extract_dates("Letters after 1939", "en") == ("1939-01-01", None)
+
+    def test_between_en(self):
+        assert extract_dates("Letters between 1914 and 1918", "en") == (
+            "1914-01-01", "1918-12-31"
+        )
+
+    def test_antes_de_es(self):
+        assert extract_dates("Cartas antes de 1900", "es") == (None, "1900-12-31")
+
+    def test_entre_es(self):
+        assert extract_dates("entre 1915 y 1920", "es") == (
+            "1915-01-01", "1920-12-31"
+        )
+
+    def test_no_year(self):
+        assert extract_dates("Briefe aus dem Krieg", "de") == (None, None)
+
+    def test_nach_before_person_then_year(self):
+        # "nach Marie 1920" — "nach" belongs to person, not date
+        date_from, date_to = extract_dates("Briefe nach Marie 1920", "de", ["Marie"])
+        assert date_from == "1920-01-01"
+        assert date_to == "1920-12-31"
+
+    def test_bare_year_alone(self):
+        assert extract_dates("1918", "de") == ("1918-01-01", "1918-12-31")
+
+
+# ── Keyword extraction tests ──────────────────────────────────────────────────
+
+class TestExtractKeywords:
+    def test_basic_topic_words(self):
+        kws = extract_keywords("Briefe aus dem Krieg", "de", [], [])
+        assert "krieg" in kws
+
+    def test_stopwords_excluded(self):
+        kws = extract_keywords("von der nach dem aus", "de", [], [])
+        for sw in ("von", "der", "nach", "dem", "aus"):
+            assert sw not in kws
+
+    def test_person_spans_excluded(self):
+        kws = extract_keywords(
+            "Briefe von Clara Cram nach Herbert", "de",
+            ["Clara Cram", "Herbert"], []
+        )
+        assert "clara" not in kws
+        assert "cram" not in kws
+        assert "herbert" not in kws
-
-
-# ── Date parsing ─────────────────────────────────────────────────────────────
-
-def test_date_vor_1920(nlp_de):
-    from extractor import extract_dates
-    # "Briefe vor 1920" — "1920" at chars 11..15
-    doc = _make_doc_with_ents(nlp_de, "Briefe vor 1920", [(11, 15, "DATE")])
-    date_from, date_to = extract_dates(doc, "de")
-    assert date_from is None
-    assert date_to == "1920-12-31"
-
-
-def test_date_nach_1900(nlp_de):
-    from extractor import extract_dates
-    # "Briefe nach 1900" — "1900" at chars 12..16
-    doc = _make_doc_with_ents(nlp_de, "Briefe nach 1900", [(12, 16, "DATE")])
-    date_from, date_to = extract_dates(doc, "de")
-    assert date_from == "1900-01-01"
-    assert date_to is None
-
-
-def test_date_zwischen_1900_und_1920(nlp_de):
-    from extractor import extract_dates
-    # "zwischen 1900 und 1920"
-    # "1900" = chars 9..13, "1920" = chars 18..22
-    doc = _make_doc_with_ents(nlp_de, "zwischen 1900 und 1920", [
-        (9, 13, "DATE"),
-        (18, 22, "DATE"),
-    ])
-    date_from, date_to = extract_dates(doc, "de")
-    assert date_from == "1900-01-01"
-    assert date_to == "1920-12-31"
-
-
-def test_date_bare_year_makes_range(nlp_de):
-    from extractor import extract_dates
-    # "Briefe 1920" — no direction token → year-range
-    # "1920" = chars 7..11
-    doc = _make_doc_with_ents(nlp_de, "Briefe 1920", [(7, 11, "DATE")])
-    date_from, date_to = extract_dates(doc, "de")
-    assert date_from == "1920-01-01"
-    assert date_to == "1920-12-31"
-
-
-def test_date_no_date_entity(nlp_de):
-    from extractor import extract_dates
-    doc = _make_doc_with_ents(nlp_de, "Briefe von Opa", [])
-    date_from, date_to = extract_dates(doc, "de")
-    assert date_from is None
-    assert date_to is None
-
-
-def test_date_before_english(nlp_en):
-    from extractor import extract_dates
-    # "letters before 1920" — "1920" at chars 15..19
-    doc = _make_doc_with_ents(nlp_en, "letters before 1920", [(15, 19, "DATE")])
-    date_from, date_to = extract_dates(doc, "en")
-    assert date_from is None
-    assert date_to == "1920-12-31"
-
-
-def test_date_after_english(nlp_en):
-    from extractor import extract_dates
-    # "letters after 1900" — "1900" at chars 14..18
-    doc = _make_doc_with_ents(nlp_en, "letters after 1900", [(14, 18, "DATE")])
-    date_from, date_to = extract_dates(doc, "en")
-    assert date_from == "1900-01-01"
-    assert date_to is None
-
-
+
+    def test_years_excluded(self):
+        kws = extract_keywords("Schriften 1920 über Reise", "de", [], ["1920"])
+        assert "1920" not in kws
+
+    def test_deduplication(self):
+        kws = extract_keywords("Krieg Krieg Krieg", "de", [], [])
+        assert kws.count("krieg") == 1
+
+    def test_en_stopwords(self):
+        kws = extract_keywords("Letters about the war", "en", [], [])
+        assert "the" not in kws
+        assert "war" in kws
+
+    def test_short_words_excluded(self):
+        kws = extract_keywords("ab cd ef xy", "de", [], [])
+        assert all(len(k) >= 3 for k in kws)
+
+
+# ── Full pipeline integration tests ──────────────────────────────────────────
+
+class TestExtract:
+    def test_full_sentence_de(self):
+        r = extract("Briefe von Clara Cram an Walter de Gruyter im Jahr 1920", "de")
+        assert "Clara Cram" in r.personNames
+        assert "Walter de Gruyter" in r.personNames
+        assert r.personRole == "any"
+        assert r.dateFrom == "1920-01-01"
+        assert r.dateTo == "1920-12-31"
+
+    def test_sender_role_de(self):
+        r = extract("Briefe von Clara Cram vor 1910", "de")
+        assert r.personNames == ["Clara Cram"]
+        assert r.personRole == "sender"
+        assert r.dateTo == "1910-12-31"
+        assert r.dateFrom is None
+
+    def test_receiver_role_de(self):
+        r = extract("Briefe an Walter de Gruyter", "de")
+        assert r.personNames == ["Walter de Gruyter"]
+        assert r.personRole == "receiver"
+
+    def test_first_name_only_eugenie(self):
+        r = extract("Briefe von Eugenie", "de")
+        assert "Eugenie" in r.personNames
+        assert r.personRole == "sender"
+
+    def test_first_name_only_herbert(self):
+        r = extract("Kriegsbriefe von Herbert", "de")
+        assert "Herbert" in r.personNames
+
+    def test_merged_names_bug_fixed(self):
+        r = extract("Briefe von Herbert an Eugenie de Gruyter nach 1914", "de")
+        assert "Herbert" in r.personNames
+        assert "Eugenie de Gruyter" in r.personNames
+        assert r.dateFrom == "1914-01-01"
+
+    def test_topic_only_krieg(self):
+        r = extract("Briefe aus dem Krieg", "de")
+        assert r.personNames == []
+        assert "krieg" in r.keywords
+
+    def test_topic_only_single_word(self):
+        r = extract("Kriegspost", "de")
+        assert r.personNames == []
+
+    def test_date_range_only(self):
+        r = extract("Dokumente zwischen 1914 und 1918", "de")
+        assert r.personNames == []
+        assert r.dateFrom == "1914-01-01"
+        assert r.dateTo == "1918-12-31"
-# ── Keyword extraction ───────────────────────────────────────────────────────
-
-def test_keywords_extracts_nouns(nlp_de):
-    from extractor import extract_keywords
-    # Use real NLP for POS tags; disable NER to avoid interference
-    doc = nlp_de("Briefe aus dem Krieg", disable=["ner"])
-    keywords = extract_keywords(doc, [])
-    # "Brief" (NOUN) and "Krieg" (NOUN) should appear as lemmas
-    assert "brief" in keywords
-    assert "krieg" in keywords
-
-
-def test_keywords_excludes_stopwords(nlp_de):
-    from extractor import extract_keywords
-    doc = nlp_de("Briefe aus dem Krieg", disable=["ner"])
-    keywords = extract_keywords(doc, [])
-    # "dem" is a stopword article — must not appear
-    assert "dem" not in keywords
-
-
-def test_keywords_excludes_per_ner_spans(nlp_de):
-    from extractor import extract_keywords
-    # Run full NLP for POS tags, then inject a PER span over "Hermann"
-    # "Briefe von Hermann": B=0..6, ' '=6, v=7..10, ' '=10, H=11..18
-    doc = nlp_de("Briefe von Hermann")
-    per_span = doc.char_span(11, 18, label="PER")
-    if per_span:
-        doc.ents = [per_span]
-    keywords = extract_keywords(doc, list(doc.ents))
-    assert "hermann" not in keywords
-
-
-def test_keywords_excludes_short_lemmas(nlp_de):
-    from extractor import extract_keywords
-    doc = nlp_de("Briefe an ihn", disable=["ner"])
-    keywords = extract_keywords(doc, [])
-    # "ihn" is 3 chars but is a stopword pronoun; "an" is 2 chars
-    assert "an" not in keywords
-
-
-def test_keywords_deduplicates(nlp_de):
-    from extractor import extract_keywords
-    doc = nlp_de("Brief Brief Krieg", disable=["ner"])
-    keywords = extract_keywords(doc, [])
-    assert keywords.count("brief") == 1
+
+    def test_colloquial_von(self):
+        r = extract("von Clara", "de")
+        assert r.personNames == ["Clara"]
+        assert r.personRole == "sender"
+
+    def test_colloquial_an(self):
+        r = extract("an Walter", "de")
+        assert r.personNames == ["Walter"]
+        assert r.personRole == "receiver"
+
+    def test_bare_year_alone(self):
+        r = extract("1918", "de")
+        assert r.dateFrom == "1918-01-01"
+        assert r.dateTo == "1918-12-31"
+        assert r.personNames == []
+
+    def test_english_full_sentence(self):
+        r = extract("Letters from Clara Cram to Walter de Gruyter in 1920", "en")
+        assert "Clara Cram" in r.personNames
+        assert "Walter de Gruyter" in r.personNames
+        assert r.dateFrom == "1920-01-01"
+
+    def test_english_receiver_with_date(self):
+        r = extract("Letters to Herbert Cram after 1939", "en")
+        assert "Herbert Cram" in r.personNames
+        assert r.personRole == "receiver"
+        assert r.dateFrom == "1939-01-01"
+
+    def test_english_birthday(self):
+        r = extract("Birthday greetings from Anita Wöhler", "en")
+        assert "Anita Wöhler" in r.personNames
+        assert r.personRole == "sender"
+
+    def test_english_between_dates(self):
+        r = extract("Letters between 1914 and 1918", "en")
+        assert r.dateFrom == "1914-01-01"
+        assert r.dateTo == "1918-12-31"
+
+    def test_spanish_full_sentence(self):
+        r = extract("Cartas de Clara Cram a Walter de Gruyter en 1920", "es")
+        assert "Clara Cram" in r.personNames
+        assert "Walter de Gruyter" in r.personNames
+        assert r.dateFrom == "1920-01-01"
+
-
-
-# ── Full extract() pipeline ──────────────────────────────────────────────────
-
-def test_extract_dates_de():
-    from extractor import extract
-    result = extract("Briefe vor 1920", "de")
-    assert result.dateFrom is None
-    assert result.dateTo == "1920-12-31"
-    assert result.rawQuery == "Briefe vor 1920"
-    assert result.personNames == []
-    assert result.personRole == "any"
-
-
-def test_extract_keywords_from_topic_de():
-    from extractor import extract
-    result = extract("Briefe aus dem Krieg", "de")
-    assert "krieg" in result.keywords
-    assert result.dateFrom is None
-    assert result.dateTo is None
-
-
-def test_extract_dates_en():
-    from extractor import extract
-    result = extract("letters before 1920", "en")
-    assert result.dateTo == "1920-12-31"
-    assert result.dateFrom is None
-
-
-def test_extract_dates_es():
-    from extractor import extract
-    result = extract("cartas antes de 1920", "es")
-    assert result.dateTo == "1920-12-31"
-    assert result.dateFrom is None
-
-
-def test_extract_rawquery_echoed():
-    from extractor import extract
-    q = "Texte über Weihnachten"
-    result = extract(q, "de")
-    assert result.rawQuery == q
-
-
-def test_extract_response_fields_are_complete():
-    from extractor import extract
-    result = extract("Briefe 1900", "de")
-    assert isinstance(result.personNames, list)
-    assert result.personRole in ("sender", "receiver", "any")
-    assert isinstance(result.keywords, list)
-    assert result.rawQuery == "Briefe 1900"
+    def test_spanish_before(self):
+        r = extract("Cartas antes de 1900", "es")
+        assert r.dateTo == "1900-12-31"
+        assert r.dateFrom is None
+
+    def test_rawquery_echoed(self):
+        q = "test query"
+        r = extract(q, "de")
+        assert r.rawQuery == q
+
+    def test_false_positive_compound_noun_regression(self):
+        # spaCy tagged "Geburtstagsglückwünsche" as a PER entity
+        r = extract("Geburtstagsglückwünsche", "de")
+        assert r.personNames == []
+
+    def test_question_phrasing(self):
+        r = extract("Wer hat an Herbert Cram 1918 geschrieben?", "de")
+        assert "Herbert Cram" in r.personNames
+        assert r.personRole == "receiver"
+        assert r.dateFrom == "1918-01-01"
+
+    def test_lowercase_query(self):
+        r = extract("briefe von clara cram an herbert 1920", "de")
+        # Should still find persons despite lowercase
+        assert len(r.personNames) >= 1
+
+    def test_empty_matcher_returns_no_persons(self):
+        # Temporarily use an empty matcher
+        from extractor import set_person_matcher
+        empty = PersonMatcher()
+        set_person_matcher(empty)
+        r = extract("Briefe von Clara Cram", "de")
+        assert r.personNames == []
+        # Restore seeded matcher
+        m = PersonMatcher()
+        m.load(_TEST_PERSONS)
+        set_person_matcher(m)

@@ -1,43 +1,72 @@
+"""Integration tests for the FastAPI app."""
 import pytest
 from fastapi.testclient import TestClient

+from extractor import set_person_matcher
+from person_matcher import PersonMatcher
+
+_TEST_PERSONS = [
+    ("Clara", "Cram"),
+    ("Herbert", "Cram"),
+    ("Eugenie", "de Gruyter"),
+    ("Walter", "de Gruyter"),
+    ("Marie", "Cram"),
+    ("Anita", "Wöhler"),
+]
+

 @pytest.fixture(scope="session")
 def client():
+    # Pre-seed the matcher so the lifespan doesn't overwrite it with an empty one.
+    m = PersonMatcher()
+    m.load(_TEST_PERSONS)
+    set_person_matcher(m)
     from main import app
     with TestClient(app) as c:
         yield c


 def test_health(client):
-    response = client.get("/health")
-    assert response.status_code == 200
-    assert response.json() == {"status": "ok"}
+    r = client.get("/health")
+    assert r.status_code == 200
+    assert r.json()["status"] == "ok"
+    assert r.json()["persons_loaded"] > 0


 def test_parse_returns_200_with_all_fields(client):
-    response = client.post("/parse", json={"query": "Briefe vor 1920", "lang": "de"})
-    assert response.status_code == 200
-    data = response.json()
-    assert "personNames" in data
-    assert "personRole" in data
-    assert data["personRole"] in ("sender", "receiver", "any")
-    assert "dateFrom" in data
-    assert "dateTo" in data
-    assert "keywords" in data
-    assert "rawQuery" in data
-    assert data["rawQuery"] == "Briefe vor 1920"
-    assert data["dateTo"] == "1920-12-31"
+    r = client.post("/parse", json={"query": "Briefe vor 1920", "lang": "de"})
+    assert r.status_code == 200
+    d = r.json()
+    assert "personNames" in d
+    assert d["personRole"] in ("sender", "receiver", "any")
+    assert "dateFrom" in d
+    assert "dateTo" in d
+    assert "keywords" in d
+    assert d["rawQuery"] == "Briefe vor 1920"
+    assert d["dateTo"] == "1920-12-31"
+
+
+def test_parse_person_with_date(client):
+    r = client.post(
+        "/parse",
+        json={"query": "Briefe von Clara Cram an Walter de Gruyter im Jahr 1920", "lang": "de"},
+    )
+    assert r.status_code == 200
+    d = r.json()
+    assert "Clara Cram" in d["personNames"]
+    assert "Walter de Gruyter" in d["personNames"]
+    assert d["dateFrom"] == "1920-01-01"
+    assert d["dateTo"] == "1920-12-31"


 def test_parse_unknown_lang_returns_422(client):
-    response = client.post("/parse", json={"query": "test", "lang": "fr"})
-    assert response.status_code == 422
+    r = client.post("/parse", json={"query": "test", "lang": "fr"})
+    assert r.status_code == 422


 def test_parse_missing_query_returns_422(client):
-    response = client.post("/parse", json={"lang": "de"})
-    assert response.status_code == 422
+    r = client.post("/parse", json={"lang": "de"})
+    assert r.status_code == 422


 def test_parse_all_languages(client):
@@ -47,6 +76,6 @@ def test_parse_all_languages(client):
         ("es", "cartas antes de 1920"),
     ]
     for lang, query in cases:
-        response = client.post("/parse", json={"query": query, "lang": lang})
-        assert response.status_code == 200, f"Failed for lang={lang}"
-        assert response.json()["dateTo"] == "1920-12-31", f"Wrong dateTo for lang={lang}"
+        r = client.post("/parse", json={"query": query, "lang": lang})
+        assert r.status_code == 200, f"Failed for lang={lang}"
+        assert r.json()["dateTo"] == "1920-12-31", f"Wrong dateTo for lang={lang}"

@@ -0,0 +1,126 @@
# NLP Service — Test Sentences
Real data drawn from the Familienarchiv DB (2026-06-07).
Top persons: Clara Cram, Herbert Cram, Eugenie de Gruyter, Walter de Gruyter, Marie Cram,
Juan Cram, Albert de Gruyter, Hilde de Gruyter, Else Bohrmann, Anita Wöhler, Lili Duvenbeck.
Date range: ~1895–1945. Key tags: Krieg, Hochzeit, Reise, Geburtstag, Tod, Alltag, Briefwechsel.
---
## German — full sentences
```json
{"query": "Briefe von Clara Cram an Walter de Gruyter im Jahr 1920", "lang": "de"}
{"query": "Briefe von Herbert an Eugenie de Gruyter nach 1914", "lang": "de"}
{"query": "Schreiben von Albert de Gruyter an seine Kinder vor 1900", "lang": "de"}
{"query": "Briefe von Juan Cram an Marie zwischen 1915 und 1918", "lang": "de"}
{"query": "Telegramm von Walter de Gruyter an Clara im Jahr 1930", "lang": "de"}
{"query": "Briefe von Else Bohrmann an Herbert Cram nach 1939", "lang": "de"}
```
## German — medium (person + date, no strong role signal)
```json
{"query": "Briefe von Clara Cram vor 1910", "lang": "de"}
{"query": "Dokumente über Walter de Gruyter aus den 1920er Jahren", "lang": "de"}
{"query": "Briefe an Herbert Cram nach dem Krieg", "lang": "de"}
{"query": "Schriften von Eugenie de Gruyter im Jahr 1905", "lang": "de"}
```
## German — short (person only)
```json
{"query": "Briefe an Walter de Gruyter", "lang": "de"}
{"query": "Dokumente über Clara Cram", "lang": "de"}
{"query": "Herbert Cram", "lang": "de"}
{"query": "Anita Wöhler", "lang": "de"}
```
## German — topic only (keywords → tag resolution on Java side)
```json
{"query": "Briefe aus dem Krieg", "lang": "de"}
{"query": "Kriegspost", "lang": "de"}
{"query": "Hochzeitsbriefe", "lang": "de"}
{"query": "Reisebriefe", "lang": "de"}
{"query": "Geburtstagsglückwünsche", "lang": "de"}
{"query": "Briefe über die Hochzeitsreise", "lang": "de"}
{"query": "Kinderbriefe", "lang": "de"}
{"query": "Familienbriefe aus dem Alltag", "lang": "de"}
{"query": "Brautbriefe", "lang": "de"}
{"query": "Kondolenzbriefe nach dem Tod von Eugenie", "lang": "de"}
```
## German — date range only
```json
{"query": "Briefe aus dem Ersten Weltkrieg", "lang": "de"}
{"query": "Dokumente zwischen 1914 und 1918", "lang": "de"}
{"query": "Briefe vor 1900", "lang": "de"}
{"query": "Schriften nach 1920", "lang": "de"}
```
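
For the numeric date queries above, the test suite expects a simple mapping: a bare year expands to the whole year, `vor` / `before` / `antes de` leave the start open, `nach` / `after` leave the end open, and `zwischen ... und` sets both ends. The sketch below illustrates that mapping only. The helper name `year_bounds` and the keyword tables are hypothetical, and the service's actual `extract_dates` regexes may differ.

```python
import re
from typing import Optional, Tuple

# Hypothetical keyword tables (assumptions for illustration, not the service's real patterns).
_BEFORE = {"de": r"vor", "en": r"before", "es": r"antes\s+de"}
_AFTER = {"de": r"nach", "en": r"after", "es": r"despu[eé]s\s+de"}
_BETWEEN = {"de": (r"zwischen", r"und"), "en": (r"between", r"and"), "es": (r"entre", r"y")}
_YEAR = r"(1[5-9]\d{2}|20\d{2})"  # four-digit years 1500-2099


def year_bounds(query: str, lang: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (date_from, date_to) as ISO dates; None marks an open end."""
    start, joiner = _BETWEEN[lang]
    # Closed range: "zwischen 1914 und 1918"
    m = re.search(rf"\b{start}\s+{_YEAR}\s+{joiner}\s+{_YEAR}\b", query, re.IGNORECASE)
    if m:
        return f"{m.group(1)}-01-01", f"{m.group(2)}-12-31"
    # Open start: the preposition must be directly followed by the year
    m = re.search(rf"\b{_BEFORE[lang]}\s+{_YEAR}\b", query, re.IGNORECASE)
    if m:
        return None, f"{m.group(1)}-12-31"
    # Open end: "nach Marie 1920" does not match here because a name intervenes
    m = re.search(rf"\b{_AFTER[lang]}\s+{_YEAR}\b", query, re.IGNORECASE)
    if m:
        return f"{m.group(1)}-01-01", None
    # Bare year: treat it as the full calendar year
    m = re.search(rf"\b{_YEAR}\b", query)
    if m:
        return f"{m.group(1)}-01-01", f"{m.group(1)}-12-31"
    return None, None
```

Under these assumptions a query without a numeric year, such as `Briefe aus dem Ersten Weltkrieg`, yields `(None, None)`, so named periods would need a separate lookup.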
## German — combined (all fields)
```json
{"query": "Briefe von Clara Cram an ihre Kinder über die Reise nach Mexiko im Jahr 1925", "lang": "de"}
{"query": "Kriegspost von Herbert Cram an Eugenie de Gruyter zwischen 1916 und 1918", "lang": "de"}
{"query": "Glückwünsche von Hilde de Gruyter zur Hochzeit im Jahr 1910", "lang": "de"}
{"query": "Kondolenzschreiben an Walter de Gruyter nach dem Tod von Eugenie", "lang": "de"}
```
## English
```json
{"query": "Letters from Clara Cram to Walter de Gruyter in 1920", "lang": "en"}
{"query": "Letters about the war before 1918", "lang": "en"}
{"query": "Letters to Herbert Cram after 1939", "lang": "en"}
{"query": "Birthday greetings from Anita Wöhler", "lang": "en"}
{"query": "Letters between 1914 and 1918", "lang": "en"}
```
## Spanish
```json
{"query": "Cartas de Clara Cram a Walter de Gruyter en 1920", "lang": "es"}
{"query": "Cartas antes de 1900", "lang": "es"}
{"query": "Cartas después de la guerra", "lang": "es"}
{"query": "Cartas de Juan Cram a sus hijos entre 1915 y 1920", "lang": "es"}
```
---
## Edge cases — lazy / missing words / typos
```json
{"query": "Clara", "lang": "de"}
{"query": "Eugenie", "lang": "de"}
{"query": "Herbert", "lang": "de"}
{"query": "de Gruyter", "lang": "de"}
{"query": "Briefe von Klara Kram an Herbert", "lang": "de"}
{"query": "briefe von clara cram an herbert 1920", "lang": "de"}
{"query": "1918", "lang": "de"}
{"query": "1914 1918", "lang": "de"}
{"query": "Krieg", "lang": "de"}
{"query": "Briefe von Eugenie", "lang": "de"}
{"query": "Clara Cram Herbert Cram 1920", "lang": "de"}
{"query": "Wer hat an Herbert Cram 1918 geschrieben?", "lang": "de"}
{"query": "von Clara", "lang": "de"}
{"query": "an Walter", "lang": "de"}
{"query": "Clara 1920", "lang": "de"}
{"query": "Kriegsbriefe von Herbert", "lang": "de"}
{"query": "Briefe von Clara nach Herbert", "lang": "de"}
{"query": "Briefe von Herrbert Cram", "lang": "de"}
```
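
The fenced blocks in this file hold one JSON object per line (JSON Lines), so calling `json.loads` on a whole block fails. A minimal sketch for turning the file into request bodies for `POST /parse` follows. The function name `iter_payloads` is hypothetical and not part of the service.

```python
import json
from typing import Dict, Iterator

# Built programmatically so this example can itself live inside a fenced block.
FENCE = "`" * 3


def iter_payloads(markdown: str) -> Iterator[Dict[str, str]]:
    """Yield one request body per non-empty line inside json-labelled fences."""
    inside = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # A fence line closes an open json block, or opens a new one
            # only when it carries the "json" label.
            inside = (not inside) and stripped == FENCE + "json"
            continue
        if inside and stripped:
            yield json.loads(stripped)
```

Each yielded dict can be passed as the `json=` argument of a test client or HTTP client call against `/parse`.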
---
## Known spaCy failures now fixed by DB-backed matcher
| Query | spaCy result | Expected |
|---|---|---|
| `Briefe von Eugenie` | persons=[] | persons=["Eugenie"] |
| `Kriegsbriefe von Herbert` | keywords=["herbert"] | persons=["Herbert"] |
| `Briefe von Herbert an Eugenie de Gruyter nach 1914` | persons=["Herbert an Eugenie de Gruyter"] (merged!) | persons=["Herbert", "Eugenie de Gruyter"] |
| `Letters from Clara Cram to Walter de Gruyter` | persons=[] (EN model doesn't know German names) | persons=["Clara Cram", "Walter de Gruyter"] |
| `Geburtstagsglückwünsche` | persons=["Geburtstagsglückwünsche"] (false positive!) | persons=[] |
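
The rows above share one cause: a statistical tagger guesses at names, while a lookup against known persons only reports names that exist. The sketch below illustrates that idea and is not the commit's `PersonMatcher`. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for rapidfuzz, and the helper names, the alias scheme (full name plus first name), and the 0.9 threshold are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


def build_aliases(persons: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map lowercase alias -> display form, for full names and first names."""
    aliases: Dict[str, str] = {}
    for first, last in persons:
        aliases[f"{first} {last}".lower()] = f"{first} {last}"
        aliases.setdefault(first.lower(), first)
    return aliases


def find_persons(query: str, aliases: Dict[str, str], threshold: float = 0.9) -> List[str]:
    """Greedy longest-span-first fuzzy lookup of known names in a query."""
    tokens = query.split()
    used = [False] * len(tokens)
    hits: List[Tuple[int, str]] = []
    max_len = max((len(a.split()) for a in aliases), default=0)
    for size in range(max_len, 0, -1):  # try three-word spans before single words
        for start in range(len(tokens) - size + 1):
            if any(used[start:start + size]):
                continue  # tokens already claimed by a longer match
            span = " ".join(tokens[start:start + size]).lower().strip("?,.!")
            best: Optional[str] = None
            score = 0.0
            for alias, display in aliases.items():
                s = SequenceMatcher(None, span, alias).ratio()
                if s > score:
                    best, score = display, s
            if best is not None and score >= threshold:
                hits.append((start, best))
                used[start:start + size] = [True] * size
    # Report names in the order they appear in the query.
    return [name for _, name in sorted(hits)]
```

Matching the longest spans first keeps `Eugenie de Gruyter` together while still letting the lone `Herbert` match on its own, which is the merged-span case in the table. A compound noun such as `Geburtstagsglückwünsche` is too far from every alias to pass the threshold, and a misspelling like `Herrbert Cram` still resolves to `Herbert Cram`.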