feat(nlp-service): replace spaCy NER with DB-backed PersonMatcher

Rule-based pipeline: persons matched via rapidfuzz against all known names loaded from DB at startup.

Fixes:
- first-name-only extraction (Eugenie, Herbert)
- merged-span bug (Herbert + Eugenie de Gruyter)
- false positives on compound nouns
- EN/ES model failures

Date extraction unchanged (regex). No spaCy models required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 11:00:03 +02:00
parent 9472d8c25e
commit 6c5cf8ec9b
8 changed files with 939 additions and 551 deletions
--- a/nlp-service/CLAUDE.md
+++ b/nlp-service/CLAUDE.md
@@ -5,23 +5,30 @@ replacing Ollama for the Familienarchiv NL search feature.

 ## Stack

- Python 3.11, FastAPI 0.115, spaCy 3.8, dateparser 1.2
+- Python 3.11, FastAPI 0.115, rapidfuzz 3.x, dateparser 1.2, psycopg2-binary
+
+No ML models — persons are matched against the live DB via fuzzy lookup.

 ## Endpoints

 - `POST /parse` — parse a free-text query, return extraction matching `OllamaExtraction` contract
- `GET /health` — returns `{"status": "ok"}` when all models are loaded
+- `GET /health` — returns `{"status": "ok", "persons_loaded": N}`

 ## Running locally

 ```bash
 pip install -r requirements.txt
-python -m spacy download de_core_news_sm en_core_web_sm es_core_news_sm
+
+# Without DB (empty person matcher — dates and keywords still work):
 uvicorn main:app --reload --port 8001

+# With DB (full person matching):
+DATABASE_URL=postgresql://archive_user:secret@localhost:5432/family_archive_db \
+  uvicorn main:app --reload --port 8001
+
 curl -X POST http://localhost:8001/parse \
  -H "Content-Type: application/json" \
-  -d '{"query": "Briefe von Opa Hermann an Marie vor 1920", "lang": "de"}'
+  -d '{"query": "Briefe von Clara Cram an Walter de Gruyter vor 1920", "lang": "de"}'
 ```

 ## Testing
@@ -30,6 +37,14 @@ curl -X POST http://localhost:8001/parse \
 pytest -v
 ```

+No DB required for tests — fixture pre-seeds the PersonMatcher with a small test corpus.
+
+## Architecture
+
+- `person_matcher.py` — DB-backed name lookup: loads all persons at startup, fuzzy-matches query tokens after person prepositions
+- `extractor.py` — pipeline: persons → role → dates (regex) → keywords (stopword filter)
+- `main.py` — FastAPI app; reads `DATABASE_URL` env var at startup
+
 ## Design spec

 See `docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md`.
@@ -39,3 +54,5 @@ See `docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md`.
 This is a **prototype** for extraction quality evaluation. No docker-compose integration or
 Java-side changes in this iteration. The extraction contract matches `OllamaExtraction` in
 `backend/src/main/java/org/raddatz/familienarchiv/search/`.
+
+Test sentences for manual evaluation are in `test_sentences.md`.
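
Illustration of the matching idea described under "Architecture" (fuzzy-match the tokens that follow a person preposition against the known names). This is a minimal sketch, not the code in `person_matcher.py`: the preposition set, threshold, span limit, and function name are assumptions, and the stdlib `difflib.SequenceMatcher` stands in for rapidfuzz so the sketch runs without extra dependencies.

```python
from difflib import SequenceMatcher

# Assumed values for illustration only; the real service may use different ones.
PERSON_PREPOSITIONS = {"von", "an"}
MAX_NAME_TOKENS = 3
THRESHOLD = 0.85


def match_persons(query, known_names):
    """Return known names that fuzzy-match the tokens following a person preposition."""
    tokens = query.split()
    found = []
    for i, token in enumerate(tokens):
        if token.lower() not in PERSON_PREPOSITIONS:
            continue
        best_name, best_score = None, 0.0
        # Try spans of 1..MAX_NAME_TOKENS tokens after the preposition and keep
        # the best-scoring name overall, so "Clara Cram" beats "Clara Cram an".
        for n in range(1, MAX_NAME_TOKENS + 1):
            span = " ".join(tokens[i + 1 : i + 1 + n]).lower()
            if not span:
                break
            for name in known_names:
                score = SequenceMatcher(None, span, name.lower()).ratio()
                if score > best_score:
                    best_name, best_score = name, score
        if best_name is not None and best_score >= THRESHOLD:
            found.append(best_name)
    return found


# Hypothetical corpus; in the service the names are loaded from the DB at startup.
names = ["Clara Cram", "Walter de Gruyter", "Eugenie de Gruyter"]
print(match_persons("Briefe von Clara Cram an Walter de Gruyter vor 1920", names))
```

Scoring every candidate span and keeping the best one, rather than taking the longest span that clears the threshold, is one way to avoid swallowing the next preposition into a name.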