familienarchiv/nlp-service/CLAUDE.md
Marcel 03d7d44e57 feat(nlp-service): replace spaCy NER with DB-backed PersonMatcher
Rule-based pipeline: persons matched via rapidfuzz against all known
names loaded from DB at startup. Fixes first-name-only extraction
(Eugenie, Herbert), merged-span bug (Herbert + Eugenie de Gruyter),
false positives on compound nouns, and EN/ES model failures.
Date extraction unchanged (regex). No spaCy models required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00

NLP Service

Lightweight FastAPI service that parses free-text search queries into structured extractions, replacing Ollama for the Familienarchiv NL search feature.

Stack

  • Python 3.11, FastAPI 0.115, rapidfuzz 3.x, dateparser 1.2, psycopg2-binary

No ML models: persons are fuzzy-matched against the names loaded from the DB at startup.

Endpoints

  • POST /parse: parses a free-text query and returns an extraction matching the OllamaExtraction contract
  • GET /health: returns {"status": "ok", "persons_loaded": N}

Running locally

pip install -r requirements.txt

# Without DB (empty person matcher — dates and keywords still work):
uvicorn main:app --reload --port 8001

# With DB (full person matching):
DATABASE_URL=postgresql://archive_user:secret@localhost:5432/family_archive_db \
  uvicorn main:app --reload --port 8001

curl -X POST http://localhost:8001/parse \
  -H "Content-Type: application/json" \
  -d '{"query": "Briefe von Clara Cram an Walter de Gruyter vor 1920", "lang": "de"}'
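The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names build_parse_request and parse_query are hypothetical and not part of the service, and the response is returned as a plain dict because the exact fields of the OllamaExtraction contract are not listed here.

```python
import json
import urllib.request

# Port taken from the uvicorn command above; adjust if the service runs elsewhere.
PARSE_URL = "http://localhost:8001/parse"


def build_parse_request(query: str, lang: str, url: str = PARSE_URL) -> urllib.request.Request:
    """Build the POST /parse request without sending it (hypothetical helper)."""
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_query(query: str, lang: str = "de") -> dict:
    """Send the request and decode the JSON response. Requires a running service."""
    with urllib.request.urlopen(build_parse_request(query, lang), timeout=5) as resp:
        return json.load(resp)
```

With the service running, parse_query("Briefe von Clara Cram an Walter de Gruyter vor 1920") sends the same payload as the curl example.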

Testing

pytest -v

No DB required for tests — fixture pre-seeds the PersonMatcher with a small test corpus.

Architecture

  • person_matcher.py — DB-backed name lookup: loads all persons at startup, fuzzy-matches query tokens after person prepositions
  • extractor.py — pipeline: persons → role → dates (regex) → keywords (stopword filter)
  • main.py — FastAPI app; reads DATABASE_URL env var at startup
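The pipeline order listed above (persons → role → dates → keywords) could look roughly like the sketch below. It is an illustration under stated assumptions, not the service's code: difflib.SequenceMatcher stands in for rapidfuzz so the block runs with the standard library only, and the preposition-to-role map, the 0.85 threshold, the method names and the output keys are all hypothetical. It only covers full-name matching; first-name-only lookup is left out.

```python
import re
from difflib import SequenceMatcher

# Hypothetical preposition -> role map; the real lists live in the service code.
PERSON_PREPOSITIONS = {"von": "sender", "an": "recipient"}
# Tokens that end a name window (assumption: prepositions and date words).
BOUNDARIES = set(PERSON_PREPOSITIONS) | {"vor", "nach"}
STOPWORDS = BOUNDARIES | {"der", "die", "das"}  # illustrative only
YEAR_RE = re.compile(r"\b(vor|nach)\s+(\d{4})\b", re.IGNORECASE)


class PersonMatcher:
    """Holds known names in memory (in the service these come from the DB at startup)."""

    def __init__(self, names, threshold=0.85):
        self.names = list(names)
        self.threshold = threshold

    def best_match(self, candidate):
        """Return the best-scoring known name at or above the threshold, else None."""
        best, best_score = None, 0.0
        for name in self.names:
            # Stand-in similarity; rapidfuzz would be used in place of difflib.
            score = SequenceMatcher(None, candidate.lower(), name.lower()).ratio()
            if score > best_score:
                best, best_score = name, score
        return best if best_score >= self.threshold else None

    def match_after(self, tokens, start, max_len=4):
        """Match the tokens following a preposition, longest window first."""
        limit = 0
        while (start + limit < len(tokens) and limit < max_len
               and tokens[start + limit].lower() not in BOUNDARIES):
            limit += 1
        for length in range(limit, 0, -1):
            name = self.best_match(" ".join(tokens[start:start + length]))
            if name:
                return name, length
        return None, 0


def extract(query, matcher):
    tokens = query.split()
    persons, used = [], set()
    i = 0
    while i < len(tokens):
        role = PERSON_PREPOSITIONS.get(tokens[i].lower())
        if role:
            name, length = matcher.match_after(tokens, i + 1)
            if name:
                persons.append({"name": name, "role": role})
                used.update(range(i, i + 1 + length))
                i += 1 + length
                continue
        i += 1
    dates = [{"relation": m.group(1).lower(), "year": int(m.group(2))}
             for m in YEAR_RE.finditer(query)]
    # Keywords: whatever is left after removing names, stopwords and bare numbers.
    keywords = [t for idx, t in enumerate(tokens)
                if idx not in used and t.lower() not in STOPWORDS and not t.isdigit()]
    return {"persons": persons, "dates": dates, "keywords": keywords}
```

Stopping the name window at the next preposition keeps "Clara Cram an" from being scored as a near match for "Clara Cram", which would otherwise swallow the preposition that introduces the second person.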

Design spec

See docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md.

Notes

This is a prototype for extraction quality evaluation. No docker-compose integration or Java-side changes in this iteration. The extraction contract matches OllamaExtraction in backend/src/main/java/org/raddatz/familienarchiv/search/.

Test sentences for manual evaluation are in test_sentences.md.