familienarchiv/nlp-service/CLAUDE.md

# NLP Service

Lightweight FastAPI service that parses free-text search queries into structured extractions,
replacing Ollama for the Familienarchiv NL search feature.

## Stack

- Python 3.11, FastAPI 0.115, rapidfuzz 3.x, dateparser 1.2, psycopg2-binary

No ML models are used: persons are matched by fuzzy lookup against the names loaded from the DB at startup.
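As a rough illustration of fuzzy name lookup, here is a minimal sketch. It uses the standard library's `difflib` as a stand-in for rapidfuzz so that it runs without extra dependencies. The `PERSONS` table, the `match_person` name, and the `0.8` cutoff are assumptions made for this example, not the service's actual code.

```python
from difflib import get_close_matches

# Hypothetical sample rows; the real service loads persons from the database.
PERSONS = {
    "clara cram": 1,
    "walter de gruyter": 2,
}


def match_person(candidate: str, cutoff: float = 0.8):
    """Return (person_id, canonical_name) for the closest known name, or None."""
    normalized = candidate.lower().strip()
    # get_close_matches ranks by difflib's similarity ratio and drops
    # anything scoring below the cutoff.
    hits = get_close_matches(normalized, list(PERSONS), n=1, cutoff=cutoff)
    if not hits:
        return None
    return PERSONS[hits[0]], hits[0]
```

With this table, a misspelling such as `match_person("Klara Cram")` still resolves to the stored entry, while an unrelated name returns `None`. The cutoff trades recall against false positives and would need tuning on real data.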

## Endpoints

- `POST /parse` — parses a free-text query and returns an extraction matching the `OllamaExtraction` contract
- `GET /health` — returns `{"status": "ok", "persons_loaded": N}`

## Running locally

```bash
pip install -r requirements.txt

# Without DB (empty person matcher — dates and keywords still work):
uvicorn main:app --reload --port 8001

# With DB (full person matching):
DATABASE_URL=postgresql://archive_user:secret@localhost:5432/family_archive_db \
  uvicorn main:app --reload --port 8001

curl -X POST http://localhost:8001/parse \
  -H "Content-Type: application/json" \
  -d '{"query": "Briefe von Clara Cram an Walter de Gruyter vor 1920", "lang": "de"}'
```
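The same request can be sent from Python. The sketch below uses only the standard library and assumes the service is running on port 8001 as in the commands above. The helper names `build_parse_request` and `parse_query` are made up for this example, and the shape of the returned JSON is not specified here.

```python
import json
import urllib.request


def build_parse_request(query: str, lang: str = "de",
                        base_url: str = "http://localhost:8001") -> urllib.request.Request:
    """Build the POST /parse request without sending it."""
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_query(query: str, lang: str = "de") -> dict:
    """Send the query to a locally running service and decode the JSON reply."""
    request = build_parse_request(query, lang)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `parse_query("Briefe von Clara Cram an Walter de Gruyter vor 1920")` requires the server to be up. `build_parse_request` can be inspected on its own without a running service.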

## Testing

```bash
pytest -v
```

No DB is required for tests: a fixture pre-seeds the `PersonMatcher` with a small test corpus.

## Architecture

- `person_matcher.py` — DB-backed name lookup: loads all persons at startup, fuzzy-matches query tokens after person prepositions
- `extractor.py` — pipeline: persons → role → dates (regex) → keywords (stopword filter)
- `main.py` — FastAPI app; reads `DATABASE_URL` env var at startup
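The pipeline order listed for `extractor.py` (persons, role, dates, keywords) can be sketched roughly as follows. This is a simplified illustration, not the actual implementation: it matches names exactly instead of fuzzily, and the preposition-to-role mapping, the stopword list, the date pattern, and the output field names are all assumptions that do not reflect the `OllamaExtraction` contract.

```python
import re

# All of the following are hypothetical values chosen for illustration.
ROLE_BY_PREPOSITION = {"von": "sender", "an": "recipient"}
STOPWORDS = {"von", "an", "vor", "nach", "der", "die", "das", "und"}
DATE_PATTERN = re.compile(r"\b(vor|nach)\s+(\d{4})\b", re.IGNORECASE)
KNOWN_PERSONS = ["Clara Cram", "Walter de Gruyter"]  # stand-in for the DB-backed matcher


def extract(query: str) -> dict:
    remaining = query
    persons = []

    # Steps 1 and 2: find a known name directly after a person preposition
    # and derive the role from that preposition.
    for preposition, role in ROLE_BY_PREPOSITION.items():
        for name in KNOWN_PERSONS:
            pattern = re.compile(
                rf"\b{preposition}\s+{re.escape(name)}\b", re.IGNORECASE
            )
            if pattern.search(remaining):
                persons.append({"name": name, "role": role})
                # Remove the matched span so later steps do not see it again.
                remaining = pattern.sub(" ", remaining)

    # Step 3: regex-based date extraction ("vor 1920" means before 1920).
    dates = {}
    match = DATE_PATTERN.search(remaining)
    if match:
        key = "before" if match.group(1).lower() == "vor" else "after"
        dates[key] = int(match.group(2))
        remaining = DATE_PATTERN.sub(" ", remaining)

    # Step 4: whatever is left, minus stopwords, becomes the keywords.
    keywords = [t for t in re.findall(r"\w+", remaining) if t.lower() not in STOPWORDS]

    return {"persons": persons, "dates": dates, "keywords": keywords}
```

Each stage removes the span it consumed before the next stage runs, so a name or year never reappears as a keyword. That is one plausible reason for running the stages in this order.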

## Design spec

See `docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md`.

## Notes

This is a **prototype** for evaluating extraction quality. This iteration includes no
docker-compose integration and no Java-side changes. The extraction contract matches
`OllamaExtraction` in `backend/src/main/java/org/raddatz/familienarchiv/search/`.

Test sentences for manual evaluation are in `test_sentences.md`.