Removes 'dateparser 1.2' from the stack section (dependency was dropped in favour of the rule-based date regex pipeline). Rewrites the Notes section to reflect that docker-compose integration and Java-side wiring were both delivered in this PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# NLP Service

Lightweight FastAPI service that parses free-text search queries into structured extractions,
replacing Ollama for the Familienarchiv NL search feature.

## Stack

- Python 3.11, FastAPI 0.115, rapidfuzz 3.x, psycopg2-binary

No ML models: persons are matched against the live DB via fuzzy lookup.

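To illustrate what a fuzzy lookup against a list of known persons can look like, here is a minimal sketch. It is not the service's code: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for rapidfuzz so that it runs without extra dependencies, and the function names and the `0.8` threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (stdlib stand-in for a rapidfuzz scorer)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_person(candidate: str, known_names: list[str], threshold: float = 0.8):
    """Return the closest known name, or None if nothing clears the threshold.

    `threshold` is an illustrative value, not one taken from the service.
    """
    best_name, best_score = None, 0.0
    for name in known_names:
        score = similarity(candidate, name)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# In the real service the names come from the database; here they are hard-coded.
names = ["Clara Cram", "Walter de Gruyter"]
misspelled = best_person("Klara Cram", names)  # a spelling variant of a known person
unknown = best_person("Goethe", names)         # not in the list at all
```

A threshold keeps unrelated tokens from being forced onto the nearest known name; where exactly to set it is a tuning decision.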
## Endpoints

- `POST /parse`: parses a free-text query and returns an extraction matching the `NlpExtraction` contract
- `GET /health`: returns `{"status": "ok", "persons_loaded": N}`

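The two endpoints can also be called from Python. The sketch below uses only the standard library and assumes the service listens on `http://localhost:8001`, the port used elsewhere in this README. The request body fields `query` and `lang` are taken from the curl example in this README; the helper names are hypothetical, and the shape of the `/parse` response is deliberately left untyped because it is not spelled out here.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8001"  # assumption: local dev port from this README


def build_parse_request(query: str, lang: str = "de", base_url: str = BASE_URL) -> Request:
    """Build (but do not send) the POST /parse request."""
    body = json.dumps({"query": query, "lang": lang}).encode("utf-8")
    return Request(
        f"{base_url}/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse(query: str, lang: str = "de", base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """Send the query to POST /parse and return the decoded JSON body."""
    with urlopen(build_parse_request(query, lang, base_url), timeout=timeout) as resp:
        return json.load(resp)


def health(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """Call GET /health and return the decoded JSON body."""
    with urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


# Building the request needs no running server; parse() and health() do.
req = build_parse_request("Briefe von Clara Cram an Walter de Gruyter vor 1920")
```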
## Running locally

```bash
pip install -r requirements.txt

# Without DB (empty person matcher; dates and keywords still work):
uvicorn main:app --reload --port 8001

# With DB (full person matching):
DATABASE_URL=postgresql://archive_user:secret@localhost:5432/family_archive_db \
  uvicorn main:app --reload --port 8001

curl -X POST http://localhost:8001/parse \
  -H "Content-Type: application/json" \
  -d '{"query": "Briefe von Clara Cram an Walter de Gruyter vor 1920", "lang": "de"}'
```

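The two run modes above differ only in whether `DATABASE_URL` is set. One way such a startup switch could be written is sketched below. This is an assumption about the general pattern, not the code in `main.py`: the function name `load_person_names` and the `persons` table with a `name` column are hypothetical, and `psycopg2` is imported lazily so the no-DB path runs without it.

```python
import os


def load_person_names(env=None) -> list[str]:
    """Return person names from the DB, or an empty list when no DB is configured."""
    env = os.environ if env is None else env
    url = env.get("DATABASE_URL")
    if not url:
        # No DB configured: the matcher starts empty, dates and keywords still work.
        return []

    import psycopg2  # imported lazily so the no-DB mode has no driver dependency

    conn = psycopg2.connect(url)
    try:
        with conn.cursor() as cur:
            # Hypothetical table and column names, for illustration only.
            cur.execute("SELECT name FROM persons")
            return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()


names_without_db = load_person_names({})  # simulate an environment without DATABASE_URL
```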
## Testing

```bash
pytest -v
```

No DB is required for tests: a fixture pre-seeds the `PersonMatcher` with a small test corpus.

## Architecture

- `person_matcher.py`: DB-backed name lookup; loads all persons at startup and fuzzy-matches the query tokens that follow person prepositions
- `extractor.py`: the extraction pipeline, run in the order persons → role → dates (regex) → keywords (stopword filter)
- `main.py`: FastAPI app; reads the `DATABASE_URL` env var at startup

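The pipeline order listed above (persons → role → dates → keywords) can be sketched as four small stages that each consume part of the query. Everything concrete here is an assumption made for illustration: the preposition-to-role mapping, the `vor`/`nach` date rule, the stopword list, and the output field names, which are not the `NlpExtraction` contract. Exact name matching stands in for the fuzzy matching the service performs.

```python
import re

# Hypothetical mapping from German person prepositions to roles.
PREPOSITION_ROLES = {"von": "sender", "an": "recipient"}
# Hypothetical date rule covering "vor 1920" / "nach 1920" only.
DATE_RULE = re.compile(r"\b(vor|nach)\s+(\d{4})\b", re.IGNORECASE)
# Tiny illustrative stopword list.
STOPWORDS = {"von", "an", "vor", "nach", "der", "die", "das", "und"}


def find_persons(text, known_persons):
    """Stage 1: find known names that directly follow a person preposition."""
    hits = []
    for preposition in PREPOSITION_ROLES:
        for name in known_persons:
            rule = re.compile(rf"\b{preposition}\s+{re.escape(name)}\b", re.IGNORECASE)
            if rule.search(text):
                hits.append((preposition, name))
                text = rule.sub(" ", text)  # consume the matched span
    return hits, text


def assign_roles(hits):
    """Stage 2: turn each (preposition, name) hit into a person with a role."""
    return [{"name": name, "role": PREPOSITION_ROLES[prep]} for prep, name in hits]


def find_dates(text):
    """Stage 3: apply the date regex and consume what it matched."""
    dates = [
        {"relation": m.group(1).lower(), "year": int(m.group(2))}
        for m in DATE_RULE.finditer(text)
    ]
    return dates, DATE_RULE.sub(" ", text)


def find_keywords(text):
    """Stage 4: whatever is left, minus stopwords, becomes keywords."""
    return [t for t in re.findall(r"\w+", text) if t.lower() not in STOPWORDS]


def extract(query, known_persons):
    """Run the four stages in order on one query."""
    hits, rest = find_persons(query, known_persons)
    persons = assign_roles(hits)
    dates, rest = find_dates(rest)
    return {"persons": persons, "dates": dates, "keywords": find_keywords(rest)}


result = extract(
    "Briefe von Clara Cram an Walter de Gruyter vor 1920",
    ["Clara Cram", "Walter de Gruyter"],
)
```

Running persons first matters in a design like this: once a name such as "Walter de Gruyter" has been consumed, its tokens can no longer leak into the keyword list.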
## Design spec

See `docs/superpowers/specs/2026-06-07-spacy-nlp-service-design.md`.

## Notes

This service is fully wired into `docker-compose.yml` (container `archive-nlp`, port 8001
internal-only) and the Java search path (`RestClientNlpClient` → `NlQueryParserService` →
`NlSearchController`). The extraction contract matches `NlpExtraction` in
`backend/src/main/java/org/raddatz/familienarchiv/search/`.

Test sentences for manual evaluation are in `test_sentences.md`.