Commit Graph

18 Commits

Author SHA1 Message Date
Marcel
00b2d46424 test(nlp-service): guard global matcher state in try/finally
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 15:50:32 +02:00
Marcel
d3da3b6cd1 chore(nlp-service): add .dockerignore to exclude dev artifacts from image
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 15:50:01 +02:00
Marcel
24e5ac9c22 chore(nlp-service): remove unused dateparser dependency
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 15:49:37 +02:00
Marcel
2eb5572d7a feat(nlp-service): wire NLP_FUZZY_THRESHOLD env var with 0-100 validation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 15:48:57 +02:00
Marcel
99d6a9a428 feat(nlp-service): cap /parse query at 500 chars via Field(max_length=500)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 15:47:40 +02:00
Marcel
4697f5fbb3 feat(nlp-service): log WARNING when DATABASE_URL absent, ERROR on DB failure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 15:47:03 +02:00
Marcel
5d8ec38474 fix(nlp-service): return generic 500 detail to prevent credential leakage
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 15:46:24 +02:00
Marcel
824f048640 fix(nlp-service): eliminate false-positive person matches from dirty DB records
- Wire _EXTRA_SPAN_STOPS into _extract_persons_and_role so German function
  words (im, seine, ihre, dem, …) terminate name spans — fixes "Clara im"
  and "seine Kinder" leaking into personNames
- Add _NON_NAME_TOKENS filter in PersonMatcher.load() to skip DB records
  whose first_name contains prepositions or possessives — filters 290 bad
  records (annotations like "an seine Eltern", "Eltern in", place references
  like "Enkel Cram aus Mexiko") that were causing exact Pass-2 matches
- Remove spaCy model downloads from Dockerfile (no longer needed after the
  DB-backed matcher rewrite)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 11:09:35 +02:00
Marcel
6c5cf8ec9b feat(nlp-service): replace spaCy NER with DB-backed PersonMatcher
Rule-based pipeline: persons matched via rapidfuzz against all known
names loaded from DB at startup. Fixes first-name-only extraction
(Eugenie, Herbert), merged-span bug (Herbert + Eugenie de Gruyter),
false positives on compound nouns, and EN/ES model failures.
Date extraction unchanged (regex). No spaCy models required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 11:00:03 +02:00
Marcel
9472d8c25e feat(nlp-service): Dockerfile — python:3.11-slim, models baked in
2026-06-07 10:31:18 +02:00
Marcel
8521e6f173 feat(nlp-service): FastAPI app with /parse and /health endpoints
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 10:29:32 +02:00
Marcel
cc4c81e218 feat(nlp-service): full extract() pipeline — assembles all steps
Also adds regex year-fallback in extract_dates() for de/es spaCy small
models that don't tag bare 4-digit years as DATE entities, and widens
the direction-token window to 2 tokens back to handle Spanish "antes de".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 10:28:40 +02:00
Marcel
55f419d20f feat(nlp-service): keyword extraction (POS-filtered, deduped lemmas)
2026-06-07 10:24:35 +02:00
Marcel
53f6dcbfed feat(nlp-service): date range extraction with direction detection
2026-06-07 10:23:33 +02:00
Marcel
0ab2e2a743 feat(nlp-service): role detection (sender/receiver/any)
2026-06-07 10:22:14 +02:00
Marcel
bff16f6f1f feat(nlp-service): NER person name extraction
2026-06-07 10:21:16 +02:00
Marcel
18f028e2dd feat(nlp-service): spaCy model loading with get_nlp/load_all_models
2026-06-07 10:17:07 +02:00
Marcel
e3b8e57746 feat(nlp-service): scaffold — models, requirements, CLAUDE.md
Task 1: Create standalone FastAPI service scaffold with models, test framework,
and documentation. Includes ParseRequest, ParseResponse Pydantic models matching
OllamaExtraction contract, plus three passing tests validating model validation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 10:13:08 +02:00
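
Note on 2eb5572d7a (NLP_FUZZY_THRESHOLD with 0-100 validation): the log only gives the variable name and the accepted range, so the following is a minimal sketch of what such a check might look like, not the service's actual code. The function name read_fuzzy_threshold, the default of 85, the integer-only parsing, and the choice to raise ValueError are all assumptions made for illustration. The 0-100 range lines up with the score scale that rapidfuzz similarity functions return.

```python
import os

ENV_VAR = "NLP_FUZZY_THRESHOLD"
DEFAULT_THRESHOLD = 85  # hypothetical default; the commit log does not state one


def read_fuzzy_threshold(environ=None):
    """Return the fuzzy-match threshold as an int in the range 0-100.

    `environ` is any mapping of environment variables; it defaults to
    os.environ and is injectable so the check can be tested in isolation.
    """
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR)

    # Unset or blank: fall back to the (assumed) default.
    if raw is None or raw.strip() == "":
        return DEFAULT_THRESHOLD

    # Reject anything that is not a plain integer.
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{ENV_VAR} must be an integer, got {raw!r}") from None

    # Enforce the 0-100 range named in the commit subject.
    if not 0 <= value <= 100:
        raise ValueError(f"{ENV_VAR} must be between 0 and 100, got {value}")
    return value
```

Calling read_fuzzy_threshold() once at startup would make a misconfigured value fail fast instead of silently changing match behaviour at request time; whether the real service fails fast or falls back to a default is not stated in the log.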