fix(nlp-service): eliminate false-positive person matches from dirty DB records

- Wire _EXTRA_SPAN_STOPS into _extract_persons_and_role so German function
  words (im, seine, ihre, dem, …) terminate name spans — fixes "Clara im"
  and "seine Kinder" leaking into personNames
- Add _NON_NAME_TOKENS filter in PersonMatcher.load() to skip DB records
  whose first_name contains prepositions or possessives — filters 290 bad
  records (annotations like "an seine Eltern", "Eltern in", place references
  like "Enkel Cram aus Mexiko") that were causing exact Pass-2 matches
- Remove spaCy model downloads from Dockerfile (no longer needed after the
  DB-backed matcher rewrite)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Marcel
2026-06-07 11:09:35 +02:00
parent 6c5cf8ec9b
commit 824f048640
3 changed files with 37 additions and 6 deletions

View File

@@ -9,11 +9,6 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Bake models into the image — no volume needed, ~350 MB total
RUN python -m spacy download de_core_news_sm \
&& python -m spacy download en_core_web_sm \
&& python -m spacy download es_core_news_sm
COPY . .
RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1001 nlp \