fix(nlp-service): eliminate false-positive person matches from dirty DB records

- Wire _EXTRA_SPAN_STOPS into _extract_persons_and_role so German function
  words (im, seine, ihre, dem, …) terminate name spans. Fixes "Clara im" and
  "seine Kinder" leaking into personNames.
- Add _NON_NAME_TOKENS filter in PersonMatcher.load() to skip DB records whose
  first_name contains prepositions or possessives. Filters 290 bad records
  (annotations like "an seine Eltern", "Eltern in", place references like
  "Enkel Cram aus Mexiko") that were causing exact Pass-2 matches.
- Remove spaCy model downloads from Dockerfile (no longer needed after the
  DB-backed matcher rewrite).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
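
Note on the _NON_NAME_TOKENS filter: the filter itself is not part of the hunks shown here, so the following is only a minimal sketch of the idea under stated assumptions. The token set, the PersonRecord shape, and the helper names are hypothetical and are not taken from PersonMatcher.load().

```python
from dataclasses import dataclass

# Hypothetical token set; the actual contents of _NON_NAME_TOKENS are not shown
# in this commit view.
NON_NAME_TOKENS = frozenset({"an", "in", "aus", "seine", "ihre"})


@dataclass(frozen=True)
class PersonRecord:
    """Hypothetical stand-in for a person row loaded from the DB."""
    first_name: str
    last_name: str = ""


def is_plausible_name(record: PersonRecord) -> bool:
    """Return False if any first_name token is a preposition or possessive."""
    tokens = record.first_name.lower().split()
    return bool(tokens) and not any(tok in NON_NAME_TOKENS for tok in tokens)


def load_clean(records: list[PersonRecord]) -> list[PersonRecord]:
    """Keep only records whose first_name looks like an actual name."""
    return [rec for rec in records if is_plausible_name(rec)]


records = [
    PersonRecord("an seine Eltern"),
    PersonRecord("Eltern in"),
    PersonRecord("Enkel Cram aus Mexiko"),
    PersonRecord("Clara"),
]
kept = load_clean(records)
```

With these assumed inputs, only the "Clara" record survives. Dropping such records at load time means the matching passes never see them, which is presumably cheaper than rejecting the bad matches afterwards.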
@@ -66,6 +66,22 @@ _DATE_BETWEEN: dict[str, frozenset[str]] = {
     "es": frozenset({"entre"}),
 }
 
+# ── Extra span-termination tokens (function words that cannot be in a name) ──
+
+_EXTRA_SPAN_STOPS: dict[str, frozenset[str]] = {
+    # German articles, possessives, and particles that end a name span
+    "de": frozenset({
+        "im", "am", "beim", "zum", "zur",
+        "dem", "den", "des",
+        "sein", "seine", "seinen", "seiner",
+        "ihr", "ihre", "ihrem", "ihren", "ihrer",
+        "unser", "unsere", "unseren",
+        "über", "auch", "oder", "und",
+    }),
+    "en": frozenset(),
+    "es": frozenset({"el", "la", "los", "las", "su", "sus", "mi"}),
+}
+
 # ── Stopword lists ────────────────────────────────────────────────────────────
 
 _STOPWORDS: dict[str, frozenset[str]] = {
@@ -138,7 +154,7 @@ def _extract_persons_and_role(
         return [], "any"
 
     preps = _ALL_PERSON_PREPS[lang]
-    stops = preps | _DATE_BEFORE[lang] | _DATE_AFTER[lang] | _DATE_BETWEEN[lang]
+    stops = preps | _DATE_BEFORE[lang] | _DATE_AFTER[lang] | _DATE_BETWEEN[lang] | _EXTRA_SPAN_STOPS[lang]
     matches = m.find_in_query(query, preps, stop_tokens=stops)
 
     person_names = [text for text, _ in matches]
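
Note on the stop-token change: the body of find_in_query is not shown, so this is a rough sketch of how a larger stop set can cut a name span short. It is not the service's actual matching code. The collect_name_span helper, the example query, and the reduced stop set are assumptions made for illustration.

```python
def collect_name_span(tokens: list[str], start: int, stop_tokens: frozenset[str]) -> str:
    """Collect tokens from `start` until a stop token (case-insensitive) or the end."""
    span: list[str] = []
    for tok in tokens[start:]:
        if tok.lower() in stop_tokens:
            break  # a function word ends the candidate name
        span.append(tok)
    return " ".join(span)


# Reduced, illustrative subset of the German entries in _EXTRA_SPAN_STOPS.
extra_stops_de = frozenset({"im", "dem", "seine", "ihre"})

# Hypothetical query; the name candidate starts after the preposition "von".
tokens = "Briefe von Clara im Garten".split()

without_extra = collect_name_span(tokens, 2, frozenset())
with_extra = collect_name_span(tokens, 2, extra_stops_de)
```

Under these assumptions, without_extra is "Clara im Garten" while with_extra is "Clara". The union in the hunk above has the same effect on the real stop set: each added function word becomes a point where a span must end.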