- Wire _EXTRA_SPAN_STOPS into _extract_persons_and_role so German function
words (im, seine, ihre, dem, …) terminate name spans — fixes "Clara im"
and "seine Kinder" leaking into personNames
- Add _NON_NAME_TOKENS filter in PersonMatcher.load() to skip DB records
whose first_name contains prepositions or possessives — filters 290 bad
records (annotations like "an seine Eltern", "Eltern in", place references
like "Enkel Cram aus Mexiko") that were causing exact Pass-2 matches
- Remove spaCy model downloads from Dockerfile (no longer needed after the
DB-backed matcher rewrite)
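
For illustration, the two filters described above could be sketched roughly as follows. This is a minimal sketch, not the code in this commit: the set contents, `take_name_span`, and `is_plausible_first_name` are hypothetical stand-ins, since the real `_EXTRA_SPAN_STOPS` and `_NON_NAME_TOKENS` values are only partly listed here.

```python
# Hypothetical contents; the commit only names a few of the real entries.
EXTRA_SPAN_STOPS = {"im", "seine", "ihre", "dem"}
NON_NAME_TOKENS = {"an", "in", "aus", "seine", "ihre"}


def take_name_span(tokens):
    """Collect leading capitalised tokens; a stop word ends the span,
    even when it is capitalised at the start of a sentence."""
    span = []
    for tok in tokens:
        if tok.lower() in EXTRA_SPAN_STOPS or not tok[:1].isupper():
            break
        span.append(tok)
    return " ".join(span)


def is_plausible_first_name(first_name):
    """Reject DB records whose first_name holds a preposition or possessive."""
    return not any(t.lower() in NON_NAME_TOKENS for t in first_name.split())


print(take_name_span(["Clara", "im", "Garten"]))
print(is_plausible_first_name("Enkel Cram aus Mexiko"))
```

Lower-casing before the set lookup is an assumption made here so that a sentence-initial "Seine" is treated the same way as "seine".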
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rule-based pipeline: persons matched via rapidfuzz against all known
names loaded from DB at startup. Fixes first-name-only extraction
(Eugenie, Herbert), merged-span bug (Herbert + Eugenie de Gruyter),
false positives on compound nouns, and EN/ES model failures.
Date extraction unchanged (regex). No spaCy models required.
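
A rough sketch of the matching idea, assuming a list of known names already loaded from the DB. The standard-library `difflib` stands in for rapidfuzz here so the example has no third-party dependency; `match_person`, the sample names, and the 0.85 cutoff are hypothetical and not taken from this commit.

```python
from difflib import get_close_matches

# Illustrative sample; the real list is loaded from the DB at startup.
KNOWN_NAMES = ["Eugenie", "Herbert", "Eugenie de Gruyter"]


def match_person(candidate, known=KNOWN_NAMES, cutoff=0.85):
    """Return the closest known name for a candidate span, or None."""
    # Compare case-insensitively, then map back to the stored spelling.
    lowered = {name.lower(): name for name in known}
    hits = get_close_matches(candidate.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


print(match_person("Eugenie"))
print(match_person("Garten"))
```

Matching candidate spans one at a time against the known names is what lets a bare first name resolve on its own, while an ordinary noun falls below the cutoff and is dropped.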
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Also adds regex year-fallback in extract_dates() for de/es spaCy small
models that don't tag bare 4-digit years as DATE entities, and widens
the direction-token window to 2 tokens back to handle Spanish "antes de".
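
The year fallback and the two-token look-back might look something like the sketch below. It is written under assumptions: `fallback_years`, the year pattern, and the `DIRECTION_TOKENS` mapping are hypothetical, and whitespace splitting replaces spaCy tokenisation to keep the example self-contained.

```python
import re

# Bare 4-digit years from 1000 to 2099 (assumed range).
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

# Hypothetical direction words mapped to a normalised direction.
DIRECTION_TOKENS = {
    "antes": "before", "vor": "before", "before": "before",
    "después": "after", "nach": "after", "after": "after",
}


def fallback_years(text, window=2):
    """Find bare years and look up to `window` tokens back for a direction word."""
    tokens = text.split()
    results = []
    for i, tok in enumerate(tokens):
        match = YEAR_RE.search(tok)
        if not match:
            continue
        direction = None
        for prev in tokens[max(0, i - window):i]:
            direction = DIRECTION_TOKENS.get(prev.lower(), direction)
        results.append((int(match.group(1)), direction))
    return results


print(fallback_years("antes de 1920"))
```

With a one-token window the look-back only sees "de", so the direction in "antes de 1920" is lost; a window of two reaches "antes".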
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Task 1: Create standalone FastAPI service scaffold with models, test framework,
and documentation. Includes ParseRequest and ParseResponse Pydantic models matching
the OllamaExtraction contract, plus three passing tests covering model validation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>