familienarchiv/nlp-service/test_sentences.md
Marcel 03d7d44e57 feat(nlp-service): replace spaCy NER with DB-backed PersonMatcher
Rule-based pipeline: persons matched via rapidfuzz against all known
names loaded from DB at startup. Fixes first-name-only extraction
(Eugenie, Herbert), merged-span bug (Herbert + Eugenie de Gruyter),
false positives on compound nouns, and EN/ES model failures.
Date extraction unchanged (regex). No spaCy models required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00

# NLP Service — Test Sentences
Real data drawn from the Familienarchiv DB (2026-06-07).
Top persons: Clara Cram, Herbert Cram, Eugenie de Gruyter, Walter de Gruyter, Marie Cram,
Juan Cram, Albert de Gruyter, Hilde de Gruyter, Else Bohrmann, Anita Wöhler, Lili Duvenbeck.
Date range: ~1895–1945. Key tags: Krieg, Hochzeit, Reise, Geburtstag, Tod, Alltag, Briefwechsel.
---
## German — full sentences
```json
{"query": "Briefe von Clara Cram an Walter de Gruyter im Jahr 1920", "lang": "de"}
{"query": "Briefe von Herbert an Eugenie de Gruyter nach 1914", "lang": "de"}
{"query": "Schreiben von Albert de Gruyter an seine Kinder vor 1900", "lang": "de"}
{"query": "Briefe von Juan Cram an Marie zwischen 1915 und 1918", "lang": "de"}
{"query": "Telegramm von Walter de Gruyter an Clara im Jahr 1930", "lang": "de"}
{"query": "Briefe von Else Bohrmann an Herbert Cram nach 1939", "lang": "de"}
```
## German — medium (person + date, no strong role signal)
```json
{"query": "Briefe von Clara Cram vor 1910", "lang": "de"}
{"query": "Dokumente über Walter de Gruyter aus den 1920er Jahren", "lang": "de"}
{"query": "Briefe an Herbert Cram nach dem Krieg", "lang": "de"}
{"query": "Schriften von Eugenie de Gruyter im Jahr 1905", "lang": "de"}
```
## German — short (person only)
```json
{"query": "Briefe an Walter de Gruyter", "lang": "de"}
{"query": "Dokumente über Clara Cram", "lang": "de"}
{"query": "Herbert Cram", "lang": "de"}
{"query": "Anita Wöhler", "lang": "de"}
```
## German — topic only (keywords → tag resolution on Java side)
```json
{"query": "Briefe aus dem Krieg", "lang": "de"}
{"query": "Kriegspost", "lang": "de"}
{"query": "Hochzeitsbriefe", "lang": "de"}
{"query": "Reisebriefe", "lang": "de"}
{"query": "Geburtstagsglückwünsche", "lang": "de"}
{"query": "Briefe über die Hochzeitsreise", "lang": "de"}
{"query": "Kinderbriefe", "lang": "de"}
{"query": "Familienbriefe aus dem Alltag", "lang": "de"}
{"query": "Brautbriefe", "lang": "de"}
{"query": "Kondolenzbriefe nach dem Tod von Eugenie", "lang": "de"}
```
## German — date range only
```json
{"query": "Briefe aus dem Ersten Weltkrieg", "lang": "de"}
{"query": "Dokumente zwischen 1914 und 1918", "lang": "de"}
{"query": "Briefe vor 1900", "lang": "de"}
{"query": "Schriften nach 1920", "lang": "de"}
```
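The date-only queries above rely on regex extraction. The following is a minimal sketch of how such patterns might look for the German phrasings `zwischen … und …`, `vor …` and `nach …`. The pattern names and the `extract_date_range` function are hypothetical and not taken from the service, and whether a bound is inclusive is left to the caller.

```python
import re
from typing import Optional, Tuple

YEAR = r"(\d{4})"

# Hypothetical patterns for illustration; the service's actual regexes are not shown here.
BETWEEN = re.compile(rf"\bzwischen\s+{YEAR}\s+und\s+{YEAR}\b", re.IGNORECASE)
BEFORE = re.compile(rf"\bvor\s+{YEAR}\b", re.IGNORECASE)
AFTER = re.compile(rf"\bnach\s+{YEAR}\b", re.IGNORECASE)


def extract_date_range(query: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (from_year, to_year) as written in the query; None means open-ended."""
    m = BETWEEN.search(query)
    if m:
        return int(m.group(1)), int(m.group(2))
    m = BEFORE.search(query)
    if m:
        return None, int(m.group(1))
    m = AFTER.search(query)
    if m:
        return int(m.group(1)), None
    # Named periods such as "Ersten Weltkrieg" carry no digits and are not handled here.
    return None, None
```

Because the patterns require a four-digit year, phrases such as `nach dem Krieg` or `Reise nach Mexiko` do not produce a date in this sketch. A named period like `Ersten Weltkrieg` would need a separate lookup.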
## German — combined (all fields)
```json
{"query": "Briefe von Clara Cram an ihre Kinder über die Reise nach Mexiko im Jahr 1925", "lang": "de"}
{"query": "Kriegspost von Herbert Cram an Eugenie de Gruyter zwischen 1916 und 1918", "lang": "de"}
{"query": "Glückwünsche von Hilde de Gruyter zur Hochzeit im Jahr 1910", "lang": "de"}
{"query": "Kondolenzschreiben an Walter de Gruyter nach dem Tod von Eugenie", "lang": "de"}
```
## English
```json
{"query": "Letters from Clara Cram to Walter de Gruyter in 1920", "lang": "en"}
{"query": "Letters about the war before 1918", "lang": "en"}
{"query": "Letters to Herbert Cram after 1939", "lang": "en"}
{"query": "Birthday greetings from Anita Wöhler", "lang": "en"}
{"query": "Letters between 1914 and 1918", "lang": "en"}
```
## Spanish
```json
{"query": "Cartas de Clara Cram a Walter de Gruyter en 1920", "lang": "es"}
{"query": "Cartas antes de 1900", "lang": "es"}
{"query": "Cartas después de la guerra", "lang": "es"}
{"query": "Cartas de Juan Cram a sus hijos entre 1915 y 1920", "lang": "es"}
```
---
## Edge cases — lazy / missing words / typos
```json
{"query": "Clara", "lang": "de"}
{"query": "Eugenie", "lang": "de"}
{"query": "Herbert", "lang": "de"}
{"query": "de Gruyter", "lang": "de"}
{"query": "Briefe von Klara Kram an Herbert", "lang": "de"}
{"query": "briefe von clara cram an herbert 1920", "lang": "de"}
{"query": "1918", "lang": "de"}
{"query": "1914 1918", "lang": "de"}
{"query": "Krieg", "lang": "de"}
{"query": "Briefe von Eugenie", "lang": "de"}
{"query": "Clara Cram Herbert Cram 1920", "lang": "de"}
{"query": "Wer hat an Herbert Cram 1918 geschrieben?", "lang": "de"}
{"query": "von Clara", "lang": "de"}
{"query": "an Walter", "lang": "de"}
{"query": "Clara 1920", "lang": "de"}
{"query": "Kriegsbriefe von Herbert", "lang": "de"}
{"query": "Briefe von Clara nach Herbert", "lang": "de"}
{"query": "Briefe von Herrbert Cram", "lang": "de"}
```
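The typo cases above (`Klara Kram`, `Herrbert`) depend on fuzzy matching against known names. The sketch below illustrates the idea with `difflib` from the standard library as a stand-in for rapidfuzz, so its scores will differ from what the service computes. The token list, the `correct_token` helper and the `0.7` cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical subset of name tokens; the real service loads all names from the DB.
KNOWN_TOKENS = ["clara", "cram", "herbert", "eugenie", "walter", "gruyter"]


def correct_token(token: str, cutoff: float = 0.7) -> Optional[str]:
    """Return the closest known name token, or None if nothing is similar enough."""
    hits = get_close_matches(token.lower(), KNOWN_TOKENS, n=1, cutoff=cutoff)
    return hits[0] if hits else None


if __name__ == "__main__":
    for word in "Briefe von Klara Kram an Herrbert".split():
        print(word, "->", correct_token(word))
```

A lower cutoff tolerates more typos but raises the risk that ordinary words are mistaken for names, which is why the lowercase and misspelled queries are worth keeping in the test set.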
---
## Known spaCy failures now fixed by DB-backed matcher
| Query | spaCy result | Expected |
|---|---|---|
| `Briefe von Eugenie` | persons=[] | persons=["Eugenie"] |
| `Kriegsbriefe von Herbert` | keywords=["herbert"] | persons=["Herbert"] |
| `Briefe von Herbert an Eugenie de Gruyter nach 1914` | persons=["Herbert an Eugenie de Gruyter"] (merged!) | persons=["Herbert", "Eugenie de Gruyter"] |
| `Letters from Clara Cram to Walter de Gruyter` | persons=[] (EN model doesn't know German names) | persons=["Clara Cram", "Walter de Gruyter"] |
| `Geburtstagsglückwünsche` | persons=["Geburtstagsglückwünsche"] (false positive!) | persons=[] |
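
The rows above can be reproduced with a lookup against known names instead of a statistical NER model. The following is a rough sketch, not the service's `PersonMatcher`: it uses exact, case-insensitive token matching (no rapidfuzz scoring), a hard-coded subset of names instead of a DB load, and the hypothetical helpers `build_index` and `match_persons`.

```python
import re
from typing import Dict, List, Tuple

# Subset of the names listed at the top of this file; the real service loads them from the DB.
KNOWN_PERSONS = ["Clara Cram", "Herbert Cram", "Eugenie de Gruyter", "Walter de Gruyter"]


def build_index(persons: List[str]) -> Dict[Tuple[str, ...], str]:
    """Map lowercase token tuples (full names and first names) to display text."""
    index: Dict[Tuple[str, ...], str] = {}
    for full in persons:
        tokens = tuple(full.lower().split())
        index[tokens] = full
        # Also allow a first-name-only match, e.g. "Eugenie".
        index.setdefault(tokens[:1], full.split()[0])
    return index


def match_persons(query: str, persons: List[str] = KNOWN_PERSONS) -> List[str]:
    """Greedy longest-match scan over the query tokens."""
    index = build_index(persons)
    max_len = max(len(key) for key in index)
    tokens = re.findall(r"\w+", query)
    found: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first so full names win over first names.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in index:
                found.append(index[key])
                i += n
                break
        else:
            i += 1
    return found
```

Since only whole tokens that appear in the name index can match, a compound noun such as `Geburtstagsglückwünsche` yields no person, and `Herbert an Eugenie de Gruyter` splits into two spans because `an` is not part of any known name. Surname-only queries like `de Gruyter` are not covered by this sketch.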