Commit Graph

3391 Commits

Author SHA1 Message Date
Marcel
4cbe1dc2d3 feat(search): add NlpExtraction record, NlpClient and NlpHealthClient interfaces
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00
Marcel
324a76d6d2 test(nlp-service): guard global matcher state in try/finally
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00
Marcel
f4e8632e0d chore(nlp-service): add .dockerignore to exclude dev artifacts from image
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00
Marcel
829194f345 chore(nlp-service): remove unused dateparser dependency
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00
Marcel
98ee6cf587 feat(nlp-service): wire NLP_FUZZY_THRESHOLD env var with 0-100 validation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00
Marcel
778382cd61 feat(nlp-service): cap /parse query at 500 chars via Field(max_length=500)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00
Marcel
8d4f30019b feat(nlp-service): log WARNING when DATABASE_URL absent, ERROR on DB failure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00
Marcel
d65879d273 fix(nlp-service): return generic 500 detail to prevent credential leakage
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00
Marcel
bda7855cad fix(nlp-service): eliminate false-positive person matches from dirty DB records
- Wire _EXTRA_SPAN_STOPS into _extract_persons_and_role so German function
  words (im, seine, ihre, dem, …) terminate name spans — fixes "Clara im"
  and "seine Kinder" leaking into personNames
- Add _NON_NAME_TOKENS filter in PersonMatcher.load() to skip DB records
  whose first_name contains prepositions or possessives — filters 290 bad
  records (annotations like "an seine Eltern", "Eltern in", place references
  like "Enkel Cram aus Mexiko") that were causing exact Pass-2 matches
- Remove spaCy model downloads from Dockerfile (no longer needed after the
  DB-backed matcher rewrite)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00
Marcel
03d7d44e57 feat(nlp-service): replace spaCy NER with DB-backed PersonMatcher
Rule-based pipeline: persons matched via rapidfuzz against all known
names loaded from DB at startup. Fixes first-name-only extraction
(Eugenie, Herbert), merged-span bug (Herbert + Eugenie de Gruyter),
false positives on compound nouns, and EN/ES model failures.
Date extraction unchanged (regex). No spaCy models required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00
Marcel
aa200bf3c5 feat(nlp-service): Dockerfile — python:3.11-slim, models baked in 2026-06-08 10:56:32 +02:00
Marcel
7404284130 feat(nlp-service): FastAPI app with /parse and /health endpoints
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00
Marcel
3ddb2b278b feat(nlp-service): full extract() pipeline — assembles all steps
Also adds regex year-fallback in extract_dates() for de/es spaCy small
models that don't tag bare 4-digit years as DATE entities, and widens
the direction-token window to 2 tokens back to handle Spanish "antes de".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:56:32 +02:00
Marcel
702a72d575 feat(nlp-service): keyword extraction (POS-filtered, deduped lemmas) 2026-06-08 10:56:32 +02:00
Marcel
3f74deda8c feat(nlp-service): date range extraction with direction detection 2026-06-08 10:56:32 +02:00
Marcel
8ed2a6d95b feat(nlp-service): role detection (sender/receiver/any) 2026-06-08 10:56:32 +02:00
Marcel
deea34c797 feat(nlp-service): NER person name extraction 2026-06-08 10:56:32 +02:00
Marcel
482a1c2863 feat(nlp-service): spaCy model loading with get_nlp/load_all_models 2026-06-08 10:56:32 +02:00
Marcel
8e63867ad8 docs(specs): UI specs for Lesereisen reader and Journey editor
All checks were successful
CI / Unit & Component Tests (push) Successful in 3m23s
CI / OCR Service Tests (push) Successful in 24s
CI / Backend Unit Tests (push) Successful in 4m2s
CI / fail2ban Regex (push) Successful in 46s
CI / Semgrep Security Scan (push) Successful in 21s
CI / Compose Bucket Idempotency (push) Successful in 1m8s
nightly / deploy-staging (push) Successful in 2m44s
lesereisen-reader-spec.html — Issue #752
  LR-0 type selector on /geschichten/new
  LR-1 REISE badge on the list
  LR-2 Journey reader (ordered cards, interlude asides, no position numbers)

lesereisen-editor-spec.html — Issue #753
  LE-1 empty JourneyEditor layout
  LE-2 editor with mixed items (documents + interludes, drag handles)
  LE-3 inline note-editing state
  LE-4 mobile layout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 19:07:34 +02:00
Marcel
6b0a06e8b1 feat(nlp-service): scaffold — models, requirements, CLAUDE.md
Task 1: Create standalone FastAPI service scaffold with models, test framework,
and documentation. Includes ParseRequest, ParseResponse Pydantic models matching
OllamaExtraction contract, plus three passing tests validating model validation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 10:11:34 +02:00
Marcel
7c1eef710c docs(nlp): add spaCy NLP service implementation plan
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 09:52:07 +02:00
Marcel
03e22a2f26 docs(nlp): add spaCy NLP service prototype design spec
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 09:40:00 +02:00
Marcel
6878419156 merge: resolve conflicts with origin/main (#763 person name-match integration)
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m31s
CI / OCR Service Tests (pull_request) Successful in 25s
CI / Backend Unit Tests (pull_request) Successful in 3m48s
CI / fail2ban Regex (pull_request) Successful in 45s
CI / Semgrep Security Scan (pull_request) Successful in 22s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m6s
CI / Unit & Component Tests (push) Successful in 3m20s
CI / OCR Service Tests (push) Successful in 23s
CI / Backend Unit Tests (push) Successful in 3m48s
CI / fail2ban Regex (push) Successful in 46s
CI / Semgrep Security Scan (push) Successful in 23s
CI / Compose Bucket Idempotency (push) Successful in 1m8s
- Drop unused MAX_CANDIDATES constant (not referenced in service)
- Keep detached-entity safety comment in resolveTags()
- Add 3 new partial-name match tests (23a/b/c) from #763
- Use resolveByName() API in test 28 (replaces findByDisplayNameContaining)
- Add NameMatches glossary entry from #763

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:50:48 +02:00
Marcel
09b77e9b36 test(person): pin fetchPool dedup when one person matches two tokens (#763 review)
All checks were successful
CI / Unit & Component Tests (push) Successful in 3m20s
CI / OCR Service Tests (push) Successful in 24s
CI / Backend Unit Tests (push) Successful in 3m53s
CI / fail2ban Regex (push) Successful in 44s
CI / Semgrep Security Scan (push) Successful in 21s
CI / Compose Bucket Idempotency (push) Successful in 1m5s
Assert that when the same person id is returned by two different token
fetches, the person appears exactly once in the result -- pinning
fetchPool's putIfAbsent dedup so a future refactor can't silently
double-classify a candidate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
9d202b042b test(person): close fetch-to-classify seam for alias matches on real Postgres (#763 review)
AC#4 (maiden alias -> direct) and AC#5 (alias first name -> fetchable +
classifiable) were each split across PersonRepositoryTest (the fetch) and
PersonServiceTest (the classifier with stubs) -- nothing walked
searchByName -> resolveByName end-to-end on real Postgres. Add two tests
in the existing @DataJpaTest slice that build a real PersonService over
the autowired repositories, persist a person with a MAIDEN_NAME alias and
one with an alias firstName, and assert both classify as direct.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
8429b1e9f8 fix(search): derive disambiguation trigger aria-label from match count (#763 review)
The trigger hardcoded the multiple-people label for every count, so a
single did-you-mean picker announced "Mehrere Personen gefunden" to
screen readers while sighted users saw one name and a "Meintest du …?"
heading. Derive the trigger's accessible name from persons.length: a
single suggestion reuses the heading prop, two or more keep the
multiple-people label. Visible truncated name span unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
6959651b36 docs(search): document NameMatches and resolveByName (#763)
GLOSSARY entry for NameMatches (direct vs partial name-match strength and how
the search layer maps it); person/README adds resolveByName to the public
surface. No ADR — the matching rule is localized and justified inline.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
0ef4f4f07c feat(search): case-appropriate disambiguation picker copy (#763)
A 1-item picker now reads "Meintest du …?" (a single direct match auto-selects
and never reaches the picker), while ≥2 keeps the "Person auswählen" framing.
The prompt lives in a visible, non-truncated panel heading (the trigger span
clips at 320px), and the "(auswählen…)" cue is dropped for the 1-item case.
DisambiguationPicker takes heading + showCue props; the page derives both from
ambiguousPersons.length. New search_disambiguation_did_you_mean key in de/en/es.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
f1bb9d3a69 feat(search): map direct/partial NameMatches into resolve buckets (#763)
resolveNames now delegates to PersonService.resolveByName and maps by match
strength: 1 direct → resolved (auto-select), ≥2 direct → ambiguous, 0 direct
with partials → ambiguous suggestions, 0 candidates → folded into full-text.
A single direct match no longer forces the picker when looser substring hits
coexist. The MAX_CANDIDATES cap moved into PersonService (after classification);
the MAX_NAME_LENGTH guard, resolved-cap overflow, and sender/receiver mapping
are preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
ca52145556 feat(person): add resolveByName for direct/partial name matching (#763)
Token-set containment over all of a person's name components (firstName,
lastName, alias, each PersonNameAlias first+last, title) decides direct vs
partial. Orchestrates tokenize → cap(8) → fetch pool → classify → cap(10)
after classification, with an empty-token guard and a PII-free debug log of
the outcome bucket. MAX_TOKENS is a DoS control; the after-classify cap keeps a
direct match that sorts past position 10 among partials. Read-only transaction
keeps lazy nameAliases reachable during classification (ADR-022).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
9a26bf75b0 feat(person): match alias first names in searchByName (#763)
The direct-match classifier accepts alias firstName tokens, so the fetch must
surface candidates matchable only via an alias first name. Add a.firstName to
the searchByName LIKE clause (reuses the bound :query — injection-proof). The
person_name_aliases.first_name column already exists; no migration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
9c616f9fb8 feat(person): add name-match tokenizer for direct matching (#763)
Lowercase, split on whitespace/hyphen/apostrophe, drop empties. Applied
symmetrically to query and candidate name components so "Anna-Maria" and
"Anna Maria" tokenize alike. Foundation for resolveByName direct matching.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
0fe0ae5235 docs(search): ADR-028 fix + glossary + C4 diagram for tag resolution (#743)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
2c909f49a8 feat(search): wire theme chip removal to URL navigation in +page.svelte
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
87fd0f39bb feat(search): render removable theme chips in InterpretationChipRow
When tagsApplied is true, each resolvedTag renders as a 'Thema: Name'
chip with optional inline color style from the tag's resolved color.
Clicking × calls onRemoveChip('theme', tag.name).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
7f3ad8ce89 feat(api): add TagHint schema and extend NlQueryInterpretation with resolvedTags/tagsApplied
Manual update since Docker compose backend runs old build; regenerate with
npm run generate:api once new backend is deployed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
aa1f6436cc feat(i18n): add search_chip_theme_prefix to de/en/es message bundles
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
b825076733 test(search): DataJpaTest for descendant-expansion via TagRepository
Verifies the recursive CTE in findDescendantIdsByName expands a parent tag
to include all child IDs, and that findByNameContainingIgnoreCase matches
both parent and child names when the fragment appears in both.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
01df815bad test(search): add 11 tag-resolution test cases to NlQueryParserServiceTest
Covers multi-tag match, no-match FTS fallback, mixed resolution, personRole
bypass, cap at 10, short-keyword skip, dedup, rawQuery suppression when all
keywords resolve, flag independence, colour propagation via resolveEffectiveColors,
and colour=null when depth constraint prevents resolution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
dcd0e725a7 feat(search): implement keyword→tag resolution in NlQueryParserService
Keywords that substring-match the tag taxonomy become OR-union tag filters;
non-matching keywords stay as FTS text. Resolved tags surface in the
NlQueryInterpretation as TagHint objects with effective colours. The
rawQuery fallback is now guarded by hadStructuredMatch to prevent
double-apply when all keywords resolve.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
39ff63921d refactor(search): extract ChipType to chip-types.ts; audit NL fixtures
Pre-implementation step for #743: ChipType union extracted from
InterpretationChipRow and +page.svelte into shared chip-types.ts;
resolvedTags/tagsApplied neutral defaults added to test fixtures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
5a09cd4cb4 feat(search): extend NlQueryInterpretation with resolvedTags + tagsApplied
Positional record fields added; all 3 construction sites updated with neutral
defaults; NlQueryParserService wired for TagService (4th constructor arg);
NlQueryParserServiceTest and NlSearchControllerTest synced.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
4e0ebc72c8 feat(search): add TagHint record for NL tag resolution API surface
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
0f0d89702d feat(search): add TagService.findByNameContaining for NL tag resolution
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 08:47:47 +02:00
Marcel
fb41affd4c docs(search): note vitest-browser workaround for + in path
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m22s
CI / OCR Service Tests (pull_request) Successful in 24s
CI / Backend Unit Tests (pull_request) Successful in 3m47s
CI / fail2ban Regex (pull_request) Successful in 46s
CI / Semgrep Security Scan (pull_request) Successful in 24s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m6s
Addresses @Sara review: browser tests in this spec fail silently when
the project path contains '+' (common in git worktrees). The comment
tells developers to copy the frontend directory to a clean path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 00:58:36 +02:00
Marcel
dc366ed403 docs(search): add detached-entity safety comment in resolveTags
Addresses @Markus review: tags fetched by findByNameContaining live outside
any transaction; Hibernate's dirty-check never fires on them. The comment
removes the ambiguity for cold readers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 00:58:03 +02:00
Marcel
64b7b2315d docs(search): ADR-028 fix + glossary + C4 diagram for tag resolution (#743)
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m25s
CI / OCR Service Tests (pull_request) Successful in 23s
CI / Backend Unit Tests (pull_request) Successful in 4m1s
CI / fail2ban Regex (pull_request) Successful in 45s
CI / Semgrep Security Scan (pull_request) Successful in 23s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m7s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 23:42:23 +02:00
Marcel
2a7e133717 feat(search): wire theme chip removal to URL navigation in +page.svelte
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 23:40:33 +02:00
Marcel
5387bc9247 feat(search): render removable theme chips in InterpretationChipRow
When tagsApplied is true, each resolvedTag renders as a 'Thema: Name'
chip with optional inline color style from the tag's resolved color.
Clicking × calls onRemoveChip('theme', tag.name).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 23:33:53 +02:00
Marcel
847874abb3 feat(api): add TagHint schema and extend NlQueryInterpretation with resolvedTags/tagsApplied
Manual update since Docker compose backend runs old build; regenerate with
npm run generate:api once new backend is deployed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 23:01:11 +02:00