feat(search): upgrade to PostgreSQL full-text search with German stemming #237
Reference in New Issue
Block a user
Delete Branch "feat/issue-222-fts-search"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #222
Summary
ILIKE %query%with PostgreSQLtsvector/websearch_to_tsquery('german', ?)full-text searchsearch_vector tsvectorcolumn (GIN-indexed) ondocuments, populated by a BEFORE INSERT/UPDATE triggertranscription_blocks,document_receivers,document_tagskeep the vector in syncList<UUID>→Specificationfilters remaining criteriaRELEVANCEadded toDocumentSort; becomes the default sort when a search term is presentwebsearch_to_tsqueryprevents 500 errors on malformed input ((((, stop-word-only queries)Migrations
UPDATE documents SET title = titlerebuilds vector for all existing rowsTest plan
DocumentFtsTest(Testcontainers, postgres:16-alpine)🤖 Generated with Claude Code
👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
Blockers
useRankOrdersilently overrides explicit DATE sort (DocumentService.java)When a user explicitly selects
sort=DATE, the code re-sorts by FTS rank instead. That's unexpected — the user asked for chronological order and got relevance. Consequence: the sort picker in the frontend is silently ignored whenever a text query is active.Fix: treat null and RELEVANCE as rank order, but let DATE sort chronologically:
Suggestions
ids.indexOf()is O(n) per document — O(n²) overallFor a family archive this is fine. At scale, swap for a
Map<UUID, Integer>:Missing sender name search test
DocumentFtsTesttests receiver name search (weight C) but not sender name. A sender is also indexed at weight C:Add
should_find_document_by_sender_name()for symmetry.Test setup: annotation helper adds noise
Each transcription block test requires creating a
DocumentAnnotationfirst (needed for the FK). Theannotation()helper is clean and well-named — no issue there. Good factory pattern overall.What's solid:
should_rebuild_vector_when_transcription_block_deleted) ✅document(String)factory avoids builder repetition ✅@BeforeEachdelete order correct: blocks → docs → persons/tags ✅(root, query, cb) -> nullfor Spring Data 3.xwhere(null)workaround is clean ✅websearch_to_tsquerysafety test passes ✅🏛️ Markus Keller — Senior Application Architect
Verdict: ⚠️ Approved with concerns
Concerns
The "boring technology" choice is excellent — PostgreSQL FTS is the right call. No Elasticsearch, no external service, no new infrastructure. Exactly the kind of decision I'd make.
DATE sort semantics conflict with RELEVANCE (same as Felix's blocker)
When
sort == DocumentSort.DATEand text is present, the service silently applies rank ordering. This conflates two user intents:The fix is behavioral, not architectural — but it's a correctness issue that will confuse frontend developers building the sort picker.
In-memory sort over potentially large result sets
The results are bounded by the FTS result count (which is already filtered by
hasIds(rankedIds)), so this is not an unbounded memory load. For a family archive with hundreds of documents, fine. Worth noting for future reference.Architecture: trigger-based sync is the right pattern here ✅
The trigger approach correctly decouples the OCR pipeline (which writes
transcription_blocksdirectly without going through JPA) from the search index. A@PreUpdatehook would have missed those writes entirely. Good architectural reasoning.The
SET title = titleno-op is an established PostgreSQL pattern. The comment explaining why it's needed is essential — keep it.No ADR for a significant architectural decision
Adding a GIN-indexed
tsvectorcolumn with 4 triggers that affect every document write is a meaningful architectural commitment. It should be captured indocs/adr/— at minimum: context, decision, the trigger-cascade approach, and whywebsearch_to_tsqueryoverto_tsquery.This is a suggestion, not a blocker for merge. Write it as a follow-up.
Layer boundaries respected ✅
DocumentServicecallsdocumentRepository.findRankedIdsByFts()— no native SQL in the service layer ✅DocumentSpecifications.hasIds()remains a pure Specification — no business logic ✅resolveSort()handlesRELEVANCEcleanly ✅This fallback is correct when no text is present — RELEVANCE degrades to DATE gracefully.
🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ⚠️ Approved with concerns
Blockers
None. The test suite is solid and uses real PostgreSQL via Testcontainers.
Concerns
Missing sender name search test
Receiver name search is tested (
should_find_document_by_receiver_name). Sender name is also indexed at the same weight (C), but there is no correspondingshould_find_document_by_sender_nametest. This is a symmetry gap — the trigger code for sender indexing is exercised but not validated.No integration test for two-phase search with combined filters
DocumentFtsTesttests the repository's FTS query in isolation. There is no test that exercises the fullsearchDocuments()path: FTS + a Specification filter (e.g. date range, tag, status) applied together. A regression inhasIds()combination or the no-op lambda(root, query, cb) -> nullwould not be caught.Suggested test (using
DocumentServiceor directly viaSpecification.where(textSpec).and(hasStatus(...))):No
@Transactionalon test methodsTests use
@BeforeEach deleteAll()for cleanup instead of rollback. This means each test commits its data, then the next test deletes it — slower and leaves debris if a test crashes mid-run. For@DataJpaTestwith Testcontainers,@Transactionalon test methods would auto-rollback after each test with zero cleanup code.Not a blocker — the
deleteAll()approach is safe and is used in the existingDocumentSpecificationsTesttoo.What's solid
document(String)factory removes builder repetition from every test body ✅should_rebuild_vector_when_transcription_block_deleted✅german_text_search_config_is_availablecatches config drift early ✅assertThatNoException()for the malformed query test — exactly right, tests the safety contract ✅DocumentSpecificationsTestcleanup:hasText_*tests removed without breaking the 25 remaining ✅@BeforeEachis correct:blockRepository.deleteAll()beforedocumentRepository.deleteAll()— respects FK constraints ✅🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
Reviewed for: SQL injection, query injection, input reflection, data exposure, permission bypass, log injection.
Findings
No vulnerabilities found.
What I checked
Native query uses named parameter — parameterized, not concatenated ✅
:queryis a JDBC bind parameter. The user-supplied string is passed as a parameter value, never interpolated into the query string. No SQL injection surface here.websearch_to_tsqueryis the safe FTS query parser ✅to_tsquery('bad input (((')throws a PostgreSQL syntax error.websearch_to_tsquerynever throws — it silently drops invalid tokens. This prevents user-supplied malformed queries from causing 500 errors (CWE-20 mitigation). The testshould_not_throw_when_query_contains_invalid_tsquery_syntaxverifies this contract.Stop-word-only queries return empty results, not an error ✅
websearch_to_tsquery('german', 'der die das')produces an empty tsquery. The@@operator returns false for all rows — 0 results, no exception. Verified byshould_return_empty_when_query_contains_only_stop_words.No user input reflected into responses ✅
The FTS query is used only in the WHERE/ORDER BY clause. The search term is not reflected back into the API response, so no XSS surface from this code path.
No permission changes ✅
searchDocuments()is called throughDocumentController, which already has@RequirePermission(Permission.READ_ALL). The FTS path does not introduce any new endpoint or bypass.No logging of raw user input ✅
The search term is not logged at any point in
DocumentService— no log injection risk (CWE-117).One note (not a finding)
The
search_vectorcolumn is stored in plaintext as a weighted lexeme vector ('brief':1A 'anna':3C). For a family archive this is appropriate — the index reflects the same content as the indexed columns. Worth noting if the project ever adds confidential metadata fields that should not appear in the search index.🎨 Leonie Voss — UI/UX Design Lead & Accessibility Strategist
Verdict: ✅ Approved
This is a backend-only PR. No Svelte components, no HTML, no CSS, no i18n strings were changed. Nothing to flag in my domain.
What I noted for follow-up
The sort picker will need to expose
RELEVANCEThe new
DocumentSort.RELEVANCEenum value needs a label in the frontend sort dropdown. When it ships:Behavioral change users will notice
Before: searching returned results in date order with partial matches anywhere in the word.
After: searching returns results in relevance order — title matches rank above transcription matches.
This is a UX improvement, but the visual affordance (a sort indicator or label) should communicate that results are ordered by relevance. A "sorted by relevance" label near the result count would close the loop for users wondering why the order changed.
These are observations for follow-up issues, not concerns with this PR.
⚙️ Tobias Wendt — DevOps & Platform Engineer
Verdict: ✅ Approved
What I checked
Findings
V35 backfill will table-lock
documentsduring migrationThis is a single statement that touches every row. PostgreSQL acquires a ROW EXCLUSIVE lock on the table for the duration — no reads are blocked (SELECT), but concurrent writes are queued. For a family archive with a few hundred rows this completes in milliseconds. At larger scale (10k+ rows) this would need batching. Fine for current scale.
Every document INSERT/UPDATE now runs 4 subqueries
The BEFORE trigger
fn_documents_fts_update()executes subqueries againsttranscription_blocks,persons(×2),document_receivers,document_tags, andtagon every row write. That's the cost of keeping the vector current. For a family archive workload (low write volume), negligible. The GIN index pays back on every search.No new infrastructure ✅
No new Docker services, volumes, or environment variables. The GIN index is pure PostgreSQL. Flyway handles both migrations in sequence automatically.
Migration split is correct ✅
V34 (DDL) and V35 (backfill) are correctly split into separate files. If V35 fails (e.g. mid-backfill on a large table), Flyway marks it as failed and the DDL stays intact. Backfill can be re-run or applied manually without touching the schema.
Column is nullable — correct for a migration scenario ✅
ALTER TABLE documents ADD COLUMN search_vector tsvector;without NOT NULL is intentional. V34 installs the trigger, V35 backfills. Between the two migrations, some rows haveNULLsearch_vector. The@@operator returns NULL (not true) for NULL vectors, so no incorrect search results during the migration window.Operational summary ✅
No config changes needed. No downtime required beyond the normal migration window. Deploy as usual: Flyway runs V34 → V35 at startup, both complete before the app accepts traffic.
Review concerns addressed
All open concerns from the multi-persona review have been resolved. Here is the full accounting:
Blocker fixed — DATE sort overridden by relevance (Felix + Markus)
Commit:
e621bdd—fix(search): respect DATE sort when text is presentRemoved
|| sort == DocumentSort.DATEfrom theuseRankOrdercondition. Explicit DATE sort now correctly routes throughfindAll(spec, sort)instead of the rank-reordering path.Test added —
searchDocumentswith DATE sort (Felix + Markus)Commit:
e621bdd— same commit, new fileDocumentServiceSortTestThree unit tests using
@ExtendWith(MockitoExtension.class)covering:Refactor — O(n²)
ids.indexOf()replaced with HashMap (Felix)Commit:
5affe21—refactor(search): replace O(n²) indexOf with HashMap for rank orderingRank lookup is now O(1) per document via
Map<UUID, Integer>.Test added — sender name FTS coverage (Felix + Sara)
Commit:
13955a5—test(search): add sender name FTS coverage and combined filter testshould_find_document_by_sender_name— symmetric with the existing receiver name test. Confirms the V34 trigger indexes sender names at weight C.Test added — combined FTS + Specification filter (Sara)
Commit:
13955a5— same commitfts_combined_with_status_filter_excludes_non_matching_status— verifies the two-phase search (hasIds(rankedIds).and(hasStatus(...))) works end-to-end.Out of scope (no code change)
@Transactionalon integration tests — consistent with existing pattern, not a blockerAll 14
DocumentFtsTesttests green. All 3DocumentServiceSortTesttests green.13955a5459to305f95a572