feat: upgrade search from ILIKE to PostgreSQL full-text search #222
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
Document search currently uses
ILIKE %query%which is naive — no stemming, no relevance ranking, no boolean operators. "Briefe" won't match "Brief", there's no way to combine terms with AND/OR, and results aren't ordered by relevance.Proposal
Switch to PostgreSQL's built-in full-text search using
tsvector/tsquerywith the German dictionary (germanconfig).What this enables
ts_rank()to order results by match qualityImplementation sketch
search_vector tsvectorcolumn todocuments, populate it from title, summary, transcription, location, sender/receiver namesCREATE INDEX idx_documents_search ON documents USING GIN (search_vector)search_vectorin sync when documents are updatedtsquery-based querytranscription_blockstable)Open questions
transcriptionfield vs.transcription_blocks— index both?🏗️ Markus Keller — Application Architect
Architecture review covering the open questions in the issue, plus the user's additional goal of making transcription text searchable as a first-class feature.
Resolved
1. PostgreSQL FTS is sufficient — no Elasticsearch needed
tsvector/tsquerywith thegermandictionary covers stemming, relevance ranking, boolean operators, prefix and phrase matching — everything this archive needspg_trgmtrigram search as a fuzzy fallback — still within PostgreSQL, zero additional infrastructure2. Single
search_vectorondocuments, aggregated from metadata +transcription_blocksdocumentsis the natural search unit — users search for documents, not individual blocksdocumentsmetadata +transcription_blocks.textjoined at trigger timetranscriptionfield excluded — it will be deleted;transcription_blocksis the canonical source3. DB trigger on
documentsandtranscription_blocks— not@PreUpdateTranscriptionService, bypassing any JPA lifecycle hooks onDocumenttranscription_blocksfires on every insert/update/delete regardless of write source — the only approach that stays consistent4. Weight assignment
titlesummary,transcription_blocks.texttags,locationTranscription text stays at B (not A) because OCR output can be noisy; promoting it to A would let recognition errors distort rankings.
5. Single
germanconfig for all documentslanguagecolumn adds trigger complexity and a field to maintain — not justified until there's a real multi-language use caseOverall: the proposal is architecturally sound and self-contained within PostgreSQL. No new infrastructure. The trigger-based sync and single-vector-per-document design are the right calls for this stack and scale.
👨💻 Felix Brandt — Senior Fullstack Developer
Questions & Observations
TDD approach for triggers and FTS queries
postgres:16-alpineincludes thegermandictionary), but it requires careful fixture setup. The key integration test: insert atranscription_block→ assertdocuments.search_vectoris updated. That's real-Postgres territory, not a unit test.Specification, a@Query, a native query? That determines how invasive the replacement is. If it's in aSpecification, the new tsquery path likely needs a native@Queryinstead, which is a structural change worth flagging up front.JPA / Hibernate type mapping
tsvectoris. Thesearch_vectorcolumn needs@Column(columnDefinition = "tsvector"). Since the trigger owns writes, the entity field should be@Generated(INSERT)(or just excluded from writes) and mapped toStringfor read-back. Otherwise Hibernate will try to write it on everysave()and fail.Documententities or a projection? Full entities with lazy-loaded relations on a large result set will cause N+1 problems. If the search page only needs title, date, and sender, a projection DTO is safer.API / error handling
to_tsquery('german', userInput)throws a Postgres syntax error on malformed input (unclosed parens, stray&). That becomes a 500 unless caught and mapped toDomainException.badRequest(). What's the error handling strategy for invalid query syntax?ts_rank()scores are added to responses,npm run generate:apineeds to run and TypeScript types will change.Suggestions
websearch_to_tsquery('german', ?)for user-facing input — it never throws on malformed input and understandsAND,OR,-(NOT), and"phrase"natively, without exposing tsquery syntax to users.ORDER BY ts_rank(...) DESCand avoid touching the TypeScript layer.should_find_document_by_stem_when_searching_inflected_form,should_rank_title_match_above_transcription_match,should_return_empty_when_query_contains_only_stop_words.🧪 Sara Holt — QA Engineer
Questions & Observations
Missing acceptance criteria
germanconfig? Concrete examples: "searching 'Briefe' must return documents containing 'Brief'" is testable. "Stemming works" is not.Test coverage plan
transcription_block→ verifydocuments.search_vectoris updated; delete atranscription_block→ verify vector is rebuilt."und","der","die") — Postgres drops them silently, returning no results"Müller","straße"must work with thegermanconfigtranscription_blocks— vector should still work from title/summaryInitial backfill path never tested in CI
search_vectorfor existing rows. If the CI database starts empty, this backfill path is never exercised. Add a migration test that inserts a document before the FTS migration runs and asserts the vector is populated after all migrations complete.Suggestions
ts_rank()ordering is non-deterministic for equal-ranked results. Add a secondary sort (e.g.document_date DESC) so result order is stable and test assertions don't depend on Postgres internals.should_return_empty_result_when_query_contains_only_stop_wordsas an explicit test — this is the most likely user-visible surprise and it's easy to miss without a targeted test.🔐 Nora Steiner — Security Engineer
Questions & Observations
FTS query injection surface
to_tsquery('german', userInput)throws a Postgres syntax error on malformed input (((((,&|, unclosed parens). This is a denial-of-service surface: any unauthenticated or authenticated user can trigger repeated 500 errors with trivial input. It's not SQL injection (no data exfiltration), but it's a reliability issue with a security smell.websearch_to_tsquery('german', ?)orplainto_tsquery('german', ?)— both never throw on malformed input.websearch_to_tsquery()supportsAND,OR,-,"phrase"in Google syntax without exposing raw tsquery operators.plainto_tsquery()treats everything as literal terms.Authorization boundary — routine confirmation
@RequirePermission(Permission.READ_ALL)is already on the search controller), but the PR should explicitly confirm this is retained when the repository method is replaced. A@WebMvcTestthat asserts 403 for a user withoutREAD_ALLwould cover this permanently.Ranking as an information channel
ts_rank()returns a float. If rank scores are exposed in the API response, a user can infer content they haven't read: a high score means a strong match, revealing information about document text. For a family archive this is likely acceptable, but it's an implicit decision worth making explicit. If scores are backend-only (used forORDER BY), the concern disappears.ts_headline()snippets are ever added to search results, they return fragments of raw transcription text. Those fragments must respect the same visibility rules as full documents — something to call out in a future issue rather than implement as an afterthought.Suggestions
websearch_to_tsquery()as the default. It's the safest option for user-facing input and gives users familiar boolean semantics.should_not_throw_500_when_query_contains_invalid_tsquery_syntax. Ifwebsearch_to_tsquery()is used, this test verifies graceful handling; if someone later changes it toto_tsquery(), the test catches it immediately.@RequirePermission(READ_ALL)is retained on the search endpoint — one line in the PR description, one permanent test.🎨 Leonie Voss — UI/UX Design Lead
Questions & Observations
This issue is backend-focused, but search is the primary user interaction in the archive. A few UX questions worth resolving before shipping.
Behavior changes that will surprise users
der,die,das,und,von) are silently dropped by thegermanconfig. A user typing "die Briefe von Oma" gets results for "Brief" and "Oma" only — but doesn't know whydieandvonwere ignored. The current empty-state message "Keine Dokumente gefunden" gives no hint.Discoverability of new capabilities
Grund:*,Brief & Oma), users have no way to discover them. If they're available viawebsearch_to_tsquery()syntax (Grundstück OR Grundbuch,-Oma), that's more natural but still undiscovered without a hint.Opportunities worth noting for a follow-up
ts_headline()would make results dramatically more useful for a document archive. Users scanning for "Grundstück" can see the sentence where it appears without opening each document. Worth a separate issue.Suggestions
AND,OR,-um Begriffe zu kombinieren." — even if only added post-launch.websearch_to_tsquery()supports.🛠️ Tobias Wendt — DevOps & Platform Engineer
Questions & Observations
Initial Flyway migration performance
search_vectorand backfills existing rows runs inside a transaction. On a live database with hundreds of documents and thousands oftranscription_blocks, this transaction holds table locks for the duration of the backfill — potentially blocking application writes during deploy. Has this been tested against a database with realistic data volume?ADD COLUMN search_vector tsvector) and a second for the data backfill (UPDATE documents SET search_vector = ...). The backfill can also run outside Flyway's default transaction boundary by setting@NonTransactionalon the migration, avoiding lock contention.germandictionary availability in Testcontainerspostgres:16-alpine. Alpine-based PostgreSQL images may not include all language dictionaries by default — thegermanconfig requirespostgresql-contrib(specifically thetsearch_datafiles). If the dictionary is missing, FTS queries fail withtext search configuration "german" does not existand integration tests will fail in CI but pass locally if developers use a full Debian-based image.SELECT cfgname FROM pg_ts_config WHERE cfgname = 'german'assertion in the migration integration test, or just run the Testcontainers suite once in CI before writing all the tests.GIN index size
tsvectorcolumn can add 20–50% to the indexed table size depending on content density. With transcription text from OCR, the vectors will be large. Worth baseliningpg_database_size()andpg_indexes_size('documents')before and after the migration in the deploy notes, so any unexpected growth is caught rather than discovered weeks later when backup times increase.Trigger overhead on OCR runs
transcription_blockinserts per OCR run, each firing the trigger to rebuild the fulltsvector. For a 100-block document that's 100 recalculations in one transaction. At current scale this is fine, but it would be useful to log the total time spent rebuilding the vector during an OCR run (or emit a metric) so degradation is visible if document count grows.Suggestions
germandictionary presence in the Testcontainers setup early — before writing trigger tests — to avoid a mysterious CI failure that wastes half a day.pg_database_sizeandpg_indexes_sizeto the PR deploy checklist as pre/post metrics.👨💻 Felix Brandt — Implementation Discussion
Working through the open implementation questions before coding starts. All six items resolved.
Resolved
1. ILIKE query location
The text search lives in
DocumentSpecifications.hasText()— a JPA Criteria API Specification. The Criteria API can't expresssearch_vector @@ tsquery, sohasText()is replaced via Option C (two-phase):List<UUID>from the FTS queryhasText()is replaced byhasIds(List<UUID>)— a simpleroot.get("id").in(ids)predicateisBetween,hasSender,hasTags, etc.) remain unchanged2. Query function:
websearch_to_tsqueryUse
websearch_to_tsquery('german', ?)for all user input. Never throws on malformed input (no 500 risk). SupportsAND,OR,-,"phrase"in Google syntax without exposing raw tsquery operators to users.3. JPA entity mapping
search_vectoris not mapped in theDocumententity. The trigger owns it; JPA never touches it. The native FTS query reads it directly in SQL.4. Search result type
Keep
List<Document>(full entities) — no projection DTO in this PR. Add@EntityGraphon thefindAllcall to eager-loadsender,receivers, andtagsand prevent N+1. Projection DTO is a future optimization.5. Sort: RELEVANCE
Add
RELEVANCEtoDocumentSortenum. Becomes the default whentextis present and no explicitsortparam was passed. Rank order preserved via in-memory sort:ids.indexOf(doc.getId()). Native FTS query usesORDER BY ts_rank(...) DESC, document_date DESCfor stable tiebreaking.6. TDD starting point
New
DocumentFtsTestinrepository/, using the existing@DataJpaTest+@Import({PostgresContainerConfig.class, FlywayConfig.class})pattern (same asDocumentSpecificationsTest). First failing test: insert atranscription_block→ assertsearch_vector @@ plainto_tsquery('german', ...)returns 1 row. Write the trigger SQL to make it green.Additional finding: tag hierarchy (issue #221)
No blocking dependency between #221 and #222. The FTS trigger indexes tag names at weight D (
string_agg(t.name, ' ')); it reads onlyt.nameand is unaffected by anyparent_idcolumn #221 may add. Intentional decision: the trigger does not inject parent tag names into the vector — hierarchical tag navigation is handled by thehasTags()Specification filter, not by text search. This keeps the two features independent.Overall read
Implementation path is clear. The two-phase ID approach keeps the existing filter Specifications intact,
websearch_to_tsqueryremoves the only security/reliability risk, and the existing Testcontainers infrastructure means the trigger test can be written immediately. No new infrastructure needed for this PR.Implementierung abgeschlossen ✅
Branch:
feat/issue-222-fts-searchWas wurde umgesetzt
Migration V34 (
efeb913) — DDL:search_vector tsvector-Spalte + GIN-Index aufdocuments, Trigger-Funktionfn_documents_fts_update()(BEFORE INSERT/UPDATE, berechnet gewichteten Vektor: A=Titel, B=Zusammenfassung+Transkription, C=Sender/Empfänger-Namen, D=Tags+Ort) sowie AFTER-Trigger auftranscription_blocks,document_receiversunddocument_tags, die bei Änderungen das Elterndokument neu indizieren.Migration V35 (
3421f32) — Backfill:UPDATE documents SET title = titlefeuert den BEFORE-UPDATE-Trigger für alle existierenden Zeilen und befülltsearch_vectormit dem aktuellen Stand aller Join-Tabellen.Java-Änderungen (
349f74d):DocumentSortumRELEVANCEerweitertDocumentRepository.findRankedIdsByFts()— native Query mitwebsearch_to_tsquery('german', ?)+ts_rank-SortierungDocumentSpecifications:hasText()entfernt,hasIds(List<UUID>)hinzugefügtDocumentService.searchDocuments(): zweistufige Suche — FTS liefert gerankteList<UUID>, Specification filtert verbleibende Kriterien; RELEVANCE ist Standard, wenn Text vorhandenTests
12 neue Integrationstests in
DocumentFtsTest:Alle 895 Tests grün. JAR-Build erfolgreich.
Commits
efeb913feat(fts): add search_vector column, GIN index, DB triggers, and FTS repository method (V34)349f74dfeat(fts): replace ILIKE hasText with FTS two-phase search and RELEVANCE sort3421f32feat(fts): backfill search_vector for all existing documents (V35)