feat: upgrade search from ILIKE to PostgreSQL full-text search #222

Closed
opened 2026-04-12 08:36:18 +02:00 by marcel · 8 comments
Owner

Problem

Document search currently uses ILIKE %query%, which is naive — no stemming, no relevance ranking, no boolean operators. "Briefe" won't match "Brief", there is no way to combine terms with AND/OR, and results aren't ordered by relevance.

Proposal

Switch to PostgreSQL's built-in full-text search using tsvector / tsquery with the German dictionary (german config).

What this enables

  • Stemming: "Briefe" matches "Brief", "geschrieben" matches "schreiben"
  • Relevance ranking: ts_rank() to order results by match quality
  • Boolean operators: AND/OR/NOT between terms
  • Prefix matching: "Grund:*" matches "Grundstück", "Grundbuch"
  • Phrase matching: "Reichsmark 500" as exact phrase

Implementation sketch

  1. Flyway migration: Add a search_vector tsvector column to documents, populate it from title, summary, transcription, location, sender/receiver names
  2. GIN index: CREATE INDEX idx_documents_search ON documents USING GIN (search_vector)
  3. Trigger or @PreUpdate: Keep search_vector in sync when documents are updated
  4. Repository: Replace ILIKE specification with tsquery-based query
  5. Consider: Whether transcription block text should also feed into the vector (joined from transcription_blocks table)
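
For concreteness, steps 1 and 2 could look like the following migration SQL. This is a sketch only; the source columns (title, summary) are assumed from the issue text, not confirmed against the actual schema:

```sql
-- Step 1: add the column and backfill it (assumed source columns).
ALTER TABLE documents ADD COLUMN search_vector tsvector;

UPDATE documents SET search_vector =
    setweight(to_tsvector('german', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('german', coalesce(summary, '')), 'B');

-- Step 2: GIN index for fast @@ matching.
CREATE INDEX idx_documents_search ON documents USING GIN (search_vector);
```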

Open questions

  • Which fields should feed into the vector, and with what weight? (A = title, B = summary/transcription, C = sender/receiver names, D = tags?)
  • Should we use a trigger (DB-level, always consistent) or application-level update (more control)?
  • How to handle the legacy transcription field vs. transcription_blocks — index both?
  • Multi-language support needed? Most documents are German but some may be in other languages.
marcel added the feature label 2026-04-12 08:36:25 +02:00
Author
Owner

🏗️ Markus Keller — Application Architect

Architecture review covering the open questions in the issue, plus the user's additional goal of making transcription text searchable as a first-class feature.


Resolved

1. PostgreSQL FTS is sufficient — no Elasticsearch needed

  • tsvector / tsquery with the german dictionary covers stemming, relevance ranking, boolean operators, prefix and phrase matching — everything this archive needs
  • At family archive scale (thousands of documents, not millions), adding Elasticsearch would be unjustified operational complexity
  • If OCR noise causes too many missed matches, add pg_trgm trigram search as a fuzzy fallback — still within PostgreSQL, zero additional infrastructure

2. Single search_vector on documents, aggregated from metadata + transcription_blocks

  • documents is the natural search unit — users search for documents, not individual blocks
  • One vector, one GIN index, one ranked result set
  • Source fields: documents metadata + transcription_blocks.text joined at trigger time
  • Legacy transcription field excluded — it will be deleted; transcription_blocks is the canonical source

3. DB trigger on documents and transcription_blocks — not @PreUpdate

  • The OCR pipeline writes blocks one at a time via TranscriptionService, bypassing any JPA lifecycle hooks on Document
  • A trigger on transcription_blocks fires on every insert/update/delete regardless of write source — the only approach that stays consistent
  • Write pattern confirmed from code: blocks are deleted in bulk, then re-inserted one at a time (30–100 inserts per OCR run); each fires the trigger and rebuilds the vector — acceptable cost at this scale, no debouncing needed
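
A minimal sketch of the block-level trigger described above, assuming transcription_blocks has document_id and text columns and documents has id, title, and summary (all names taken from the issue, not verified against the schema):

```sql
CREATE FUNCTION transcription_blocks_sync_search() RETURNS trigger AS $$
DECLARE
    doc_id uuid := coalesce(NEW.document_id, OLD.document_id);
BEGIN
    -- Rebuild the parent document's vector from metadata plus all blocks.
    UPDATE documents d
    SET search_vector =
        setweight(to_tsvector('german', coalesce(d.title, '')), 'A') ||
        setweight(to_tsvector('german', coalesce(
            (SELECT string_agg(b.text, ' ')
             FROM transcription_blocks b
             WHERE b.document_id = doc_id), '')), 'B')
    WHERE d.id = doc_id;
    RETURN NULL;  -- AFTER row trigger; the return value is ignored
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_blocks_sync_search
    AFTER INSERT OR UPDATE OR DELETE ON transcription_blocks
    FOR EACH ROW EXECUTE FUNCTION transcription_blocks_sync_search();
```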

4. Weight assignment

  Weight   Field
  A        title
  B        summary, transcription_blocks.text
  C        sender / receiver names
  D        tags, location

Transcription text stays at B (not A) because OCR output can be noisy; promoting it to A would let recognition errors distort rankings.
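
The weight classes map directly onto ts_rank(): its default weight array is {D, C, B, A} = {0.1, 0.2, 0.4, 1.0}, so an A-weighted title match outranks an identical B-weighted transcription match. Illustrative only:

```sql
-- Same lexeme, different weight class: the A-weighted vector ranks higher.
SELECT ts_rank(setweight(to_tsvector('german', 'Brief an Oma'), 'A'),
               to_tsquery('german', 'Brief')) AS title_rank,
       ts_rank(setweight(to_tsvector('german', 'Brief an Oma'), 'B'),
               to_tsquery('german', 'Brief')) AS transcription_rank;
```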

5. Single german config for all documents

  • Most documents are German; non-German documents are still indexed and findable by exact token even without correct stemming
  • Per-document language column adds trigger complexity and a field to maintain — not justified until there's a real multi-language use case

Overall: the proposal is architecturally sound and self-contained within PostgreSQL. No new infrastructure. The trigger-based sync and single-vector-per-document design are the right calls for this stack and scale.

Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Questions & Observations

TDD approach for triggers and FTS queries

  • Writing a failing test first for a DB trigger is doable with Testcontainers (postgres:16-alpine includes the german dictionary), but it requires careful fixture setup. The key integration test: insert a transcription_block → assert documents.search_vector is updated. That's real-Postgres territory, not a unit test.
  • Where does the ILIKE query live today — a Specification, a @Query, a native query? That determines how invasive the replacement is. If it's in a Specification, the new tsquery path likely needs a native @Query instead, which is a structural change worth flagging up front.

JPA / Hibernate type mapping

  • Hibernate doesn't know what tsvector is. The search_vector column needs @Column(columnDefinition = "tsvector"). Since the trigger owns writes, the entity field should be @Generated(INSERT) (or just excluded from writes) and mapped to String for read-back. Otherwise Hibernate will try to write it on every save() and fail.
  • Is the search returning full Document entities or a projection? Full entities with lazy-loaded relations on a large result set will cause N+1 problems. If the search page only needs title, date, and sender, a projection DTO is safer.

API / error handling

  • to_tsquery('german', userInput) throws a Postgres syntax error on malformed input (unclosed parens, stray &). That becomes a 500 unless caught and mapped to DomainException.badRequest(). What's the error handling strategy for invalid query syntax?
  • Does the search endpoint response format change? If ts_rank() scores are added to responses, npm run generate:api needs to run and TypeScript types will change.

Suggestions

  • Use websearch_to_tsquery('german', ?) for user-facing input — it never throws on malformed input, ANDs plain terms by default, and natively understands OR, - (NOT), and "phrase", without exposing tsquery syntax to users.
  • Keep the API response type unchanged if scores aren't needed client-side — sort on the DB side with ORDER BY ts_rank(...) DESC and avoid touching the TypeScript layer.
  • Test naming suggestions: should_find_document_by_stem_when_searching_inflected_form, should_rank_title_match_above_transcription_match, should_return_empty_when_query_contains_only_stop_words.
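
To illustrate the first suggestion, these queries show how websearch_to_tsquery() parses user-style input under the german config (example terms taken from the issue):

```sql
SELECT websearch_to_tsquery('german', 'Grundstück OR Grundbuch'); -- two terms ORed
SELECT websearch_to_tsquery('german', 'Briefe -Oma');             -- term AND NOT term
SELECT websearch_to_tsquery('german', '"Reichsmark 500"');        -- quoted phrase
```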
Author
Owner

🧪 Sara Holt — QA Engineer

Questions & Observations

Missing acceptance criteria

  • The issue describes what FTS enables but not what "done" looks like as a verifiable test. Before implementation starts: what is the acceptance test for stemming? For ranking? For the german config? Concrete examples: "searching 'Briefe' must return documents containing 'Brief'" is testable. "Stemming works" is not.
  • Regression risk: the current ILIKE matches substrings — "rief" matches "Brief". FTS stemming does not match substrings. Some existing user searches will silently stop working. Is that acceptable, and should there be a test that captures which old behaviors are intentionally broken?

Test coverage plan

  • The trigger is the highest-risk item — it lives outside JPA, so only a real-Postgres integration test (Testcontainers) can catch bugs. Required test: insert a transcription_block → verify documents.search_vector is updated; delete a transcription_block → verify vector is rebuilt.
  • Edge cases I'd want covered as named test methods:
    • Empty query string
    • Query consisting only of German stop words ("und", "der", "die") — Postgres drops them silently, returning no results
    • Umlauts and ß — "Müller", "straße" must work with the german config
    • Document with no transcription_blocks — vector should still work from title/summary
    • Very long query (>1000 chars) — should not cause a timeout or 500
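
The stop-word edge case is easy to demonstrate directly in psql: under the german config, a stop-word-only query parses to an empty tsquery, and an empty tsquery matches nothing:

```sql
SELECT websearch_to_tsquery('german', 'und der die');  -- empty tsquery
SELECT to_tsvector('german', 'ein Brief an Oma')
       @@ websearch_to_tsquery('german', 'und der die');  -- false
```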

Initial backfill path never tested in CI

  • The Flyway migration populates search_vector for existing rows. If the CI database starts empty, this backfill path is never exercised. Add a migration test that inserts a document before the FTS migration runs and asserts the vector is populated after all migrations complete.

Suggestions

  • ts_rank() ordering is non-deterministic for equal-ranked results. Add a secondary sort (e.g. document_date DESC) so result order is stable and test assertions don't depend on Postgres internals.
  • Add should_return_empty_result_when_query_contains_only_stop_words as an explicit test — this is the most likely user-visible surprise and it's easy to miss without a targeted test.
  • Define the test pyramid up front: unit (search service with mocked repo — verify query construction), integration (real Postgres + trigger), E2E (search box finds a document by stem).
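
A sketch of the stable-ordering suggestion, with the primary key as a final deterministic tie-break (column names assumed, not confirmed against the schema):

```sql
SELECT d.id, d.title
FROM documents d
WHERE d.search_vector @@ websearch_to_tsquery('german', :query)
ORDER BY ts_rank(d.search_vector, websearch_to_tsquery('german', :query)) DESC,
         d.document_date DESC,  -- secondary sort for equal-ranked results
         d.id;                  -- deterministic final tie-break
```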
Author
Owner

🔐 Nora Steiner — Security Engineer

Questions & Observations

FTS query injection surface

  • to_tsquery('german', userInput) throws a Postgres syntax error on malformed input (((((, &|, unclosed parens). This is a denial-of-service surface: any unauthenticated or authenticated user can trigger repeated 500 errors with trivial input. It's not SQL injection (no data exfiltration), but it's a reliability issue with a security smell.
  • Fix: websearch_to_tsquery('german', ?) or plainto_tsquery('german', ?) — both never throw on malformed input. websearch_to_tsquery() ANDs plain terms by default and supports OR, - (NOT), and "phrase" in Google-style syntax without exposing raw tsquery operators. plainto_tsquery() treats everything as literal terms.
  • CWE-20 (Improper Input Validation) applies here. The fix is one function name change.
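
The difference is easy to reproduce; only the commented-out statement below fails (malformed inputs taken from the bullet above):

```sql
-- SELECT to_tsquery('german', '((((');          -- ERROR: syntax error in tsquery
SELECT websearch_to_tsquery('german', '((((');    -- parses to an empty tsquery
SELECT websearch_to_tsquery('german', 'Brief &| Oma');
-- stray punctuation is ignored; the remaining terms are simply ANDed
```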

Authorization boundary — routine confirmation

  • The issue doesn't mention permission requirements for the modified search endpoint. This is likely unchanged (@RequirePermission(Permission.READ_ALL) is already on the search controller), but the PR should explicitly confirm this is retained when the repository method is replaced. A @WebMvcTest that asserts 403 for a user without READ_ALL would cover this permanently.

Ranking as an information channel

  • ts_rank() returns a float. If rank scores are exposed in the API response, a user can infer content they haven't read: a high score means a strong match, revealing information about document text. For a family archive this is likely acceptable, but it's an implicit decision worth making explicit. If scores are backend-only (used for ORDER BY), the concern disappears.
  • Future feature note: if ts_headline() snippets are ever added to search results, they return fragments of raw transcription text. Those fragments must respect the same visibility rules as full documents — something to call out in a future issue rather than implement as an afterthought.

Suggestions

  • Use websearch_to_tsquery() as the default. It's the safest option for user-facing input and gives users familiar boolean semantics.
  • Add one regression test: should_not_throw_500_when_query_contains_invalid_tsquery_syntax. If websearch_to_tsquery() is used, this test verifies graceful handling; if someone later changes it to to_tsquery(), the test catches it immediately.
  • Confirm @RequirePermission(READ_ALL) is retained on the search endpoint — one line in the PR description, one permanent test.
Author
Owner

🎨 Leonie Voss — UI/UX Design Lead

Questions & Observations

This issue is backend-focused, but search is the primary user interaction in the archive. A few UX questions worth resolving before shipping.

Behavior changes that will surprise users

  • Current ILIKE matches substrings: typing "rief" finds "Brief", "Friedhof", "Briefe". FTS does not match substrings. Users who learned this shortcut will get zero results and assume the feature is broken. Is there a plan to communicate what changed?
  • Stop words (der, die, das, und, von) are silently dropped by the german config. A user typing "die Briefe von Oma" gets results for "Brief" and "Oma" only — but doesn't know why die and von were ignored. The current empty-state message "Keine Dokumente gefunden" gives no hint.

Discoverability of new capabilities

  • Boolean operators and prefix matching are powerful but invisible. If they're exposed as raw tsquery syntax (Grund:*, Brief & Oma), users have no way to discover them. If they're available via websearch_to_tsquery() syntax (Grundstück OR Grundbuch, -Oma), that's more natural but still undiscovered without a hint.
  • Relevance ranking changes the result order. Users may be confused when a document from 1923 appears before one from 1987. A subtle label — "Sortiert nach Relevanz" — near the results would explain the new behavior.

Opportunities worth noting for a follow-up

  • Matched term highlighting via ts_headline() would make results dramatically more useful for a document archive. Users scanning for "Grundstück" can see the sentence where it appears without opening each document. Worth a separate issue.
  • A "did you mean?" or "no results — try these related terms" state for stop-word-only queries would be a meaningful UX improvement at low implementation cost.

Suggestions

  • At minimum, add a one-line help text near the search input: "Suche nach Wörtern, Namen oder Orten. Verwende AND, OR, - um Begriffe zu kombinieren." — even if only added post-launch.
  • Consider exposing a "?" icon with an info tooltip listing supported search syntax, built around what websearch_to_tsquery() supports.
  • For the empty state when stop-word-only queries are entered, a specific message like "Deine Suche enthält nur häufige Wörter, die ignoriert wurden. Versuche spezifischere Begriffe." would prevent frustrated users.
Author
Owner

🛠️ Tobias Wendt — DevOps & Platform Engineer

Questions & Observations

Initial Flyway migration performance

  • The migration that adds search_vector and backfills existing rows runs inside a transaction. On a live database with hundreds of documents and thousands of transcription_blocks, this transaction holds table locks for the duration of the backfill — potentially blocking application writes during deploy. Has this been tested against a database with realistic data volume?
  • Consider splitting it: one migration for the DDL (ADD COLUMN search_vector tsvector) and a second for the data backfill (UPDATE documents SET search_vector = ...). The backfill can also run outside Flyway's default transaction boundary by marking that migration non-transactional (executeInTransaction=false in a script config file, or overriding canExecuteInTransaction() in a Java migration), avoiding lock contention.
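
A sketch of the split, with hypothetical migration file names and assumed source columns:

```sql
-- V10__add_search_vector_ddl.sql (hypothetical name): DDL only, so the
-- ACCESS EXCLUSIVE lock from ADD COLUMN is released at a fast commit.
ALTER TABLE documents ADD COLUMN search_vector tsvector;
CREATE INDEX idx_documents_search ON documents USING GIN (search_vector);

-- V11__backfill_search_vector.sql (hypothetical name): row rewrites run in
-- their own transaction, holding only row locks.
UPDATE documents SET search_vector =
    setweight(to_tsvector('german', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('german', coalesce(summary, '')), 'B');
```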

german dictionary availability in Testcontainers

  • The test setup uses postgres:16-alpine. Alpine-based PostgreSQL images may not include all language dictionaries by default — the german config requires postgresql-contrib (specifically the tsearch_data files). If the dictionary is missing, FTS queries fail with text search configuration "german" does not exist and integration tests will fail in CI but pass locally if developers use a full Debian-based image.
  • Quick verification: add a SELECT cfgname FROM pg_ts_config WHERE cfgname = 'german' assertion in the migration integration test, or just run the Testcontainers suite once in CI before writing all the tests.

GIN index size

  • A GIN index on a tsvector column can add 20–50% to the indexed table size depending on content density. With transcription text from OCR, the vectors will be large. Worth baselining pg_database_size() and pg_indexes_size('documents') before and after the migration in the deploy notes, so any unexpected growth is caught rather than discovered weeks later when backup times increase.
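
The baseline can be captured with three one-liners in the deploy notes (index name as proposed in the issue):

```sql
SELECT pg_size_pretty(pg_database_size(current_database()));
SELECT pg_size_pretty(pg_indexes_size('documents'));
SELECT pg_size_pretty(pg_relation_size('idx_documents_search'));  -- post-migration only
```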

Trigger overhead on OCR runs

  • The architect confirmed 30–100 transcription_block inserts per OCR run, each firing the trigger to rebuild the full tsvector. For a 100-block document that's 100 recalculations in one transaction. At current scale this is fine, but it would be useful to log the total time spent rebuilding the vector during an OCR run (or emit a metric) so degradation is visible if document count grows.

Suggestions

  • Split the Flyway migration into DDL + backfill steps and add a comment explaining why — the next developer deploying to a production database with real data will thank you.
  • Verify german dictionary presence in the Testcontainers setup early — before writing trigger tests — to avoid a mysterious CI failure that wastes half a day.
  • Add pg_database_size and pg_indexes_size to the PR deploy checklist as pre/post metrics.

## 👨‍💻 Felix Brandt — Implementation Discussion

Working through the open implementation questions before coding starts. All six items resolved.


### Resolved

**1. ILIKE query location**
The text search lives in `DocumentSpecifications.hasText()` — a JPA Criteria API Specification. The Criteria API can't express `search_vector @@ tsquery`, so `hasText()` is replaced via **Option C (two-phase)**:

- A native repository method returns ranked `List<UUID>` from the FTS query
- `hasText()` is replaced by `hasIds(List<UUID>)` — a simple `root.get("id").in(ids)` predicate
- All other Specifications (`isBetween`, `hasSender`, `hasTags`, etc.) remain unchanged

**2. Query function: `websearch_to_tsquery`**
Use `websearch_to_tsquery('german', ?)` for all user input. It never throws on malformed input (no 500 risk) and supports `AND`, `OR`, `-`, and `"phrase"` in Google-style syntax without exposing raw tsquery operators to users.
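Combining points 1 and 2, the ranked-ID query behind the two-phase approach could look roughly like this (a sketch — column and parameter names follow this discussion, not necessarily the final code):

```sql
-- Sketch only: parse Google-style input and rank matches.
-- websearch_to_tsquery maps  word word → AND,  OR → OR,  -word → NOT,
-- "a b" → phrase, and never raises an error on unbalanced syntax.
SELECT d.id
FROM documents d
WHERE d.search_vector @@ websearch_to_tsquery('german', :query)
ORDER BY ts_rank(d.search_vector, websearch_to_tsquery('german', :query)) DESC;
```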

**3. JPA entity mapping**
`search_vector` is **not** mapped in the `Document` entity. The trigger owns it; JPA never touches it. The native FTS query reads it directly in SQL.

**4. Search result type**
Keep `List<Document>` (full entities) — no projection DTO in this PR. Add `@EntityGraph` on the `findAll` call to eager-load `sender`, `receivers`, and `tags`, preventing N+1 queries. A projection DTO is a future optimization.

**5. Sort: RELEVANCE**
Add `RELEVANCE` to the `DocumentSort` enum. It becomes the default when `text` is present and no explicit `sort` param was passed. Rank order is preserved via an in-memory sort: `ids.indexOf(doc.getId())`. The native FTS query uses `ORDER BY ts_rank(...) DESC, document_date DESC` for stable tiebreaking.
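A minimal, self-contained sketch of that in-memory re-sort (class and method names here are illustrative, not the actual project code):

```java
import java.util.Comparator;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

// Sketch: the native FTS query returns ids in rank order; the entities
// loaded by the Specification query are re-ordered to match that ranking.
class RankOrder {
    record Doc(UUID id, String title) {}

    static List<Doc> sortByRank(List<Doc> docs, List<UUID> rankedIds) {
        // indexOf is O(n) per lookup — fine for one result page; a
        // Map<UUID, Integer> would be the next step if pages grow large.
        return docs.stream()
                .sorted(Comparator.comparingInt((Doc d) -> rankedIds.indexOf(d.id())))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        UUID a = UUID.randomUUID(), b = UUID.randomUUID();
        List<Doc> loaded = List.of(new Doc(a, "lower rank"), new Doc(b, "higher rank"));
        List<Doc> sorted = sortByRank(loaded, List.of(b, a));
        System.out.println(sorted.get(0).title()); // the higher-ranked document comes first
    }
}
```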

**6. TDD starting point**
New `DocumentFtsTest` in `repository/`, using the existing `@DataJpaTest` + `@Import({PostgresContainerConfig.class, FlywayConfig.class})` pattern (same as `DocumentSpecificationsTest`). First failing test: insert a `transcription_block` → assert `search_vector @@ plainto_tsquery('german', ...)` returns 1 row. Write the trigger SQL to make it green.
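The assertion at the core of that first test, expressed as plain SQL (the search term is a placeholder; however the test executes it, the shape is the same):

```sql
-- Shape of the first failing assertion: after inserting a transcription_block,
-- the parent document's vector should match that block's text.
SELECT count(*) AS hits
FROM documents d
WHERE d.search_vector @@ plainto_tsquery('german', 'Grundbuch');
-- 0 before the trigger exists, 1 once it rebuilds the vector
```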


### Additional finding: tag hierarchy (issue #221)

No blocking dependency between #221 and #222. The FTS trigger indexes tag names at weight D (`string_agg(t.name, ' ')`); it reads only `t.name` and is unaffected by any `parent_id` column #221 may add. Intentional decision: the trigger does **not** inject parent tag names into the vector — hierarchical tag navigation is handled by the `hasTags()` Specification filter, not by text search. This keeps the two features independent.


### Overall read

Implementation path is clear. The two-phase ID approach keeps the existing filter Specifications intact, `websearch_to_tsquery` removes the only security/reliability risk, and the existing Testcontainers infrastructure means the trigger test can be written immediately. No new infrastructure is needed for this PR.


## Implementation complete ✅

Branch: `feat/issue-222-fts-search`

### What was implemented

**Migration V34** (`efeb913`) — DDL: `search_vector tsvector` column + GIN index on `documents`, trigger function `fn_documents_fts_update()` (BEFORE INSERT/UPDATE, computes the weighted vector: A = title, B = summary + transcription, C = sender/receiver names, D = tags + location), plus AFTER triggers on `transcription_blocks`, `document_receivers`, and `document_tags` that re-index the parent document on changes.
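The weighted assembly inside `fn_documents_fts_update()` could look roughly like this — a sketch assuming column names `title`, `summary`, `transcription`, `location` and a `persons` table behind the receiver join; the actual V34 migration may differ:

```sql
-- Sketch of the trigger body: weights A–D per the scheme above.
NEW.search_vector :=
      setweight(to_tsvector('german', coalesce(NEW.title, '')), 'A')
   || setweight(to_tsvector('german', coalesce(NEW.summary, '') || ' '
                                   || coalesce(NEW.transcription, '')), 'B')
   || setweight(to_tsvector('german', coalesce(
        (SELECT string_agg(p.name, ' ')
         FROM document_receivers dr JOIN persons p ON p.id = dr.person_id
         WHERE dr.document_id = NEW.id), '')), 'C')
   || setweight(to_tsvector('german', coalesce(NEW.location, '') || ' '
                                   || coalesce(
        (SELECT string_agg(t.name, ' ')
         FROM document_tags dt JOIN tags t ON t.id = dt.tag_id
         WHERE dt.document_id = NEW.id), '')), 'D');
RETURN NEW;
```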

**Migration V35** (`3421f32`) — Backfill: `UPDATE documents SET title = title` fires the BEFORE UPDATE trigger for all existing rows and populates `search_vector` with the current state of all join tables.

**Java changes** (`349f74d`):

- `DocumentSort` extended with `RELEVANCE`
- `DocumentRepository.findRankedIdsByFts()` — native query with `websearch_to_tsquery('german', ?)` + `ts_rank` ordering
- `DocumentSpecifications`: `hasText()` removed, `hasIds(List<UUID>)` added
- `DocumentService.searchDocuments()`: two-phase search — FTS returns a ranked `List<UUID>`, the Specification filters the remaining criteria; RELEVANCE is the default when text is present

### Tests

12 new integration tests in `DocumentFtsTest`:

- German FTS config available
- Exact title-word match
- Stemming: "Briefe" finds a document titled "Brief"
- Search over transcription block text
- Trigger rebuild after subsequent block insert and block delete
- Weighting: title match (A) ranks higher than transcription match (B)
- Stop words ("der die das") → 0 results, no error
- Malformed query syntax ("(((") → 0 results, no error
- Search by receiver name (weight C)
- Search by tag name (weight D)

All 895 tests green. JAR build successful.

### Commits

- `efeb913` feat(fts): add search_vector column, GIN index, DB triggers, and FTS repository method (V34)
- `349f74d` feat(fts): replace ILIKE hasText with FTS two-phase search and RELEVANCE sort
- `3421f32` feat(fts): backfill search_vector for all existing documents (V35)
Reference: marcel/familienarchiv#222