feat(search): upgrade to PostgreSQL full-text search with German stemming #237

Merged
marcel merged 8 commits from feat/issue-222-fts-search into main 2026-04-15 12:40:21 +02:00
Owner

Closes #222

Summary

  • Replaces ILIKE %query% with PostgreSQL tsvector/websearch_to_tsquery('german', ?) full-text search
  • Adds search_vector tsvector column (GIN-indexed) on documents, populated by a BEFORE INSERT/UPDATE trigger
  • AFTER triggers on transcription_blocks, document_receivers, document_tags keep the vector in sync
  • Two-phase search: native FTS query returns ranked List<UUID>Specification filters remaining criteria
  • RELEVANCE added to DocumentSort; becomes the default sort when a search term is present
  • websearch_to_tsquery prevents 500 errors on malformed input ((((, stop-word-only queries)

Migrations

Migration Purpose
V34 DDL: column, GIN index, trigger function, 4 triggers
V35 Backfill: UPDATE documents SET title = title rebuilds vector for all existing rows

Test plan

  • 12 new integration tests in DocumentFtsTest (Testcontainers, postgres:16-alpine)
  • German stemming: „Briefe" finds document with title „Brief"
  • Trigger rebuild after block insert / block delete
  • Weight ranking: title match (A) ranks above transcription match (B)
  • Stop-word-only query → 0 results, no exception
  • Malformed syntax → 0 results, no exception
  • Receiver name search (weight C), tag name search (weight D)
  • All 895 tests green, JAR build successful

🤖 Generated with Claude Code

Closes #222 ## Summary - Replaces `ILIKE %query%` with PostgreSQL `tsvector`/`websearch_to_tsquery('german', ?)` full-text search - Adds `search_vector tsvector` column (GIN-indexed) on `documents`, populated by a BEFORE INSERT/UPDATE trigger - AFTER triggers on `transcription_blocks`, `document_receivers`, `document_tags` keep the vector in sync - Two-phase search: native FTS query returns ranked `List<UUID>` → `Specification` filters remaining criteria - `RELEVANCE` added to `DocumentSort`; becomes the default sort when a search term is present - `websearch_to_tsquery` prevents 500 errors on malformed input (`(((`, stop-word-only queries) ## Migrations | Migration | Purpose | |---|---| | V34 | DDL: column, GIN index, trigger function, 4 triggers | | V35 | Backfill: `UPDATE documents SET title = title` rebuilds vector for all existing rows | ## Test plan - [x] 12 new integration tests in `DocumentFtsTest` (Testcontainers, postgres:16-alpine) - [x] German stemming: „Briefe" finds document with title „Brief" - [x] Trigger rebuild after block insert / block delete - [x] Weight ranking: title match (A) ranks above transcription match (B) - [x] Stop-word-only query → 0 results, no exception - [x] Malformed syntax → 0 results, no exception - [x] Receiver name search (weight C), tag name search (weight D) - [x] All 895 tests green, JAR build successful 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns


Blockers

useRankOrder silently overrides explicit DATE sort (DocumentService.java)

boolean useRankOrder = hasText && (sort == null || sort == DocumentSort.RELEVANCE || sort == DocumentSort.DATE);

When a user explicitly selects sort=DATE, the code re-sorts by FTS rank instead. That's unexpected — the user asked for chronological order and got relevance. Consequence: the sort picker in the frontend is silently ignored whenever a text query is active.

Fix: treat null and RELEVANCE as rank order, but let DATE sort chronologically:

boolean useRankOrder = hasText && (sort == null || sort == DocumentSort.RELEVANCE);

Suggestions

ids.indexOf() is O(n) per document — O(n²) overall

return results.stream()
    .sorted(Comparator.comparingInt(doc -> {
        int idx = ids.indexOf(doc.getId());   // O(n) linear scan per document
        return idx < 0 ? Integer.MAX_VALUE : idx;
    }))
    .toList();

For a family archive this is fine. At scale, swap for a Map<UUID, Integer>:

final Map<UUID, Integer> rankMap = new HashMap<>();
for (int i = 0; i < ids.size(); i++) rankMap.put(ids.get(i), i);
return results.stream()
    .sorted(Comparator.comparingInt(doc -> rankMap.getOrDefault(doc.getId(), Integer.MAX_VALUE)))
    .toList();

Missing sender name search test

DocumentFtsTest tests receiver name search (weight C) but not sender name. A sender is also indexed at weight C:

setweight(to_tsvector('german', coalesce((
    SELECT coalesce(p.first_name, '') || ' ' || p.last_name
    FROM persons p WHERE p.id = NEW.sender_id
), '')), 'C')

Add should_find_document_by_sender_name() for symmetry.


Test setup: annotation helper adds noise

Each transcription block test requires creating a DocumentAnnotation first (needed for the FK). The annotation() helper is clean and well-named — no issue there. Good factory pattern overall.


What's solid:

  • Test names read as sentences (should_rebuild_vector_when_transcription_block_deleted)
  • document(String) factory avoids builder repetition
  • @BeforeEach delete order correct: blocks → docs → persons/tags
  • No-op lambda (root, query, cb) -> null for Spring Data 3.x where(null) workaround is clean
  • websearch_to_tsquery safety test passes
## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ⚠️ Approved with concerns** --- ### Blockers **`useRankOrder` silently overrides explicit DATE sort** (`DocumentService.java`) ```java boolean useRankOrder = hasText && (sort == null || sort == DocumentSort.RELEVANCE || sort == DocumentSort.DATE); ``` When a user explicitly selects `sort=DATE`, the code re-sorts by FTS rank instead. That's unexpected — the user asked for chronological order and got relevance. Consequence: the sort picker in the frontend is silently ignored whenever a text query is active. Fix: treat null and RELEVANCE as rank order, but let DATE sort chronologically: ```java boolean useRankOrder = hasText && (sort == null || sort == DocumentSort.RELEVANCE); ``` --- ### Suggestions **`ids.indexOf()` is O(n) per document — O(n²) overall** ```java return results.stream() .sorted(Comparator.comparingInt(doc -> { int idx = ids.indexOf(doc.getId()); // O(n) linear scan per document return idx < 0 ? Integer.MAX_VALUE : idx; })) .toList(); ``` For a family archive this is fine. At scale, swap for a `Map<UUID, Integer>`: ```java final Map<UUID, Integer> rankMap = new HashMap<>(); for (int i = 0; i < ids.size(); i++) rankMap.put(ids.get(i), i); return results.stream() .sorted(Comparator.comparingInt(doc -> rankMap.getOrDefault(doc.getId(), Integer.MAX_VALUE))) .toList(); ``` --- **Missing sender name search test** `DocumentFtsTest` tests receiver name search (weight C) but not sender name. A sender is also indexed at weight C: ```sql setweight(to_tsvector('german', coalesce(( SELECT coalesce(p.first_name, '') || ' ' || p.last_name FROM persons p WHERE p.id = NEW.sender_id ), '')), 'C') ``` Add `should_find_document_by_sender_name()` for symmetry. --- **Test setup: annotation helper adds noise** Each transcription block test requires creating a `DocumentAnnotation` first (needed for the FK). The `annotation()` helper is clean and well-named — no issue there. Good factory pattern overall. --- **What's solid:** - Test names read as sentences (`should_rebuild_vector_when_transcription_block_deleted`) ✅ - `document(String)` factory avoids builder repetition ✅ - `@BeforeEach` delete order correct: blocks → docs → persons/tags ✅ - No-op lambda `(root, query, cb) -> null` for Spring Data 3.x `where(null)` workaround is clean ✅ - `websearch_to_tsquery` safety test passes ✅
Author
Owner

🏛️ Markus Keller — Senior Application Architect

Verdict: ⚠️ Approved with concerns


Concerns

The "boring technology" choice is excellent — PostgreSQL FTS is the right call. No Elasticsearch, no external service, no new infrastructure. Exactly the kind of decision I'd make.


DATE sort semantics conflict with RELEVANCE (same as Felix's blocker)

When sort == DocumentSort.DATE and text is present, the service silently applies rank ordering. This conflates two user intents:

  • User selects RELEVANCE → "show me the most relevant first"
  • User selects DATE → "show me the most recent first" (gets relevance instead)

The fix is behavioral, not architectural — but it's a correctness issue that will confuse frontend developers building the sort picker.


In-memory sort over potentially large result sets

List<Document> results = documentRepository.findAll(spec);  // loads all into heap
return results.stream().sorted(...).toList();

The results are bounded by the FTS result count (which is already filtered by hasIds(rankedIds)), so this is not an unbounded memory load. For a family archive with hundreds of documents, fine. Worth noting for future reference.


Architecture: trigger-based sync is the right pattern here

The trigger approach correctly decouples the OCR pipeline (which writes transcription_blocks directly without going through JPA) from the search index. A @PreUpdate hook would have missed those writes entirely. Good architectural reasoning.

-- AFTER triggers on join tables re-fire the BEFORE UPDATE trigger on documents
-- This is clever but slightly magical. The comment in V34 explains it well.
UPDATE documents SET title = title WHERE id = v_doc_id;

The SET title = title no-op is an established PostgreSQL pattern. The comment explaining why it's needed is essential — keep it.


No ADR for a significant architectural decision

Adding a GIN-indexed tsvector column with 4 triggers that affect every document write is a meaningful architectural commitment. It should be captured in docs/adr/ — at minimum: context, decision, the trigger-cascade approach, and why websearch_to_tsquery over to_tsquery.

This is a suggestion, not a blocker for merge. Write it as a follow-up.


Layer boundaries respected

  • DocumentService calls documentRepository.findRankedIdsByFts() — no native SQL in the service layer
  • DocumentSpecifications.hasIds() remains a pure Specification — no business logic
  • Trigger logic lives entirely in the migration — correct level for database-enforced invariants

resolveSort() handles RELEVANCE cleanly

if (sort == null || sort == DocumentSort.DATE || sort == DocumentSort.RELEVANCE) {
    return Sort.by(direction, "documentDate");
}

This fallback is correct when no text is present — RELEVANCE degrades to DATE gracefully.

## 🏛️ Markus Keller — Senior Application Architect **Verdict: ⚠️ Approved with concerns** --- ### Concerns **The "boring technology" choice is excellent — PostgreSQL FTS is the right call.** No Elasticsearch, no external service, no new infrastructure. Exactly the kind of decision I'd make. --- **DATE sort semantics conflict with RELEVANCE** (same as Felix's blocker) When `sort == DocumentSort.DATE` and text is present, the service silently applies rank ordering. This conflates two user intents: - User selects RELEVANCE → "show me the most relevant first" ✅ - User selects DATE → "show me the most recent first" ❌ (gets relevance instead) The fix is behavioral, not architectural — but it's a correctness issue that will confuse frontend developers building the sort picker. --- **In-memory sort over potentially large result sets** ```java List<Document> results = documentRepository.findAll(spec); // loads all into heap return results.stream().sorted(...).toList(); ``` The results are bounded by the FTS result count (which is already filtered by `hasIds(rankedIds)`), so this is not an unbounded memory load. For a family archive with hundreds of documents, fine. Worth noting for future reference. --- **Architecture: trigger-based sync is the right pattern here** ✅ The trigger approach correctly decouples the OCR pipeline (which writes `transcription_blocks` directly without going through JPA) from the search index. A `@PreUpdate` hook would have missed those writes entirely. Good architectural reasoning. ```sql -- AFTER triggers on join tables re-fire the BEFORE UPDATE trigger on documents -- This is clever but slightly magical. The comment in V34 explains it well. UPDATE documents SET title = title WHERE id = v_doc_id; ``` The `SET title = title` no-op is an established PostgreSQL pattern. The comment explaining why it's needed is essential — keep it. --- **No ADR for a significant architectural decision** Adding a GIN-indexed `tsvector` column with 4 triggers that affect every document write is a meaningful architectural commitment. It should be captured in `docs/adr/` — at minimum: context, decision, the trigger-cascade approach, and why `websearch_to_tsquery` over `to_tsquery`. This is a suggestion, not a blocker for merge. Write it as a follow-up. --- **Layer boundaries respected** ✅ - `DocumentService` calls `documentRepository.findRankedIdsByFts()` — no native SQL in the service layer ✅ - `DocumentSpecifications.hasIds()` remains a pure Specification — no business logic ✅ - Trigger logic lives entirely in the migration — correct level for database-enforced invariants ✅ --- **`resolveSort()` handles `RELEVANCE` cleanly** ✅ ```java if (sort == null || sort == DocumentSort.DATE || sort == DocumentSort.RELEVANCE) { return Sort.by(direction, "documentDate"); } ``` This fallback is correct when no text is present — RELEVANCE degrades to DATE gracefully.
Author
Owner

🧪 Sara Holt — QA Engineer & Test Strategist

Verdict: ⚠️ Approved with concerns


Blockers

None. The test suite is solid and uses real PostgreSQL via Testcontainers.


Concerns

Missing sender name search test

Receiver name search is tested (should_find_document_by_receiver_name). Sender name is also indexed at the same weight (C), but there is no corresponding should_find_document_by_sender_name test. This is a symmetry gap — the trigger code for sender indexing is exercised but not validated.


No integration test for two-phase search with combined filters

DocumentFtsTest tests the repository's FTS query in isolation. There is no test that exercises the full searchDocuments() path: FTS + a Specification filter (e.g. date range, tag, status) applied together. A regression in hasIds() combination or the no-op lambda (root, query, cb) -> null would not be caught.

Suggested test (using DocumentService or directly via Specification.where(textSpec).and(hasStatus(...))):

@Test
void should_combine_fts_with_status_filter() {
    // one UPLOADED, one PLACEHOLDER — same title
    // FTS matches both; status filter should return only UPLOADED
}

No @Transactional on test methods

Tests use @BeforeEach deleteAll() for cleanup instead of rollback. This means each test commits its data, then the next test deletes it — slower and leaves debris if a test crashes mid-run. For @DataJpaTest with Testcontainers, @Transactional on test methods would auto-rollback after each test with zero cleanup code.

Not a blocker — the deleteAll() approach is safe and is used in the existing DocumentSpecificationsTest too.


What's solid

  • Real Postgres via Testcontainers — no H2
  • 12 tests covering all FTS features: stemming, trigger rebuild, weight ranking, edge cases
  • document(String) factory removes builder repetition from every test body
  • Test names read as behaviors: should_rebuild_vector_when_transcription_block_deleted
  • Guard test german_text_search_config_is_available catches config drift early
  • assertThatNoException() for the malformed query test — exactly right, tests the safety contract
  • DocumentSpecificationsTest cleanup: hasText_* tests removed without breaking the 25 remaining
  • Delete order in @BeforeEach is correct: blockRepository.deleteAll() before documentRepository.deleteAll() — respects FK constraints
## 🧪 Sara Holt — QA Engineer & Test Strategist **Verdict: ⚠️ Approved with concerns** --- ### Blockers None. The test suite is solid and uses real PostgreSQL via Testcontainers. --- ### Concerns **Missing sender name search test** Receiver name search is tested (`should_find_document_by_receiver_name`). Sender name is also indexed at the same weight (C), but there is no corresponding `should_find_document_by_sender_name` test. This is a symmetry gap — the trigger code for sender indexing is exercised but not validated. --- **No integration test for two-phase search with combined filters** `DocumentFtsTest` tests the repository's FTS query in isolation. There is no test that exercises the full `searchDocuments()` path: FTS + a Specification filter (e.g. date range, tag, status) applied together. A regression in `hasIds()` combination or the no-op lambda `(root, query, cb) -> null` would not be caught. Suggested test (using `DocumentService` or directly via `Specification.where(textSpec).and(hasStatus(...))`): ```java @Test void should_combine_fts_with_status_filter() { // one UPLOADED, one PLACEHOLDER — same title // FTS matches both; status filter should return only UPLOADED } ``` --- **No `@Transactional` on test methods** Tests use `@BeforeEach deleteAll()` for cleanup instead of rollback. This means each test commits its data, then the next test deletes it — slower and leaves debris if a test crashes mid-run. For `@DataJpaTest` with Testcontainers, `@Transactional` on test methods would auto-rollback after each test with zero cleanup code. Not a blocker — the `deleteAll()` approach is safe and is used in the existing `DocumentSpecificationsTest` too. --- ### What's solid - Real Postgres via Testcontainers — no H2 ✅ - 12 tests covering all FTS features: stemming, trigger rebuild, weight ranking, edge cases ✅ - `document(String)` factory removes builder repetition from every test body ✅ - Test names read as behaviors: `should_rebuild_vector_when_transcription_block_deleted` ✅ - Guard test `german_text_search_config_is_available` catches config drift early ✅ - `assertThatNoException()` for the malformed query test — exactly right, tests the safety contract ✅ - `DocumentSpecificationsTest` cleanup: `hasText_*` tests removed without breaking the 25 remaining ✅ - Delete order in `@BeforeEach` is correct: `blockRepository.deleteAll()` before `documentRepository.deleteAll()` — respects FK constraints ✅
Author
Owner

🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: Approved


Reviewed for: SQL injection, query injection, input reflection, data exposure, permission bypass, log injection.


Findings

No vulnerabilities found.


What I checked

Native query uses named parameter — parameterized, not concatenated

@Query(nativeQuery = true, value = """
    SELECT d.id FROM documents d
    WHERE d.search_vector @@ websearch_to_tsquery('german', :query)
    ORDER BY ts_rank(d.search_vector, websearch_to_tsquery('german', :query)) DESC,
             d.meta_date DESC NULLS LAST
    """)
List<UUID> findRankedIdsByFts(@Param("query") String query);

:query is a JDBC bind parameter. The user-supplied string is passed as a parameter value, never interpolated into the query string. No SQL injection surface here.

websearch_to_tsquery is the safe FTS query parser

to_tsquery('bad input (((') throws a PostgreSQL syntax error. websearch_to_tsquery never throws — it silently drops invalid tokens. This prevents user-supplied malformed queries from causing 500 errors (CWE-20 mitigation). The test should_not_throw_when_query_contains_invalid_tsquery_syntax verifies this contract.

Stop-word-only queries return empty results, not an error

websearch_to_tsquery('german', 'der die das') produces an empty tsquery. The @@ operator returns false for all rows — 0 results, no exception. Verified by should_return_empty_when_query_contains_only_stop_words.

No user input reflected into responses

The FTS query is used only in the WHERE/ORDER BY clause. The search term is not reflected back into the API response, so no XSS surface from this code path.

No permission changes

searchDocuments() is called through DocumentController, which already has @RequirePermission(Permission.READ_ALL). The FTS path does not introduce any new endpoint or bypass.

No logging of raw user input

The search term is not logged at any point in DocumentService — no log injection risk (CWE-117).


One note (not a finding)

The search_vector column is stored in plaintext as a weighted lexeme vector ('brief':1A 'anna':3C). For a family archive this is appropriate — the index reflects the same content as the indexed columns. Worth noting if the project ever adds confidential metadata fields that should not appear in the search index.

## 🔐 Nora "NullX" Steiner — Application Security Engineer **Verdict: ✅ Approved** --- Reviewed for: SQL injection, query injection, input reflection, data exposure, permission bypass, log injection. --- ### Findings **No vulnerabilities found.** --- ### What I checked **Native query uses named parameter — parameterized, not concatenated** ✅ ```java @Query(nativeQuery = true, value = """ SELECT d.id FROM documents d WHERE d.search_vector @@ websearch_to_tsquery('german', :query) ORDER BY ts_rank(d.search_vector, websearch_to_tsquery('german', :query)) DESC, d.meta_date DESC NULLS LAST """) List<UUID> findRankedIdsByFts(@Param("query") String query); ``` `:query` is a JDBC bind parameter. The user-supplied string is passed as a parameter value, never interpolated into the query string. No SQL injection surface here. **`websearch_to_tsquery` is the safe FTS query parser** ✅ `to_tsquery('bad input (((')` throws a PostgreSQL syntax error. `websearch_to_tsquery` never throws — it silently drops invalid tokens. This prevents user-supplied malformed queries from causing 500 errors (CWE-20 mitigation). The test `should_not_throw_when_query_contains_invalid_tsquery_syntax` verifies this contract. **Stop-word-only queries return empty results, not an error** ✅ `websearch_to_tsquery('german', 'der die das')` produces an empty tsquery. The `@@` operator returns false for all rows — 0 results, no exception. Verified by `should_return_empty_when_query_contains_only_stop_words`. **No user input reflected into responses** ✅ The FTS query is used only in the WHERE/ORDER BY clause. The search term is not reflected back into the API response, so no XSS surface from this code path. **No permission changes** ✅ `searchDocuments()` is called through `DocumentController`, which already has `@RequirePermission(Permission.READ_ALL)`. The FTS path does not introduce any new endpoint or bypass. **No logging of raw user input** ✅ The search term is not logged at any point in `DocumentService` — no log injection risk (CWE-117). --- ### One note (not a finding) The `search_vector` column is stored in plaintext as a weighted lexeme vector (`'brief':1A 'anna':3C`). For a family archive this is appropriate — the index reflects the same content as the indexed columns. Worth noting if the project ever adds confidential metadata fields that should not appear in the search index.
Author
Owner

🎨 Leonie Voss — UI/UX Design Lead & Accessibility Strategist

Verdict: Approved


This is a backend-only PR. No Svelte components, no HTML, no CSS, no i18n strings were changed. Nothing to flag in my domain.


What I noted for follow-up

The sort picker will need to expose RELEVANCE

The new DocumentSort.RELEVANCE enum value needs a label in the frontend sort dropdown. When it ships:

  • The label should be something like "Relevanz" (de) / "Relevance" (en) / "Relevancia" (es)
  • The option should only be visible (or auto-selected) when a search term is active — it makes no sense as a sort mode for unfiltered browsing
  • The current default behaviour (RELEVANCE becomes active when text is present with no explicit sort) is a sensible UX pattern, but the user should be able to see that it's active

Behavioral change users will notice

Before: searching returned results in date order with partial matches anywhere in the word.
After: searching returns results in relevance order — title matches rank above transcription matches.

This is a UX improvement, but the visual affordance (a sort indicator or label) should communicate that results are ordered by relevance. A "sorted by relevance" label near the result count would close the loop for users wondering why the order changed.


These are observations for follow-up issues, not concerns with this PR.

## 🎨 Leonie Voss — UI/UX Design Lead & Accessibility Strategist **Verdict: ✅ Approved** --- This is a backend-only PR. No Svelte components, no HTML, no CSS, no i18n strings were changed. Nothing to flag in my domain. --- ### What I noted for follow-up **The sort picker will need to expose `RELEVANCE`** The new `DocumentSort.RELEVANCE` enum value needs a label in the frontend sort dropdown. When it ships: - The label should be something like "Relevanz" (de) / "Relevance" (en) / "Relevancia" (es) - The option should only be visible (or auto-selected) when a search term is active — it makes no sense as a sort mode for unfiltered browsing - The current default behaviour (RELEVANCE becomes active when text is present with no explicit sort) is a sensible UX pattern, but the user should be able to see that it's active **Behavioral change users will notice** Before: searching returned results in date order with partial matches anywhere in the word. After: searching returns results in relevance order — title matches rank above transcription matches. This is a UX improvement, but the visual affordance (a sort indicator or label) should communicate that results are ordered by relevance. A "sorted by relevance" label near the result count would close the loop for users wondering why the order changed. --- These are observations for follow-up issues, not concerns with this PR.
Author
Owner

⚙️ Tobias Wendt — DevOps & Platform Engineer

Verdict: Approved


What I checked

  • Flyway migration ordering and naming
  • Table lock risk in backfill migration
  • Trigger write overhead on INSERT/UPDATE
  • No new infrastructure dependencies
  • No Docker Compose, CI, or environment changes

Findings

V35 backfill will table-lock documents during migration

-- V35__backfill_fts_search_vector.sql
UPDATE documents SET title = title;

This is a single statement that touches every row. PostgreSQL acquires a ROW EXCLUSIVE lock on the table for the duration — no reads are blocked (SELECT), but concurrent writes are queued. For a family archive with a few hundred rows this completes in milliseconds. At larger scale (10k+ rows) this would need batching. Fine for current scale.

Every document INSERT/UPDATE now runs 4 subqueries

The BEFORE trigger fn_documents_fts_update() executes subqueries against transcription_blocks, persons (×2), document_receivers, document_tags, and tag on every row write. That's the cost of keeping the vector current. For a family archive workload (low write volume), negligible. The GIN index pays back on every search.

No new infrastructure

No new Docker services, volumes, or environment variables. The GIN index is pure PostgreSQL. Flyway handles both migrations in sequence automatically.

Migration split is correct

V34 (DDL) and V35 (backfill) are correctly split into separate files. If V35 fails (e.g. mid-backfill on a large table), Flyway marks it as failed and the DDL stays intact. Backfill can be re-run or applied manually without touching the schema.

Column is nullable — correct for a migration scenario

ALTER TABLE documents ADD COLUMN search_vector tsvector; without NOT NULL is intentional. V34 installs the trigger, V35 backfills. Between the two migrations, some rows have NULL search_vector. The @@ operator returns NULL (not true) for NULL vectors, so no incorrect search results during the migration window.

Operational summary

No config changes needed. No downtime required beyond the normal migration window. Deploy as usual: Flyway runs V34 → V35 at startup, both complete before the app accepts traffic.

## ⚙️ Tobias Wendt — DevOps & Platform Engineer **Verdict: ✅ Approved** --- ### What I checked - Flyway migration ordering and naming ✅ - Table lock risk in backfill migration - Trigger write overhead on INSERT/UPDATE - No new infrastructure dependencies - No Docker Compose, CI, or environment changes --- ### Findings **V35 backfill will table-lock `documents` during migration** ```sql -- V35__backfill_fts_search_vector.sql UPDATE documents SET title = title; ``` This is a single statement that touches every row. PostgreSQL acquires a ROW EXCLUSIVE lock on the table for the duration — no reads are blocked (SELECT), but concurrent writes are queued. For a family archive with a few hundred rows this completes in milliseconds. At larger scale (10k+ rows) this would need batching. Fine for current scale. **Every document INSERT/UPDATE now runs 4 subqueries** The BEFORE trigger `fn_documents_fts_update()` executes subqueries against `transcription_blocks`, `persons` (×2), `document_receivers`, `document_tags`, and `tag` on every row write. That's the cost of keeping the vector current. For a family archive workload (low write volume), negligible. The GIN index pays back on every search. **No new infrastructure** ✅ No new Docker services, volumes, or environment variables. The GIN index is pure PostgreSQL. Flyway handles both migrations in sequence automatically. **Migration split is correct** ✅ V34 (DDL) and V35 (backfill) are correctly split into separate files. If V35 fails (e.g. mid-backfill on a large table), Flyway marks it as failed and the DDL stays intact. Backfill can be re-run or applied manually without touching the schema. **Column is nullable** — correct for a migration scenario ✅ `ALTER TABLE documents ADD COLUMN search_vector tsvector;` without NOT NULL is intentional. V34 installs the trigger, V35 backfills. Between the two migrations, some rows have `NULL` search_vector. The `@@` operator returns NULL (not true) for NULL vectors, so no incorrect search results during the migration window. **Operational summary** ✅ No config changes needed. No downtime required beyond the normal migration window. Deploy as usual: Flyway runs V34 → V35 at startup, both complete before the app accepts traffic.
Author
Owner

Review concerns addressed

All open concerns from the multi-persona review have been resolved. Here is the full accounting:


Blocker fixed — DATE sort overridden by relevance (Felix + Markus)

Commit: e621bddfix(search): respect DATE sort when text is present

Removed || sort == DocumentSort.DATE from the useRankOrder condition. Explicit DATE sort now correctly routes through findAll(spec, sort) instead of the rank-reordering path.


Test added — searchDocuments with DATE sort (Felix + Markus)

Commit: e621bdd — same commit, new file DocumentServiceSortTest

Three unit tests using @ExtendWith(MockitoExtension.class) covering:

  • DATE sort with text → chronological order, not rank order (was the failing red test that proved the bug)
  • RELEVANCE sort → FTS rank order preserved
  • null sort → defaults to FTS rank order

Refactor — O(n²) ids.indexOf() replaced with HashMap (Felix)

Commit: 5affe21refactor(search): replace O(n²) indexOf with HashMap for rank ordering

Rank lookup is now O(1) per document via Map<UUID, Integer>.


Test added — sender name FTS coverage (Felix + Sara)

Commit: 13955a5test(search): add sender name FTS coverage and combined filter test

should_find_document_by_sender_name — symmetric with the existing receiver name test. Confirms the V34 trigger indexes sender names at weight C.


Test added — combined FTS + Specification filter (Sara)

Commit: 13955a5 — same commit

fts_combined_with_status_filter_excludes_non_matching_status — verifies the two-phase search (hasIds(rankedIds).and(hasStatus(...))) works end-to-end.


Out of scope (no code change)

  • ADR for trigger-based FTS — follow-up documentation task
  • Frontend RELEVANCE sort label — separate PR
  • @Transactional on integration tests — consistent with existing pattern, not a blocker

All 14 DocumentFtsTest tests green. All 3 DocumentServiceSortTest tests green.

## Review concerns addressed All open concerns from the multi-persona review have been resolved. Here is the full accounting: --- ### Blocker fixed — DATE sort overridden by relevance (Felix + Markus) **Commit:** `e621bdd` — `fix(search): respect DATE sort when text is present` Removed `|| sort == DocumentSort.DATE` from the `useRankOrder` condition. Explicit DATE sort now correctly routes through `findAll(spec, sort)` instead of the rank-reordering path. --- ### Test added — `searchDocuments` with DATE sort (Felix + Markus) **Commit:** `e621bdd` — same commit, new file `DocumentServiceSortTest` Three unit tests using `@ExtendWith(MockitoExtension.class)` covering: - DATE sort with text → chronological order, not rank order (was the failing red test that proved the bug) - RELEVANCE sort → FTS rank order preserved - null sort → defaults to FTS rank order --- ### Refactor — O(n²) `ids.indexOf()` replaced with HashMap (Felix) **Commit:** `5affe21` — `refactor(search): replace O(n²) indexOf with HashMap for rank ordering` Rank lookup is now O(1) per document via `Map<UUID, Integer>`. --- ### Test added — sender name FTS coverage (Felix + Sara) **Commit:** `13955a5` — `test(search): add sender name FTS coverage and combined filter test` `should_find_document_by_sender_name` — symmetric with the existing receiver name test. Confirms the V34 trigger indexes sender names at weight C. --- ### Test added — combined FTS + Specification filter (Sara) **Commit:** `13955a5` — same commit `fts_combined_with_status_filter_excludes_non_matching_status` — verifies the two-phase search (`hasIds(rankedIds).and(hasStatus(...))`) works end-to-end. --- ### Out of scope (no code change) - ADR for trigger-based FTS — follow-up documentation task - Frontend RELEVANCE sort label — separate PR - `@Transactional` on integration tests — consistent with existing pattern, not a blocker --- **All 14 `DocumentFtsTest` tests green. All 3 `DocumentServiceSortTest` tests green.**
marcel added 6 commits 2026-04-15 11:35:35 +02:00
- V34 migration: adds search_vector tsvector column with GIN index
- BEFORE INSERT/UPDATE trigger on documents rebuilds vector from title (A),
  summary + transcription_blocks.text (B), sender/receiver names (C),
  tag names + location (D) using german FTS config
- AFTER triggers on transcription_blocks, document_receivers, document_tags
  touch the parent document row to re-fire the BEFORE UPDATE trigger
- DocumentRepository.findRankedIdsByFts() native query using websearch_to_tsquery
- DocumentFtsTest: 12 integration tests covering stemming, trigger sync,
  ranking, stop words, malformed input, receiver and tag search

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- DocumentSort: add RELEVANCE enum value
- DocumentSpecifications: remove hasText() ILIKE, add hasIds(List<UUID>)
  for FTS-pre-filtered ID sets
- DocumentService.searchDocuments(): FTS two-phase path — findRankedIdsByFts()
  returns ranked UUIDs, hasIds() narrows subsequent Specification query,
  in-memory re-sort preserves rank order; RELEVANCE is the default when
  text is present and no explicit non-relevance sort is requested
- DocumentSpecificationsTest: remove hasText() tests (Specification removed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fires the BEFORE UPDATE trigger for every documents row, which recomputes
the tsvector from all currently-linked metadata, blocks, receivers, and tags.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a user explicitly selects DATE sort with a text query active, the
previous code treated it identically to RELEVANCE, silently discarding
the user's sort choice. Remove DATE from the useRankOrder condition so
that explicit DATE sort always goes through the standard JPA sort path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ids.indexOf() scans the full list for each document, giving O(n²) total.
Build a Map<UUID, Integer> once at O(n) and use getOrDefault at O(1) per
document. Behavior is identical; existing tests remain green.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test(search): add sender name FTS coverage and combined filter test
Some checks failed
CI / Unit & Component Tests (push) Failing after 3s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1m57s
CI / Backend Unit Tests (pull_request) Failing after 3m0s
305f95a572
- should_find_document_by_sender_name — symmetric with existing receiver test
- fts_combined_with_status_filter_excludes_non_matching_status — verifies
  hasIds(rankedIds).and(hasStatus(...)) two-phase search works together

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel force-pushed feat/issue-222-fts-search from 13955a5459 to 305f95a572 2026-04-15 11:35:35 +02:00 Compare
marcel added 1 commit 2026-04-15 12:17:00 +02:00
fix(lint): exclude project.inlang/ from Prettier
Some checks failed
CI / Unit & Component Tests (push) Successful in 3m49s
CI / Backend Unit Tests (push) Failing after 2m42s
CI / Unit & Component Tests (pull_request) Successful in 3m46s
CI / Backend Unit Tests (pull_request) Failing after 2m42s
793e632889
Inlang regenerates .meta.json and README.md on every compilation run.
The regenerated files fail Prettier in CI because the tool writes its
own formatting, not ours.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel added 1 commit 2026-04-15 12:29:41 +02:00
fix(ci): pin DOCKER_API_VERSION=1.43 for Testcontainers on NAS runner
Some checks failed
CI / Unit & Component Tests (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
CI / Unit & Component Tests (push) Successful in 3m41s
CI / Backend Unit Tests (push) Failing after 2m41s
4b8e0637ce
Testcontainers 2.0.2 (via Spring Boot 4.0) negotiates Docker API 1.44,
but the NAS runner has Docker Engine 24.x which caps at 1.43. Forcing
the client version down unblocks tests until Docker is upgraded on the NAS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel merged commit 4b8e0637ce into main 2026-04-15 12:40:21 +02:00
marcel deleted branch feat/issue-222-fts-search 2026-04-15 12:40:22 +02:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#237