feat(search): show search result snippets with match highlighting (#219) #242

Merged
marcel merged 20 commits from feat/issue-219-search-snippets into main 2026-04-16 09:10:11 +02:00
Owner

Summary

  • Backend: Adds per-document match metadata to the search API response — titleOffsets (char-level highlight positions from ts_headline), transcriptionSnippet (best matching block via lateral join), senderMatched, matchedReceiverIds, matchedTagIds
  • Frontend utility: New applyOffsets(text, offsets) function converts flat offsets into typed TextSegment[] (sorted, merged, clamped) for safe rendering without {@html}
  • Frontend rendering: DocumentList renders <mark> elements for highlighted title terms and shows a transcription snippet below each result card when a match exists
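The offsets-to-segments conversion can be sketched as follows. This is an illustrative TypeScript reimplementation of the behavior described above (sorted, merged, clamped); the names `MatchOffset` and `TextSegment` match the summary, but the exact shapes are assumed, not copied from the diff:

```typescript
interface MatchOffset { start: number; length: number }
interface TextSegment { text: string; highlight: boolean }

function applyOffsets(text: string, offsets: MatchOffset[]): TextSegment[] {
  // Normalize: clamp spans to the text bounds, drop empty spans, sort by start.
  const spans = offsets
    .map(o => ({ start: Math.max(0, o.start), end: Math.min(text.length, o.start + o.length) }))
    .filter(s => s.end > s.start)
    .sort((a, b) => a.start - b.start);

  // Merge overlapping/adjacent spans so rendering never nests <mark> elements.
  const merged: { start: number; end: number }[] = [];
  for (const s of spans) {
    const last = merged[merged.length - 1];
    if (last && s.start <= last.end) last.end = Math.max(last.end, s.end);
    else merged.push({ ...s });
  }

  // Split the string into alternating plain/highlighted segments.
  const segments: TextSegment[] = [];
  let cursor = 0;
  for (const { start, end } of merged) {
    if (start > cursor) segments.push({ text: text.slice(cursor, start), highlight: false });
    segments.push({ text: text.slice(start, end), highlight: true });
    cursor = end;
  }
  if (cursor < text.length) segments.push({ text: text.slice(cursor), highlight: false });
  return segments;
}
```

Because the output is plain data, the template can iterate it and emit `<mark>` only for `highlight: true` segments, with no HTML string handling anywhere.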

Implementation details

Backend pipeline

  • New MatchOffset record: start: int, length: int (Java char positions, compatible with JS indexing)
  • New SearchMatchData record: carries snippet + title offsets + sender/receiver/tag match flags
  • DocumentRepository.findEnrichmentData(): native SQL with lateral join to select highest-ranked transcription block; uses ts_headline with chr(1)/chr(2) delimiters (safe sentinel approach — no regex ambiguity)
  • DocumentService.enrichWithMatchData(): parses delimited headlines into List<MatchOffset>, short-circuits when query is blank or doc list is empty
  • DocumentSearchResult wraps the document list + match data map; controller returns it directly
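The headline-parsing step above can be sketched like this. The real helper is a Java method; this TypeScript version is an illustrative reconstruction of the algorithm, assuming ts_headline is configured with StartSel = chr(1) and StopSel = chr(2):

```typescript
const START = "\u0001"; // chr(1) sentinel emitted by ts_headline before a match
const STOP = "\u0002";  // chr(2) sentinel emitted after a match

interface MatchOffset { start: number; length: number }

function parseHeadline(headline: string): MatchOffset[] {
  const offsets: MatchOffset[] = [];
  let plainIndex = 0;   // position in the text with the sentinels removed
  let matchStart = -1;
  for (const ch of headline) {
    if (ch === START) matchStart = plainIndex;
    else if (ch === STOP) {
      offsets.push({ start: matchStart, length: plainIndex - matchStart });
      matchStart = -1;
    } else plainIndex++;
  }
  return offsets;
}
```

Tracking the sentinel-free index means the resulting offsets point into the clean display text, which is what the frontend receives.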

Frontend rendering

  • applyOffsets merges overlapping/adjacent spans before splitting the string — prevents nested <mark> elements
  • Title rendered as {#each titleSegments} with native <mark> elements (no {@html})
  • Snippet shown only when present; data-testid="search-snippet" for test targeting

Test plan

  • 13 backend integration tests (DocumentSearchEnrichmentTest) — lateral join, stemming, sender/receiver/tag match flags, null-block handling
  • 3 new DocumentServiceTest enrichment tests — title offsets parsing, empty matchData for non-text queries, snippet present
  • 10 search.spec.ts unit tests — applyOffsets edge cases: empty, start/middle/end, two terms, overlapping merge, adjacent merge, clamp, out-of-bounds, unsorted
  • 4 new DocumentList.svelte.spec.ts tests — snippet shown/hidden, <mark> present/absent
  • All 832 frontend tests green

Closes #219

🤖 Generated with Claude Code

marcel added 8 commits 2026-04-15 18:35:58 +02:00
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests lateral join best-block selection, chr(1)/chr(2) headline delimiters,
sender/receiver/tag match flags, and null cases for missing relations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DocumentService.searchDocuments now returns DocumentSearchResult with matchData
populated from findEnrichmentData. Title highlights are parsed from chr(1)/chr(2)
delimiters into MatchOffset lists; transcription snippet and sender/receiver/tag
match flags are extracted from the same native SQL row.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(search): pass matchData from server load to DocumentList
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m46s
CI / Backend Unit Tests (push) Failing after 2m49s
CI / Unit & Component Tests (pull_request) Failing after 2m32s
CI / Backend Unit Tests (pull_request) Failing after 2m36s
ddb87db6b7
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: Approved

Solid TDD all the way through. Tests written before production code, clean pure function for applyOffsets, no {@html} — exactly the right approach.

Suggestions

1. parseTitleOffsets and parseUUIDs belong on the data types, not on DocumentService

DocumentService has grown two private static parsing helpers that have no dependency on service state. They're conceptually parsing helpers for SearchMatchData. Consider moving them:

// In SearchMatchData.java
public static SearchMatchData fromEnrichmentRow(Object[] row) {
    return new SearchMatchData(
        (String) row[2],
        parseTitleOffsets((String) row[1]),
        ...
    );
}

Or a MatchOffset.parseHeadline(String headline) factory. Both would shrink DocumentService and give the logic a more natural home. Not a blocker — but DocumentService is already large.

2. senderMatched / matchedReceiverIds / matchedTagIds computed but not rendered

The backend computes 3 signals that the frontend does not yet use. That's fine if this is a planned follow-up, but it means the lateral join + two correlated subqueries are running for every text search with no visible user benefit beyond what titleOffsets and transcriptionSnippet alone would deliver.

If there's no immediate plan to render sender/receiver/tag highlights in this sprint, consider either (a) deferring these subqueries to a follow-up or (b) filing an issue so the intent is documented. Not a blocker, but unrendered data is dead weight.

3. Minor: <mark> template formatting is subtle but correct

The placement of > immediately before {seg.text} to avoid Svelte whitespace injection is correct and intentional — good catch. This trips up reviewers but it's right.

4. parseUUIDs will throw IllegalArgumentException on a malformed UUID from the DB

.map(UUID::fromString)  // throws IAE for malformed strings

Since the data is DB-sourced (UUIDs cast to text via string_agg), this should never fire. But it would produce a 500 instead of a graceful empty result if it ever does. Low risk — the DB is the source of truth — but defensively skipping malformed values would be safer:

.flatMap(s -> { try { return Stream.of(UUID.fromString(s)); } catch (IllegalArgumentException e) { return Stream.empty(); } })

Or just leave it and trust the DB. Not a blocker.


🏛️ Markus Keller — Application Architect

Verdict: Approved

Layer boundaries are clean. The controller is thin, the service owns enrichment logic, the repository owns SQL. The DocumentSearchResult DTO wraps the response correctly and the withMatchData() factory is the right abstraction. Records for MatchOffset and SearchMatchData are idiomatic Java 21.

Suggestions

1. Missing @Schema(requiredMode = REQUIRED) on documents and total fields of DocumentSearchResult

matchData has the annotation, but documents and total do not. Per project convention (CLAUDE.md), all fields the backend always populates should be marked required to drive correct TypeScript type generation:

public record DocumentSearchResult(
        @Schema(requiredMode = Schema.RequiredMode.REQUIRED)
        List<Document> documents,
        @Schema(requiredMode = Schema.RequiredMode.REQUIRED)
        long total,
        @Schema(requiredMode = Schema.RequiredMode.REQUIRED)
        Map<UUID, SearchMatchData> matchData
) { ... }

Without it, the TypeScript types will model documents and total as optional, which misrepresents the contract. Low-risk now, but it will bite when the types are next regenerated.

2. enrichWithMatchData fires a second DB round-trip for every text search

The LATERAL JOIN + two correlated subqueries are correct SQL, but they add a second round-trip after the main search query. For the current volume (family archive = tens of thousands of documents max) this is fine. Worth noting: if this table ever grows large and transcription_blocks.document_id doesn't have an index, the lateral join degrades. The existing Flyway migrations should already have this index given the transcription block feature — but it's worth verifying with EXPLAIN ANALYZE.

3. total is documents.size() — the comment is correct, the debt is acknowledged

The comment inside DocumentSearchResult.withMatchData() correctly documents that total must come from a COUNT query when pagination is added. This is acceptable debt, well-signposted. One suggestion: create a GitHub issue for the pagination work so the comment can reference a ticket number rather than being prose-only.

4. parseTitleOffsets / parseUUIDs as private statics on DocumentService

(Echoing Felix.) These are pure functions with no service dependencies. They belong as static factories on the DTO records themselves or in a dedicated SearchMatchDataParser class. DocumentService is already doing a lot — parsing headline delimiters is not its domain. Not a blocker for this PR, but worth extracting before the service grows further.


🧪 Sara Holt — QA Engineer

Verdict: Approved

The test suite is well-structured and covers the right layers. Integration tests run against real Postgres via Testcontainers ✓. applyOffsets has 10 unit tests including edge cases (clamping, unsorted, out-of-bounds) ✓. Component tests use vitest-browser-svelte against real DOM ✓. TDD discipline observed throughout.

Suggestions

1. No test for matchData field presence in the controller JSON response

DocumentControllerTest currently mocks searchDocuments to return DocumentSearchResult.of(List.of()), which is correct. But there is no assertion that the JSON response actually contains a matchData key. If the field were accidentally dropped from serialization (e.g. Jackson exclusion, wrong visibility), no test would catch it.

Suggested addition:

@Test
void search_response_includes_matchData_field() {
    when(documentService.searchDocuments(...))
        .thenReturn(DocumentSearchResult.of(List.of()));
    mockMvc.perform(get("/api/documents/search?q=test"))
        .andExpect(status().isOk())
        .andExpect(jsonPath("$.matchData").isMap());
}

2. No test for malformed UUID in parseUUIDs

The static helper parseUUIDs(String csv) calls UUID::fromString, which throws IllegalArgumentException on invalid input. The happy path and null/empty cases are covered indirectly, but there is no test asserting the behavior when the DB somehow returns a malformed UUID string. Since this is DB-sourced data, the risk is low — but a unit test with an invalid input string would make the behavior explicit (either "throws" or "gracefully skips").

3. The transcriptionSnippet contains unescaped user text — no XSS test needed (Svelte handles it), but worth a comment

The snippet is bound via {snippet} in Svelte (text content, not innerHTML), so XSS is structurally impossible. No test needed, but a brief comment near the {snippet} binding would help the next reviewer confirm it's intentional.

4. Missing test: applyOffsets with a negative start offset

Current edge cases cover: empty offsets, out-of-bounds (start > text length), clamping (end > text length), unsorted. A negative start value is not tested — Math.max(0, start) handles it correctly per the implementation, but the test case is absent. Minor coverage gap.
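A minimal sketch of the missing case. The inline applyOffsets here is a compact stand-in (assumed behavior: the span end is computed as start + length before start is clamped to 0), not the project's implementation — in search.spec.ts the real import would be used instead:

```typescript
interface Off { start: number; length: number }

// Compact stand-in for the real applyOffsets, just enough to show the case.
function applyOffsets(text: string, offsets: Off[]) {
  const spans = offsets
    .map(o => ({ s: Math.max(0, o.start), e: Math.min(text.length, o.start + o.length) }))
    .filter(x => x.e > x.s)
    .sort((a, b) => a.s - b.s);
  const out: { text: string; highlight: boolean }[] = [];
  let cur = 0;
  for (const { s, e } of spans) {
    if (s > cur) out.push({ text: text.slice(cur, s), highlight: false });
    out.push({ text: text.slice(s, e), highlight: true });
    cur = e;
  }
  if (cur < text.length) out.push({ text: text.slice(cur), highlight: false });
  return out;
}

// Negative start: {-3, 8} clamps to a span of 0..5, so only "hello" is marked.
const segs = applyOffsets("hello world", [{ start: -3, length: 8 }]);
```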

5. What's tested at each layer — summary (good news)

| Layer | File | Tests |
|---|---|---|
| Unit (pure fn) | search.spec.ts | 10 |
| Integration (Postgres) | DocumentSearchEnrichmentTest | 13 |
| Service unit | DocumentServiceTest | 3 new |
| Component | DocumentList.svelte.spec.ts | 4 new |
| Page | page.svelte.spec.ts | type fixes |

Well-balanced pyramid. No E2E tests added — appropriate, since the existing E2E suite covers search as a critical journey and the new behavior is adequately covered by component tests.


🔒 Nora "NullX" Steiner — Security Engineer

Verdict: Approved

This PR has a strong security posture. The two biggest risks for a search-with-highlighting feature — XSS via rendered match data, and SQL injection via the FTS query parameter — are both handled correctly.

What I checked

FTS injection: clean

websearch_to_tsquery('german', :query)

websearch_to_tsquery is a PostgreSQL function that normalizes and tokenizes free-form input. It converts the text to a tsquery while treating operator-like characters as plain search terms — tsquery syntax cannot be injected through the query string. The named parameter :query means the value is passed as a bind variable, not interpolated into SQL. ✓

XSS via title highlighting: clean

{#if seg.highlight}<mark ...>{seg.text}</mark>{:else}{seg.text}{/if}

{seg.text} in Svelte is text-content binding — it becomes textContent, not innerHTML. An attacker-controlled document title containing <script>alert(1)</script> would be rendered as literal text, not executed. ✓

XSS via transcription snippet: clean

{snippet}

Same Svelte text-content binding. No {@html}. The chr(1)/chr(2) delimiters used in ts_headline are control characters (0x01/0x02) that can never appear in user-submitted document text, so there is no risk of delimiter injection. ✓

UUID parsing from DB: acceptable risk

UUID::fromString in parseUUIDs will throw IllegalArgumentException on malformed input. Since the source is a string_agg of UUID primary key columns, malformed UUIDs cannot be inserted (PostgreSQL enforces UUID column type at storage). The exception path would surface as a 500 error, not a security event. ✓

Suggestions (non-blocking)

1. The websearch_to_tsquery fallback on invalid queries

PostgreSQL's websearch_to_tsquery yields an empty tsquery when no lexemes survive parsing (e.g. the query is all stopwords). The native query uses WHERE d.id IN :ids and the tsquery is only used inside ts_headline / to_tsvector @@ ... subexpressions — so an empty tsquery would silently produce unhighlighted headlines and no subquery matches. This is correct behavior (graceful degradation) but there is no test covering the all-stopwords case. Not a security issue, but a correctness edge case.

2. Content Security Policy: <mark> injection scenario (informational)

If a CSP is in place (not seen in this PR, but worth noting for future), <mark> elements created via DOM manipulation could be flagged. Since these are server-side rendered by Svelte's SSR, this is not a concern here. Just confirming: SSR renders clean HTML, no client-side DOM injection. ✓

Overall assessment: The sentinel delimiter approach (chr(1)/chr(2)) for ts_headline is a clean solution — using control characters that cannot appear in user-submitted text is safer than trying to HTML-escape around arbitrary delimiters. The decision to avoid {@html} and use structured TextSegment[] is the correct defensive choice.


🎨 Leonie Voss — UI/UX & Accessibility

Verdict: ⚠️ Approved with concerns

The highlighting approach is semantically correct (<mark> is the right HTML element for this), the snippet placement is well-considered, and the screen reader annotation (sr-only "Fundstelle:") is a lovely touch. Two things need follow-up.

Blockers

None.

Suggestions

1. bg-accent/20 on the <mark> — verify contrast in context

accent maps to #A6DAD8 (brand-mint). At 20% opacity on white, the resulting background is approximately rgba(166, 218, 216, 0.2) = a very light teal. The text inside the mark inherits the text-ink color (#002850 on this light teal background), which passes contrast easily.

However, when the list item is hovered (hover:bg-muted/50), the mark's background blends with the hover background — the highlight may become nearly invisible. Verify this state visually. A slight opacity bump to bg-accent/30 on hover would maintain visual distinctiveness:

<mark class="bg-accent/20 text-inherit not-italic group-hover:bg-accent/30">

2. The snippet has no visual distinction from the metadata row above it

The snippet (text-sm text-ink-2 font-sans) uses the same typographic treatment as the date/location metadata row. A user scanning the card may not recognize it as a "found in transcription" result vs. document metadata.

Consider one of:

  • A subtle left border in accent: border-l-2 border-accent pl-2
  • An italic style to visually distinguish it: italic text-ink-2
  • A visible micro-label for sighted users, complementing the existing sr-only span

At minimum, the sr-only span "Fundstelle: " should probably read "In Transkription:" to be more precise about what matched.

3. line-clamp-2 is correct, but test at 320px

On a narrow mobile viewport, line-clamp-2 at text-sm (14px) renders approximately 2 × 20px = 40px of snippet. Combined with the title, date, sender/receiver rows, and tags, this is a dense card. Verify that the card doesn't become overwhelming on 320px — the snippet is the deepest item in the information hierarchy and should be visually de-emphasized accordingly.

4. not-italic on <mark> is unnecessary

The <mark> HTML element has no default italic styling. not-italic is a no-op here (equivalent to font-style: normal which <mark> already inherits). Remove it to keep the class list clean:

<mark class="bg-accent/20 text-inherit">

What works well:

  • Semantic <mark> element — screen readers announce "highlight" around matched terms ✓
  • text-inherit preserves the serif heading style inside the mark ✓
  • <span class="sr-only">Fundstelle: </span> gives screen reader users context for the snippet ✓
  • line-clamp-2 prevents layout explosion on long transcriptions ✓
  • data-testid="search-snippet" for test targeting ✓
Author
Owner

## 🚀 Tobias Wendt — DevOps & Platform

**Verdict: ✅ Approved**

No infrastructure changes in this PR. No new services, no Docker Compose changes, no CI modifications. The process used (rebuild Docker container to regenerate API types) is correct — the `api.ts` is committed and the generation step is a dev-time operation, not a CI step.

### What I checked

**Schema changes: none**

No Flyway migrations added. This is purely a query + application code change. The feature works against the existing schema — the `transcription_blocks` and `documents` tables are unchanged. ✓

**`generated/api.ts` committed: correct**

The regenerated TypeScript types are committed to the repository rather than generated at build time. This is the correct approach for this stack — it avoids requiring the backend to be running during `npm run build` in CI. ✓

**No new environment variables or configuration needed: correct** ✓

### One thing worth checking out-of-band

**Index on `transcription_blocks.document_id`**

The LATERAL JOIN in `findEnrichmentData` scans `transcription_blocks WHERE tb.document_id = d.id`. If this column is not indexed, every search with a text query will perform a full table scan of `transcription_blocks` for each document in the result set.

Check:

```sql
\d transcription_blocks  -- in psql
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'transcription_blocks';
```

If `document_id` doesn't have an index, add one in a new Flyway migration:

```sql
-- V37__add_index_transcription_blocks_document_id.sql
CREATE INDEX IF NOT EXISTS idx_transcription_blocks_document_id ON transcription_blocks(document_id);
```

This is a follow-up, not a blocker — at the current data volume (family archive scale) the query will be fast either way — but it's worth doing before the archive grows large.

marcel added 5 commits 2026-04-15 19:13:34 +02:00
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf(search): add index on transcription_blocks.document_id for lateral join
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m29s
CI / Backend Unit Tests (push) Failing after 2m42s
CI / Unit & Component Tests (pull_request) Failing after 2m26s
CI / Backend Unit Tests (pull_request) Failing after 2m35s
211f531513
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

## Review feedback addressed ✅

All seven reviewer concerns from the multi-persona review have been implemented. Summary:

### Backend (commits `a7c839a`, `2b86634`)

| Concern | Reviewer | Fix |
|---|---|---|
| `documents` and `total` fields not marked required in OpenAPI spec | @mkeller | Added `@Schema(requiredMode = REQUIRED)` to both components in `DocumentSearchResult.java`; verified with reflection-based unit tests |
| No controller test asserting `matchData` key in response body | @saraholt | Added `search_responseBodyContainsMatchDataKey` test to `DocumentControllerTest` — stubs a result with a document + match entry and asserts `$.matchData.<uuid>.transcriptionSnippet` in the JSON response |
| Missing `applyOffsets` test for negative start offset | @saraholt | Added two tests to `search.spec.ts`: negative start clamped to 0, and negative-end offset (completely ignored) — implementation already handled both cases correctly |

### Frontend (commit `141511e`)

| Concern | Reviewer | Fix |
|---|---|---|
| `<mark>` highlight invisible on hover (list item darkens, mark stays light) | @leonievoss | Added `group-hover:bg-accent/30` to `<mark>` so it intensifies with the row hover |
| No-op `not-italic` class on `<mark>` | @leonievoss | Removed |
| Snippet visually indistinguishable from metadata labels | @leonievoss | Added `italic` to the snippet `<p>` class |

### Database (commit `211f531`)

| Concern | Reviewer | Fix |
|---|---|---|
| No index on `transcription_blocks.document_id` for the LATERAL join | @tobiwendt | Added Flyway migration `V36__add_index_transcription_blocks_document_id.sql` |
marcel added 7 commits 2026-04-15 21:34:56 +02:00
The refactor made pdfDoc a plain variable so renderer.isLoaded was not
reactive. Svelte only tracked currentPage and scale — but when the canvas
reappeared after loading, neither changed, so the PDF stayed blank.

Fix: merge the two effects into one that reads canvasEl synchronously.
Svelte now tracks canvasEl as a dependency; when the canvas remounts
(loading spinner → false), the effect re-fires and renders the
already-loaded PDF document.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- SearchMatchData gains a 6th field snippetOffsets: List<MatchOffset> so the frontend
  can render highlighted terms inside the transcription snippet without {@html}.
- DocumentRepository.findEnrichmentData now calls ts_headline() with chr(1)/chr(2)
  sentinels instead of returning raw block text; parseHighlight() strips the sentinels
  and produces clean text + MatchOffset list in one pass.
- DocumentService exposes ParsedHighlight and parseHighlight() as public so they can be
  called from cross-package integration tests.
- All related tests updated to the new 6-argument SearchMatchData constructor and
  to call parseHighlight() for asserting the snippet clean text and offsets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace websearch_to_tsquery with a CROSS JOIN LATERAL subquery that
appends :* to each lexeme so prefix matches work (e.g. "furchtb" finds
"furchtbar"). websearch_to_tsquery still handles the safe tokenisation
of user input (stop words, special chars, operators); regexp_replace
then adds :* before to_tsquery re-parses the result.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
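The tsquery rewrite described in the commit above runs inside PostgreSQL, but the string transform itself is easy to illustrate. A minimal TypeScript sketch of the same regex (assuming the tsquery text form quoted later in this thread, `'([^']+)'` rewritten to `'\1':*`):

```typescript
// Illustrative sketch only: mirrors the SQL regexp_replace that appends :* to
// each quoted lexeme in websearch_to_tsquery's text output. The real transform
// is the regexp_replace inside the native query; this just shows the rewrite.
function addPrefixMatch(tsqueryText: string): string {
  // '([^']+)'  ->  '$1':*   (every quoted lexeme gets a prefix-match marker)
  return tsqueryText.replace(/'([^']+)'/g, "'$1':*");
}

// websearch_to_tsquery('german', 'brief furchtb')::text ≈ "'brief' & 'furchtb'"
console.log(addPrefixMatch("'brief' & 'furchtb'")); // → 'brief':* & 'furchtb':*
```

The operators and parentheses between lexemes contain no quotes, so the character class `[^']+` can never match across lexeme boundaries.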
Switch search match highlights from bordered mint chips to a plain navy
underline (decoration-brand-navy). Add visible "Inhalt" / "Content" /
"Contenido" label before the transcription snippet, matching the style
of the Von/An sender-receiver labels.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(search): surface summary snippet when summary matched the query
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m31s
CI / Backend Unit Tests (push) Failing after 2m39s
CI / Unit & Component Tests (pull_request) Failing after 2m27s
CI / Backend Unit Tests (pull_request) Failing after 4m45s
bca7822ab7
Add a summary_snippet column to findEnrichmentData using ts_headline on
documents.summary, only when the summary's tsvector matches the query.
Expose it via SearchMatchData.summarySnippet / summaryOffsets and render
a "Zusammenfassung" / "Summary" / "Resumen" labelled row in the document
list — identical treatment to the transcription snippet row.

Fixes the case where a document appeared in search results with no
visible match explanation (e.g. searching "frucht" found a document
whose summary mentioned "Früchte").

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

## 👨‍💻 Felix Brandt — Senior Fullstack Developer

**Verdict: ⚠️ Approved with concerns**

Overall this is a well-thought-out feature — the sentinel approach is clever, `applyOffsets` is clean, and the test pyramid is genuinely good. But there's a real bug in the test mocks that needs to be fixed before merge.

### Blockers

**1. ArrayIndexOutOfBoundsException in `DocumentServiceTest` enrichment tests**

`DocumentServiceTest.java` (the two new enrichment unit tests) mocks `findEnrichmentData` with 6-element `Object[]` arrays:

```java
// Line ~1139
new Object[]{docId, "\u0001Brief\u0002 an Anna", null, false, null, null}  // 6 elements
// Line ~1172
new Object[]{docId, "Dok", snippetHeadline, false, null, null}             // 6 elements
```

But `enrichWithMatchData` reads 7 columns, including `row[6]` for `summaryHeadline`:

```java
String summaryHeadline = (String) row[6]; // ← AIOOBE — array only has indices 0–5
```

These tests will throw `ArrayIndexOutOfBoundsException`. The fix is a one-liner — add a 7th `null` to each mock array:

```java
new Object[]{docId, "\u0001Brief\u0002 an Anna", null, false, null, null, null}
//                                                                          ^^^^^ summaryHeadline
```

The PR checklist marks these as passing — please re-verify against the current code.

### Suggestions

**2. `ParsedHighlight` and `parseHighlight` exposed as `public` from a service**

`ParsedHighlight` is a `public` nested record inside `DocumentService`, and `parseHighlight` is `public static`. They're only public because the integration test (`DocumentSearchEnrichmentTest`) calls them directly. This leaks implementation detail out of the service's public API.

The cleaner fix: move `ParsedHighlight` to the `dto/` package (alongside its sibling `MatchOffset`) and move `parseHighlight` to a package-private utility or to the `dto` package. The integration test can then import from `dto` without touching service internals.

**3. `enrichWithMatchData` reads 7 positional `Object[]` columns — fragile**

Positional access (`row[0]`…`row[6]`) is brittle: changing column order in the SQL silently maps wrong values. A JPA projection interface would give named access and compile-time safety:

```java
interface EnrichmentRow {
    UUID getId();
    String getTitleHeadline();
    String getTranscriptionSnippet();
    Boolean getSenderMatched();
    String getMatchedReceiverIds();
    String getMatchedTagIds();
    String getSummarySnippet();
}
```

This is a suggestion, not a blocker — but the `Object[]` pattern is already fragile with 7 columns.

**4. Minor: `{#each titleSegments as seg, i (i)}` — index key**

Keying by index `i` works correctly here since the segment list is immutable and derived, but Svelte's reconciler will do unnecessary DOM work on re-renders. Not a real problem at this scale, but `(seg.text + seg.highlight)` would be more precise.

Author
Owner

## 🏛️ Markus Keller — Application Architect

**Verdict: ⚠️ Approved with concerns**

The overall feature architecture is sound: backend produces typed offset structs, frontend consumes them without `{@html}`. That pipeline is clean. A few structural issues are worth addressing.

### Concerns

**1. `ParsedHighlight` as a `public` record nested in `DocumentService`**

A service class is the boundary between repository and controller. Exposing an internal parsing record as a `public` type from a service erodes that boundary and makes the service's public API surface wider than it needs to be.

`ParsedHighlight` belongs in `dto/` alongside its sibling `MatchOffset`. `parseHighlight` can then be a static factory method on `ParsedHighlight` itself, or a package-private utility — not `public static` on a service. The current design forces `DocumentSearchEnrichmentTest` to import a service-internals type, which is an odd testing dependency.

**2. `Object[]` with 7 positional columns is an architectural smell**

`findEnrichmentData` returns `List<Object[]>` with 7 unnamed columns mapped by position. This pattern was acceptable with 2 columns in `findRankedIdsByFts`, but at 7 it becomes genuinely fragile. Changing column order in the SQL — a refactor any developer might make — silently produces wrong data with no compiler warning.

A Spring Data projection interface gives named access at zero cost:

```java
interface EnrichmentRow {
    UUID getId();
    String getTitleHeadline();
    String getTranscriptionSnippet();
    Boolean getSenderMatched();
    // ...
}

@Query(nativeQuery = true, value = "...")
List<EnrichmentRow> findEnrichmentData(...);
```

The interface is a structural type; Spring Data generates the mapping. No boilerplate required.

**3. Duplicated prefix-query construction in two native queries**

The `regexp_replace(websearch_to_tsquery('german', :query)::text, '''([^'']+)''', '''\\1'':*', 'g')` pattern for prefix-matching appears in both `findRankedIdsByFts` and `findEnrichmentData`. If the prefix strategy changes, it must change in two places.

Extraction to a PostgreSQL function keeps this in one place:

```sql
CREATE OR REPLACE FUNCTION prefix_tsquery(config regconfig, query text) RETURNS tsquery AS $$
    SELECT to_tsquery(config, regexp_replace(...))
$$ LANGUAGE sql IMMUTABLE;
```

Both queries then call `prefix_tsquery('german', :query)`.

### What's done well

The sentinel approach (`chr(1)`/`chr(2)`) for `ts_headline` delimiters is correct — it avoids regex ambiguity and is safe for all German BMP characters. The `V36` migration uses `CREATE INDEX IF NOT EXISTS` — idempotent and safe. The `applyOffsets` merge/clamp logic is the right place to handle edge cases rather than spreading defensive code across the component.
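The sentinel parsing praised above is described in this thread but its implementation isn't shown. A minimal sketch of the idea (in TypeScript rather than the project's Java, purely to illustrate; `MatchOffset` field names follow the PR description): walk the headline once, drop the `\u0001`/`\u0002` sentinels, and record offsets into the cleaned string.

```typescript
interface MatchOffset { start: number; length: number; }

// Sketch, not the project's parseHighlight: \u0001 opens a highlight span,
// \u0002 closes it; offsets are positions in the sentinel-free output text.
function parseHighlight(headline: string): { text: string; offsets: MatchOffset[] } {
  let text = "";
  const offsets: MatchOffset[] = [];
  let spanStart = -1;
  for (const ch of headline) {
    if (ch === "\u0001") {
      spanStart = text.length;          // highlight opens at current clean position
    } else if (ch === "\u0002") {
      if (spanStart >= 0) {
        offsets.push({ start: spanStart, length: text.length - spanStart });
        spanStart = -1;
      }
    } else {
      text += ch;
    }
  }
  return { text, offsets };
}

const r = parseHighlight("\u0001Brief\u0002 an Anna");
// r.text === "Brief an Anna"; r.offsets === [{ start: 0, length: 5 }]
```

Because the offsets index the cleaned string, they line up with JavaScript string indexing on the frontend, which is exactly what the offset-based rendering needs.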

Author
Owner

## 🧪 Sara Holt — QA Engineer

**Verdict: 🚫 Changes requested**

The test pyramid structure is genuinely good: `applyOffsets` has 10 well-named unit tests, `DocumentSearchEnrichmentTest` runs against real PostgreSQL with 13 cases, and the component tests cover the key rendering states. But there's a real defect in the mock setup that blocks merge.

### Blockers

**1. Mock arrays too short — AIOOBE in enrichment unit tests**

Both new enrichment unit tests in `DocumentServiceTest` construct 6-element mock arrays:

```java
// Test 1 (~line 1139)
new Object[]{docId, "\u0001Brief\u0002 an Anna", null, false, null, null}
// Test 2 (~line 1172)
new Object[]{docId, "Dok", snippetHeadline, false, null, null}
```

`enrichWithMatchData` reads `row[6]` for `summaryHeadline`. With a 6-element array (indices 0–5), that's an `ArrayIndexOutOfBoundsException`. Both tests will fail.

Fix — add a 7th `null` element to each mock:

```java
new Object[]{docId, "\u0001Brief\u0002 an Anna", null, false, null, null, null}
//                                                                          ^^^^ summaryHeadline = null
```

The PR checklist marks these as green — please re-run `./mvnw test -Dtest=DocumentServiceTest` against the current branch to confirm.

### Suggestions

**2. `summarySnippet` and `summaryOffsets` have no integration test coverage**

`DocumentSearchEnrichmentTest` covers title headline, transcription snippet, sender, receiver, and tag matches thoroughly — but the summary field (column 6 in the SQL) has zero test cases. The summary path is real code that runs in production when a document has a summary matching the query.

Suggested additions to `DocumentSearchEnrichmentTest`:

- `summary_snippet_contains_delimiters_when_summary_matches_query`
- `summary_snippet_is_null_when_summary_does_not_match`
- `summary_snippet_is_null_when_summary_is_empty`

**3. `MatchOffsetTest.should_hold_start_and_length` tests Java record boilerplate**

This test verifies that a Java record's auto-generated `start()` and `length()` accessors return the constructor arguments. That's the JVM contract for records — it doesn't need a test. Remove it to keep the suite focused on behavior.

**4. `@BeforeEach` cleanup could use `@Transactional` for rollback isolation**

`DocumentSearchEnrichmentTest` uses explicit `deleteAll()` in `@BeforeEach`. That's correct but leaves cleanup dependent on execution order. Adding `@Transactional` on the test class rolls back after each test automatically and is the project standard for JPA integration tests (`@Transactional` on the class, not just the method).

### What's done well

The `applyOffsets` test suite in `search.spec.ts` is excellent: it covers empty offsets, start/middle/end positions, two non-overlapping terms, overlapping merge, adjacent merge, clamping, out-of-bounds, unsorted input, and negative start values. Every interesting edge case is named and tested independently. The `data-testid` attributes (`search-snippet`, `sender-match`, `receiver-match`, `tag-match`) make Playwright targeting clean for future E2E tests.
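The edge cases enumerated above (sort, clamp, merge of overlapping and adjacent spans, negative start, out-of-bounds) pin down the behaviour of `applyOffsets` fairly completely. A minimal TypeScript sketch consistent with that behaviour — not the project's source, just an illustration of the contract the tests describe:

```typescript
interface MatchOffset { start: number; length: number; }
interface TextSegment { text: string; highlight: boolean; }

// Sketch of the behaviour the test suite describes: clamp spans to the text,
// drop empty/invalid ones, sort, merge overlapping or adjacent spans, then
// split the string into alternating plain/highlighted segments.
function applyOffsets(text: string, offsets: MatchOffset[]): TextSegment[] {
  const spans = offsets
    .map(o => ({ start: Math.max(0, o.start), end: Math.min(text.length, o.start + o.length) }))
    .filter(s => s.end > s.start)                 // drops out-of-bounds / negative-end spans
    .sort((a, b) => a.start - b.start);

  const merged: { start: number; end: number }[] = [];
  for (const s of spans) {
    const last = merged[merged.length - 1];
    if (last && s.start <= last.end) last.end = Math.max(last.end, s.end); // overlap or adjacency
    else merged.push({ ...s });
  }

  const segments: TextSegment[] = [];
  let pos = 0;
  for (const m of merged) {
    if (m.start > pos) segments.push({ text: text.slice(pos, m.start), highlight: false });
    segments.push({ text: text.slice(m.start, m.end), highlight: true });
    pos = m.end;
  }
  if (pos < text.length) segments.push({ text: text.slice(pos), highlight: false });
  return segments;
}
```

Merging adjacent spans before splitting is what prevents two back-to-back `<mark>` elements (or nested ones) in the rendered title.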

Author
Owner

## 🔒 Nora "NullX" Steiner — Security Engineer

**Verdict: ✅ Approved**

No security vulnerabilities found. This is a well-implemented feature from a security standpoint.

### What I checked

**SQL injection** — All native queries use named parameters (`:query`, `:ids`). The `websearch_to_tsquery` function sanitizes FTS input. The `regexp_replace` transform operates on the already-parameterized tsquery output, not on raw user input. ✅

**XSS** — The frontend rendering uses native Svelte text interpolation (`{seg.text}`) and `<mark>` elements throughout. There is no `{@html}` anywhere in this feature. The entire purpose of `applyOffsets` is to convert backend offset data into typed `TextSegment[]` objects that eliminate the need for HTML injection — this is the correct XSS defense for this pattern. ✅

**Authorization** — No new endpoints or permission boundaries. The `enrichWithMatchData` call is triggered only from the existing `GET /api/documents/search` endpoint, which sits behind the existing auth chain. ✅

**Input sanitization in `parseUUIDs`** — `UUID.fromString()` rejects malformed UUIDs with `IllegalArgumentException`. Since the input comes from `string_agg(r.id::text, ',')` in a trusted SQL subquery, injection is structurally impossible. ✅

**`ParsedHighlight` exposure** — Making it `public` is a code design issue, not a security concern.

### One low-priority note

The sentinel characters used as `ts_headline` delimiters are `\u0001` (SOH) and `\u0002` (STX) — Unicode control characters. PostgreSQL's TEXT type accepts all Unicode, including these. If user-provided document titles or transcription text somehow contained literal `\u0001`/`\u0002` characters, `parseHighlight` would produce incorrect offsets (the control chars would be misinterpreted as highlight delimiters).

This is not a realistic attack vector (browsers and standard form inputs strip control characters), but if you have a bulk import path (the ODS importer) that reads raw file content, consider stripping C0 control characters (U+0001–U+001F) from title and text fields at ingestion time as a defensive measure.

**CWE**: CWE-116 (Improper Encoding or Escaping of Output) — low confidence, low impact.
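The defensive stripping suggested above is a one-line transform. A sketch in TypeScript (the ingestion path is backend-side, so this is illustrative only; whether to preserve tab and newline is a judgment call, here they are kept since transcriptions may legitimately contain them):

```typescript
// Strip C0 control characters from imported text so literal sentinel bytes
// can never collide with the ts_headline delimiters \u0001/\u0002.
// Keeps tab (\u0009) and newline (\u000A); drops the rest of U+0001–U+001F.
function stripControlChars(input: string): string {
  return input.replace(/[\u0001-\u0008\u000B-\u001F]/g, "");
}

console.log(stripControlChars("Brief\u0001 an\u0002 Anna")); // → "Brief an Anna"
```

Applied once at ingestion (the ODS importer), this removes the only path by which the sentinel characters could enter stored titles or transcription text.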


🎨 Leonie Voss — UI/UX & Accessibility

Verdict: ⚠️ Approved with concerns

Using native <mark> elements for search highlights is semantically correct, the brand token usage on the underline decoration is on-point, and the not-italic override in snippet marks is a thoughtful detail. A few accessibility and readability issues need attention.


High Priority

1. Snippet and summary text at text-sm (14px) — below minimum for the senior audience

The snippet <p> and summary <p> elements both use text-sm (14px):

class="line-clamp-2 font-sans text-sm text-ink-2 italic"

For our dual audience (25–42 and 60+), 14px is below the 16px recommended minimum for body-adjacent text. Snippets are exactly the content seniors need to read to understand why a document matched — this isn't decoration. Consider text-base (16px). line-clamp-2 still applies cleanly at 16px.


Medium Priority

2. <mark> elements have no screen reader label

The native <mark> element is announced as "highlight" by some screen readers, but support is inconsistent across AT/browser combinations. For keyboard and screen reader users in our senior audience, the highlighted term blends in without context. A minimal improvement:

<mark aria-label="{seg.text} (Treffer)" class="...">
  {seg.text}
</mark>

Or add a visually hidden prefix before the title:

<span class="sr-only">{m.search_result_title_with_match()}</span>

At minimum, add a Paraglide key for this so it can be localized.

3. Verify text-ink-3 contrast on snippet labels

The "Inhalt" / "Zusammenfassung" labels above snippets use text-ink-3:

class="shrink-0 font-sans text-xs font-bold tracking-wide text-ink-3 uppercase"

text-xs is 12px. Even in bold, 12px counts as normal-size text under WCAG, so it needs a 4.5:1 contrast ratio for AA. Verify that text-ink-3 meets this against the white card background. If it's in the gray-400 range (~#9CA3AF), the contrast is only ~2.5:1 — a clear WCAG fail. The section title pattern in the codebase uses text-gray-400, which already sits at this risk threshold; adding more text-xs labels in the same light-gray family compounds it.
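If you want to spot-check the math rather than trust eyeballed numbers, here is a minimal sketch of the WCAG 2.x relative-luminance and contrast-ratio formulas. The hex values are assumptions (Tailwind's gray-400 and plain white), not tokens pulled from our codebase — substitute the actual resolved value of text-ink-3:

```typescript
// Sketch only: WCAG 2.x contrast math for spot-checking design tokens.
// "#9CA3AF" (Tailwind gray-400) and "#FFFFFF" are assumed example values,
// not the real text-ink-3 / card background tokens.

/** Parse "#RRGGBB" into [r, g, b] components in 0–255. */
function hexToRgb(hex: string): [number, number, number] {
  const n = parseInt(hex.slice(1), 16);
  return [(n >> 16) & 0xff, (n >> 8) & 0xff, n & 0xff];
}

/** WCAG relative luminance of an sRGB color. */
function luminance(hex: string): number {
  const [r, g, b] = hexToRgb(hex).map((v) => {
    const c = v / 255;
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  });
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

/** Contrast ratio between two colors per WCAG 2.x; ranges from 1 to 21. */
function contrastRatio(fg: string, bg: string): number {
  const [brighter, darker] = [luminance(fg), luminance(bg)].sort((a, b) => b - a);
  return (brighter + 0.05) / (darker + 0.05);
}

// gray-400 on white comes out around 2.5:1 — well short of the 4.5:1 AA bar.
console.log(contrastRatio("#9CA3AF", "#FFFFFF").toFixed(2));
```

Dropping the real token values in makes the AA pass/fail question a one-liner instead of a guess.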


Low Priority

4. Duplicated <mark> class string

The class "bg-transparent text-inherit underline decoration-brand-navy decoration-2 underline-offset-2" appears 3–4 times across the template. Note that Svelte only allows {@const} as the immediate child of a block such as {#each} or {#if}, not at the top level of a component — so either declare it just inside the {#each} block:

{#each titleSegments as seg}
  {@const markClass = "bg-transparent text-inherit underline decoration-brand-navy decoration-2 underline-offset-2"}
  ...
{/each}

or hoist a plain const into the <script> block (or a CSS class in a <style> block) so the styling lives in one place.

What's done well

<mark> is the semantically correct element for search highlights per HTML5. The not-italic override prevents the italic snippet style from bleeding into highlighted terms. The underline treatment (decoration-brand-navy decoration-2) is consistent with the existing highlight pattern established in earlier commits. Labels ("Inhalt", "Zusammenfassung") are properly i18n'd across all three languages.


⚙️ Tobias Wendt — DevOps & Platform Engineer

Verdict: ✅ Approved

No infrastructure changes in this PR. Everything looks clean from the platform side.


What I checked

  • docker-compose.yml — No changes.
  • CI workflows — No changes.
  • New dependencies — None added to pom.xml or package.json.
  • Image tags — No new Docker image references.
  • Secrets / credentials — None in the diff.

V36 migration

CREATE INDEX IF NOT EXISTS idx_transcription_blocks_document_id
    ON transcription_blocks (document_id);

Clean. IF NOT EXISTS makes it idempotent and safe to re-apply. The index on transcription_blocks.document_id is the right support for the lateral join in findEnrichmentData — without it, the lateral join would do a sequential scan per document in the result set.


One operational note

The new findEnrichmentData query runs for every search with a text query. It's a lateral join with a ts_rank(...) ORDER BY ... LIMIT 1 inside — that's efficient when the transcription_blocks.document_id index is hit. At current archive volumes this is fine.

As the document count grows, keep an eye on search latency using EXPLAIN ANALYZE against the production-equivalent database. If ts_rank scans start to dominate at large block counts, a GIN index on to_tsvector('german', tb.text) on transcription_blocks would speed up the inner WHERE ... @@ q.pq filter. Not needed now, but worth having in a backlog item.
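For the backlog item, the index sketched below mirrors the V36 migration style. The index name and the assumption that the block text lives in transcription_blocks.text (with the 'german' FTS configuration, matching the query side) are mine — adjust to whatever findEnrichmentData actually filters on:

```sql
-- Sketch only: expression GIN index to accelerate the inner
-- WHERE to_tsvector('german', tb.text) @@ q.pq filter at scale.
-- Index name and column are assumptions; align with the real schema.
CREATE INDEX IF NOT EXISTS idx_transcription_blocks_text_fts
    ON transcription_blocks
    USING gin (to_tsvector('german', text));
```

The expression in the index must match the query's to_tsvector(...) call character-for-character (including the config argument), or the planner won't use it.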

marcel merged commit d7b2357834 into main 2026-04-16 09:10:11 +02:00
marcel deleted branch feat/issue-219-search-snippets 2026-04-16 09:10:13 +02:00
Reference: marcel/familienarchiv#242