feat: show search result snippets with match highlighting #219

Closed
opened 2026-04-12 08:26:09 +02:00 by marcel · 10 comments
Owner

Problem

When searching documents (full-text via q=), the result list shows title, date, sender, receivers, and tags — but not why a document matched. Users can't tell whether the match came from the title, transcription, a tag, or a sender alias. With ~1500 documents and growing, this means opening each result to check relevance.

Proposal

Show a short text snippet (1–2 lines) below each search result that indicates where the match was found, with the matching term highlighted.

Open questions

  • Should the backend return snippets, or should the frontend extract them from existing data?
  • Which fields should produce snippets? (transcription, summary, title, sender/receiver names, tags)
  • How to handle matches in multiple fields — show the "best" one or list all?

Priority

Low — nice-to-have for later, captured here so we don't forget.

marcel added the feature, ui labels 2026-04-12 08:26:25 +02:00

🎨 Leonie Voss — UI/UX Design Discussion

Pre-implementation design decisions agreed with @marcel.

1. Highlight style

Use <mark> (semantic HTML for search highlights). Visual treatment: bg-brand-mint text-brand-navy font-bold. Both background color and bold weight — color-blind users aren't relying on tint alone.

2. Snippet source & length

Backend supplies the snippet (not frontend extraction). Content is a full transcription line — a natural text unit, no mid-sentence truncation needed. Frontend applies a safety cap of ~200 characters with trailing … as a defensive fallback for unusually long lines.

3. Field attribution — what gets highlighted where

Backend signals which fields matched. Frontend renders accordingly:

  • Transcription match → dedicated snippet line below the title with <mark> on the matched term
  • Title / sender / receiver match → matched term highlighted inline within the existing card text using <mark> (backend handles this to correctly support stemmed matches)
  • Tag match → entire tag chip gets bg-brand-mint fill instead of the default border style — no character-level markup needed

4. Empty / no-snippet case

Nothing shown. The backend only returns match signals for fields displayed in the card — this case shouldn't arise by design. No fallback label.

5. Snippet placement & typography

Snippet line sits below the title, above the date/sender metadata row. Typography: text-sm text-gray-600 font-serif italic — subordinate to the title, italic signals quoted source text, serif aligns with document content conventions.

6. Single vs. multiple snippets

One transcription snippet per card — the highest-ranked line from the backend. Multiple hits are discoverable inside the document itself. Keeps card height consistent and result lists scannable.


The design avoids a dedicated snippet column for tag/sender matches (they're highlighted in place), which keeps the card layout clean. The only new visual element per card is a single italic line below the title — minimal footprint for a meaningful gain in search relevance clarity.


🏗️ Markus Keller — Architecture Discussion

Pre-implementation architectural decisions agreed with @marcel. Builds on the UI discussion above.

1. API response shape

Keep the existing two-step search flow (FTS UUIDs → JPA Specification entity load). No full query rewrite. DocumentSearchResult gains a Map<UUID, SearchMatchData> alongside the existing List<Document> — documents provide card fields, the map provides match overlay data keyed by document ID. DocumentSearchResult.of() factory updated accordingly.

2. No HTML in the API — structured offsets only

Backend returns typed match data, never pre-rendered <mark> HTML strings. {@html ...} in Svelte is an XSS vector and is avoided. The frontend applies a small utility function to split strings and insert <mark> elements at the specified offsets.

3. Snippet extraction — lateral join on transcription_blocks

The search_vector on documents aggregates all transcription block text into one blob — ts_headline() on it would not return individual lines. To return the best matching transcription line, a separate native SQL query with a LEFT JOIN LATERAL on transcription_blocks is needed:

LEFT JOIN LATERAL (
  SELECT tb.text
  FROM transcription_blocks tb
  WHERE tb.document_id = d.id
    AND to_tsvector('german', tb.text) @@ websearch_to_tsquery('german', :query)
  ORDER BY ts_rank(to_tsvector('german', tb.text), websearch_to_tsquery('german', :query)) DESC
  LIMIT 1
) best_block ON true

No schema change required. If this becomes a bottleneck at scale, a tsvector column on transcription_blocks is the future fix.

4. SearchMatchData record shape

Explicit typed fields — no generic Map<String, List<MatchOffset>>, which couldn't express per-entity granularity:

public record SearchMatchData(
    String transcriptionSnippet,     // best matching transcription line, or null
    List<MatchOffset> titleOffsets,  // character offsets within title, empty if no match
    boolean senderMatched,           // sender is 0..1 — bool is sufficient
    List<UUID> matchedReceiverIds,   // which receivers matched (multiple possible)
    List<UUID> matchedTagIds         // which tags matched (multiple possible)
)

public record MatchOffset(int start, int length) {}

Multi-value entity fields (receivers, tags) use matched ID lists so the frontend can highlight specific chips — matchedFields: ["receivers"] would be ambiguous with multiple receivers.

5. Service layer ownership

Snippet/highlight enrichment stays in DocumentService. It's a direct enrichment of the search pipeline, not a separate domain concern. No new service.


The existing DocumentSearchResult DTO and searchDocuments() signature need updates; the OpenAPI spec and frontend type generation will need a regeneration pass after this change.


👨‍💻 Felix Brandt — Implementation Discussion

Pre-implementation decisions on the concrete coding contract. Builds on the UI and architecture discussions above.

1. Title offset computation — ts_headline() with delimiter parsing

Simple Java indexOf is wrong for German FTS: inflected forms ("geschrieben") won't match a literal query ("schreiben"). Use ts_headline() with custom delimiters in the same native SQL query used for the transcription lateral join:

ts_headline('german', d.title, websearch_to_tsquery('german', :query),
    'StartSel=«,StopSel=»,HighlightAll=true')

Java parses «» delimiters to produce List<MatchOffset>. Same pattern applies to sender/receiver names.
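The delimiter-parsing step is straightforward but easy to get subtly wrong (offsets must be relative to the original, delimiter-free text). A minimal sketch of the logic, shown in TypeScript for brevity — the real implementation would be Java in the service layer, and the delimiter characters are parameterized in case the choice changes:

```typescript
type MatchOffset = { start: number; length: number };

// Convert a ts_headline() result like "Brief «geschrieben» am Montag"
// into character offsets relative to the original (delimiter-free) text.
// Delimiters are parameters so a safer pair can be swapped in later.
function parseHeadline(headline: string, startSel = "«", stopSel = "»"): MatchOffset[] {
  const offsets: MatchOffset[] = [];
  let plainPos = 0; // position in the original text, with delimiters removed
  let i = 0;
  while (i < headline.length) {
    if (headline.startsWith(startSel, i)) {
      const end = headline.indexOf(stopSel, i + startSel.length);
      if (end === -1) break; // unbalanced delimiter: stop parsing defensively
      const length = end - (i + startSel.length);
      offsets.push({ start: plainPos, length });
      plainPos += length;
      i = end + stopSel.length;
    } else {
      plainPos++;
      i++;
    }
  }
  return offsets;
}
```

A title with no match comes back from ts_headline() without any delimiters, so the function naturally returns an empty array rather than throwing.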

2. Enrichment location — private method inside searchDocuments()

searchDocuments() remains the single public entry point. A private enrichWithMatchData(List<Document> docs, String query) runs one batched native SQL query for all matched document IDs. Controller does no orchestration — it calls searchDocuments() and gets a fully-enriched DocumentSearchResult back.

3. Frontend applyOffsets utility — TextSegment[], no {@html}

Lives in $lib/search.ts:

type TextSegment = { text: string; highlighted: boolean }
function applyOffsets(text: string, offsets: MatchOffset[]): TextSegment[]

Template iterates segments with {#each}, wraps highlighted ones in <mark>. No {@html}, no XSS surface. Empty offsets returns a single plain segment — safe no-op.
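To make the contract concrete, here is a minimal sketch of what applyOffsets could look like, assuming offsets arrive sorted by start and non-overlapping (an illustration of the signature above, not the final implementation):

```typescript
type MatchOffset = { start: number; length: number };
type TextSegment = { text: string; highlighted: boolean };

// Split `text` into plain and highlighted segments based on `offsets`.
// Assumes offsets are sorted and non-overlapping; out-of-range offsets
// are clamped defensively rather than throwing.
function applyOffsets(text: string, offsets: MatchOffset[]): TextSegment[] {
  const segments: TextSegment[] = [];
  let cursor = 0;
  for (const { start, length } of offsets) {
    const from = Math.max(cursor, Math.min(start, text.length));
    const to = Math.min(from + length, text.length);
    if (from > cursor) {
      segments.push({ text: text.slice(cursor, from), highlighted: false });
    }
    if (to > from) {
      segments.push({ text: text.slice(from, to), highlighted: true });
    }
    cursor = Math.max(cursor, to);
  }
  if (cursor < text.length) {
    segments.push({ text: text.slice(cursor), highlighted: false });
  }
  // Empty offsets → one plain segment: the safe no-op described above.
  return segments.length > 0 ? segments : [{ text, highlighted: false }];
}
```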

4. OpenAPI @Schema annotations

@Schema(requiredMode = REQUIRED) on all SearchMatchData and MatchOffset fields except transcriptionSnippet (explicitly nullable — null when no transcription match). Collections are always non-null (empty list, never null). npm run generate:api regeneration pass required after backend changes, committed alongside them.

5. Test strategy

  • Mockito unit test — enrichWithMatchData in isolation: mock the batched repository call, assert correct offset parsing from «» delimiters, correct matchedTagIds, null snippet when no block matched.
  • Testcontainers integration test — lateral join + ts_headline() against real PostgreSQL 16. Insert document with known transcription blocks, assert correct snippet line and title offsets. This is the test that catches stemming and delimiter parsing bugs.
  • Vitest unit tests — applyOffsets: empty offsets, single offset, multiple offsets, offset at string boundaries.
  • No E2E needed — unit + integration layer covers the contract.

The key implementation detail that emerged: ts_headline() with delimiter parsing is the only correct approach for title offsets given German FTS stemming. A naive indexOf would silently produce wrong highlights for inflected forms — which is the norm, not the edge case, in a 19th-century German family archive.


👨‍💻 Felix Brandt — Senior Fullstack Developer

My pre-implementation design contract is already captured above. A few remaining implementation observations before we start coding:

Edge cases worth a second look

  • «» delimiter collision: ts_headline() with StartSel=«,StopSel=» is elegant, but « and » (\u00AB/\u00BB) are valid German typographic quotation marks and do appear in historical archive documents. If a title contains «Mein Leben», the «» parser will produce wrong offsets silently. Safer delimiters: \x01 / \x02 (ASCII control characters — cannot appear in text content), or a sequence like |||HL_START||| / |||HL_END|||. Worth confirming before the native query is written.

  • Empty docs list in enrichWithMatchData: An IN () clause with zero elements is invalid SQL in most dialects. The private method should short-circuit with an empty map when docs.isEmpty(). This edge is easy to miss and would only surface on a search returning zero results.

  • Zero-length query: When q= is empty or absent, enrichWithMatchData should return an all-empty/null SearchMatchData per document and not call into ts_headline() or the lateral join at all. The websearch_to_tsquery('german', '') call raises an error in PostgreSQL.

Test fixture detail

The Testcontainers integration test needs at least three transcription blocks per document with varying FTS rank — one block with the query term twice, one with it once, one without it — to actually verify that the highest-ranked block is selected and not just the first or last.

Question

Is MatchOffset(int start, int length) measured in Java String characters (UTF-16 code units) or Unicode code points? ts_headline() operates on PostgreSQL's UTF-8 character positions. For ASCII-only Latin text this is irrelevant, but German historical documents may include ß, ü, ö, ä (all BMP, safe in UTF-16), and occasionally older ligatures. Worth verifying the offset semantics are consistent end-to-end before the parsing utility is written.


🏗️ Markus Keller — Application Architect

The architectural decisions in my pre-implementation comment hold. A few structural questions that emerged when reviewing the full picture:

Questions & Observations

  • Short-circuit when no text query is present: The search endpoint accepts filter-only requests (by date, tag, sender) with no q= parameter. In that case, the enrichment path — ts_headline(), the lateral join, offset parsing — must not run. The searchDocuments() signature should make this explicit: pass a nullable String query to enrichWithMatchData, and return an empty Map immediately when it's null. Letting the enrichment run with a null query risks a NullPointerException inside the native SQL call.

  • DocumentSearchResult API change scope: Adding Map<UUID, SearchMatchData> to the response is a backward-incompatible change. Any existing API test (backend/api_tests/) that asserts on the search response shape will fail. Worth identifying those files before starting implementation so the test-update scope is known.

  • Race condition on document deletion: The two-step flow (FTS UUIDs → JPA entity load) already exists. The lateral join in enrichWithMatchData runs after the JPA load, scoped to WHERE d.id IN (...). If a document is deleted between the JPA load and the enrichment query, the lateral join just returns no rows for that ID — the map entry is absent, which is safe. But this path should be explicitly covered: absent map entry → frontend treats it as no match data, not as an error.

  • SearchMatchData as a Java record vs. a class: The record declaration is the right call — it's immutable, structurally equal, and readable. Confirm it carries the @Schema annotations correctly for OpenAPI generation. Java records support @Schema on canonical constructor parameters, but the OpenAPI annotation placement differs from fields on a class. Worth verifying with a quick ./mvnw clean package -DskipTests + spec inspection before going further.

Suggestion

The DocumentSearchResult.of() factory is the one place that wires documents and match data together. Given that this method is now doing more work (keying a map, handling absent entries), it's worth naming clearly: DocumentSearchResult.withMatchData(List<Document>, Map<UUID, SearchMatchData>) would communicate intent better than an overloaded of().


🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist

The design decisions in my earlier comment cover the main cases well. A few accessibility and mobile details that need attention during implementation:

Accessibility of <mark> in context

<mark> is semantically correct for search highlights and most screen readers (NVDA 2024+, VoiceOver on macOS) announce it as "highlighted" or similar. However, JAWS on some configurations reads it without any special announcement, meaning a screen reader user won't know why a term is visually distinct. Worth adding a visually-hidden context hint — either a <span class="sr-only"> near the search results heading (e.g. "Matches highlighted") or aria-label on the results list. Not a blocker, but something to address before launch given the senior audience.

Contrast check for tag chip highlight state

The default tag chip uses a border style (brand-sand border, white background). The matched-tag state switches to bg-brand-mint fill. Need to confirm the tag label text color on that background:

  • If tag text is text-brand-navy (#002850 on #A6DAD8): contrast ≈ 9.6:1 — AAA ✓
  • If tag text is text-gray-600 or similar: verify before coding

When in doubt, set matched-tag text explicitly to text-brand-navy so the contrast is guaranteed.

Mobile: snippet line on small screens

On 320px viewports, the snippet line text-sm text-gray-600 font-serif italic sits between the title and the metadata row. A 200-character snippet will wrap to 4–5 lines on mobile, which can make the card feel dense. Recommend adding line-clamp-2 (Tailwind's [@supports(display:-webkit-box)]:line-clamp-2) to limit the visible snippet to two lines with a trailing ellipsis. This preserves card scannability on mobile while keeping the full snippet reachable via the document detail view.

Spacing token between snippet and metadata row

The design comment doesn't specify the vertical gap between the snippet line and the date/sender row. Suggest mt-1.5 (6px) — enough to visually separate them without adding significant height to the card. Should be specified so implementation is consistent across all result cards.


🧪 Sara Holt — QA Engineer & Test Strategist

Felix's test strategy covers the main cases. Here's what I'd want filled in before implementation starts:

Missing acceptance criteria

The issue describes the feature and open questions but has no AC. Before picking this up, define the minimum to call it done:

  • Does a search for "schreiben" highlight geschrieben in a title? (Stemming)
  • Does a filter-only search (no q=) show no highlights? (No false positives)
  • Do tag chips remain visually unchanged when no tag matched?

Without AC, "done" is subjective and scope creep risk is real.

Edge cases Felix's test plan doesn't explicitly cover

  • Overlapping offsets: A query like "Brandt Brand" could produce two offsets that overlap ([0,5] and [0,8] on "Brandt"). What does applyOffsets do? Merge to longest? Take first? Silently produce garbled output? This needs a Vitest test and a defined behavior before implementation.
  • Empty query (q=): What does the API return when called with q=? The enrichment should produce all-empty SearchMatchData, not an error. Integration test needed.
  • Document with zero transcription blocks: The lateral join must return no snippet row (not a query error). Integration test: insert document with no blocks, confirm transcriptionSnippet is null, no 500.
  • Title with no match: ts_headline() returns the original title text unmodified (no delimiters). The Java parser must produce an empty List<MatchOffset>, not throw on the absence of delimiters. Unit test needed.
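For the overlapping-offsets question, one possible contract is to merge overlapping (or touching) offsets into their union before segmenting. A hypothetical helper sketching that behavior — the name and the merge semantics are assumptions to be agreed on, not decided yet:

```typescript
type MatchOffset = { start: number; length: number };

// Merge overlapping or touching offsets into their union, so that
// downstream segmenting never sees a conflicting pair. Hypothetical
// pre-processing step; the actual contract is still open.
function mergeOffsets(offsets: MatchOffset[]): MatchOffset[] {
  const sorted = [...offsets].sort((a, b) => a.start - b.start);
  const merged: MatchOffset[] = [];
  for (const o of sorted) {
    const last = merged[merged.length - 1];
    if (last && o.start <= last.start + last.length) {
      // Extend the previous offset to cover the union of both spans.
      const end = Math.max(last.start + last.length, o.start + o.length);
      last.length = end - last.start;
    } else {
      merged.push({ ...o });
    }
  }
  return merged;
}
```

With this in place, the "Brandt Brand" case ([0,5] and [0,8]) collapses to a single [0,8] offset before any segmenting happens, which is one defensible answer to the "merge to longest?" question above.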

Test naming reminder

Felix mentioned test names but didn't list them — worth writing them out in the issue so they can be checked off during code review:

should_return_null_snippet_when_no_transcription_blocks_match()
should_return_empty_title_offsets_when_title_does_not_match()
should_short_circuit_enrichment_when_query_is_null()
applyOffsets: overlapping offsets produce correct segments
applyOffsets: offset at string boundary does not throw

CI time impact

The Testcontainers integration test spins up a real PostgreSQL container. If the search integration tests share a container with existing tests (via @Container static at the class level), the overhead is just the test time itself. If it needs a fresh container, add ~10–15 seconds. Worth verifying the total integration test suite stays under the 2-minute target.

Author
Owner

🔐 Nora Steiner — Application Security Engineer

Overall the design is sound from a security perspective. The {@html}-free approach with TextSegment[] is exactly the right call. A few things worth verifying before and during implementation:

What's already right

The decision to return structured offsets from the backend rather than pre-rendered HTML eliminates the XSS surface entirely. This is the correct architecture. A backend that returned <mark> strings would require {@html} on the frontend, which is an XSS vector — the design correctly avoids this. No concerns there.

Offset bounds validation in applyOffsets

The applyOffsets frontend utility receives offset data from the backend. In normal operation the backend is trusted, but defensive validation in the frontend utility is still good practice:

function applyOffsets(text: string, offsets: MatchOffset[]): TextSegment[] {
  // Validate each offset is within bounds before slicing
  const safe = offsets.filter(o => o.start >= 0 && o.start + o.length <= text.length);
  // ... rest of implementation
}

A buggy backend (or future regression in the enrichment SQL) that returns out-of-range offsets should produce no highlight, not a runtime error. JavaScript's slice() handles out-of-bounds gracefully, but explicit filtering makes the intent clear and prevents potential downstream issues.

Authorization on the enriched search endpoint

Confirm GET /api/documents?q=... is covered by @RequirePermission(READ_ALL). The search results now include transcription text snippets — actual content from the documents. An unauthenticated or under-privileged call leaking these snippets is a more impactful information disclosure than leaking just the document title. This permission check should already be in place, but it's worth verifying explicitly given the new content being exposed.

Delimiter injection via document content

Felix's comment flags the «» collision risk from a correctness angle. From a security angle: if document titles are user-controlled (they are — editors can set any title), a crafted title like «normal text» fake highlight «real content» could manipulate the offset parser into producing false highlights. This is low-severity (no code execution, just misleading UI), but it's another argument for choosing \x01/\x02 delimiters that cannot appear in user-supplied content.
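For illustration, a sketch of the delimiter-stripping parse using the \x01/\x02 sentinels. The real parser is planned for the Java side; this TypeScript version and the parseHeadline name are hypothetical, shown only to make the collision argument concrete:

```typescript
type MatchOffset = { start: number; length: number };

// Strip \x01/\x02 sentinel delimiters from a ts_headline() result and
// record the highlighted spans as offsets into the cleaned string.
function parseHeadline(headline: string): { text: string; offsets: MatchOffset[] } {
  let text = "";
  const offsets: MatchOffset[] = [];
  let markStart = -1;
  for (const ch of headline) {
    if (ch === "\x01") {
      markStart = text.length; // highlight opens at current position
    } else if (ch === "\x02") {
      offsets.push({ start: markStart, length: text.length - markStart });
      markStart = -1;
    } else {
      text += ch;
    }
  }
  return { text, offsets };
}

console.log(parseHeadline("Brief an \x01Brandt\x02 vom Mai"));
// → { text: "Brief an Brandt vom Mai", offsets: [{ start: 9, length: 6 }] }
```

Because chr(1)/chr(2) cannot survive in user-entered titles, a crafted title full of «» characters passes through as plain text and cannot forge an offset.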

No concerns on the backend SQL

ts_headline() and the lateral join use parameterized queries (:query named parameter) — no SQL injection surface. The FTS input goes through websearch_to_tsquery() which handles sanitization internally. Clean.

Author
Owner

⚙️ Tobias Wendt — DevOps & Platform Engineer

No new infrastructure needed for this feature. That's good. A few observations from the platform side:

OpenAPI spec drift should be enforced in CI

The implementation plan notes that npm run generate:api must be committed alongside backend changes. Currently there's no CI gate that verifies the generated TypeScript types match the live spec. With SearchMatchData and MatchOffset being new types, a future refactor that changes the record shape without regenerating types would fail silently at runtime. Consider adding a CI step:

- name: Verify generated types are up to date
  run: |
    npm run generate:api
    git diff --exit-code frontend/src/lib/api/  # fails if types are stale

This catches spec drift at PR time, not in production.

Query performance: slow query logging

The lateral join is a new SQL pattern in this codebase. It runs on every text search. At 1,500 documents it should be fast; at 15,000 it may not be. Is slow query logging enabled on PostgreSQL? The default log_min_duration_statement in the Docker Compose setup is worth checking. Setting it to 500ms in application-dev.yaml or the Compose environment would surface regressions during development before they reach production.

If the current compose file doesn't set this, a one-line addition covers it:

db:
  command: postgres -c log_min_duration_statement=500

Future index tracked?

Markus's comment notes: "If this becomes a bottleneck at scale, a tsvector column on transcription_blocks is the future fix." Recommend opening a follow-up issue to track this before closing this one, so it doesn't get lost as the archive grows. The ideal trigger: run EXPLAIN ANALYZE on the lateral join after go-live; if seq scan on transcription_blocks appears, that's the signal.

No other infra changes needed

No new Docker services, no config changes, no MinIO interaction, no additional CI secrets. The OpenAPI regen is an existing script. CI impact is limited to the new Testcontainers integration test (within the 2-minute budget per Sara's note). All good here.

Author
Owner

Implemented in PR #242

All acceptance criteria from the issue have been implemented on branch feat/issue-219-search-snippets.

Commits

  • 47da0fa — feat(search): add MatchOffset record for character-level highlight positions
  • 8cbecd4 — feat(search): add SearchMatchData record for per-document match signals
  • 003d68e — feat(search): add DocumentSearchResult.withMatchData() factory with match overlay map
  • 8526e6c — test(search): add DocumentSearchEnrichmentTest for findEnrichmentData native query
  • 8c7ce14 — feat(search): enrich searchDocuments with per-document match data
  • 9673cef — feat(search): add applyOffsets utility and regenerate API types with MatchOffset/SearchMatchData
  • 93c7843 — feat(search): render title highlights and transcription snippets in DocumentList
  • ddb87db — feat(search): pass matchData from server load to DocumentList

What was built

Backend:

  • MatchOffset record (start, length as Java char positions, JS-compatible)
  • SearchMatchData record: transcriptionSnippet, titleOffsets, senderMatched, matchedReceiverIds, matchedTagIds
  • DocumentRepository.findEnrichmentData(): native SQL with lateral join + ts_headline using chr(1)/chr(2) sentinel delimiters to avoid regex ambiguity
  • DocumentService.enrichWithMatchData(): parses headlines → List<MatchOffset>, short-circuits for empty queries
  • DocumentSearchResult wraps documents + matchData; controller returns it directly

Frontend:

  • applyOffsets(text, offsets) utility in $lib/search.ts: sorts, merges overlapping spans, splits string into TextSegment[]
  • DocumentList renders <mark> elements for highlighted title terms (no {@html})
  • Transcription snippet shown below each result card when a match exists (data-testid="search-snippet")

Tests added: 13 backend integration tests, 3 service tests, 10 applyOffsets unit tests, 4 component tests. All 832 frontend tests green.

Reference: marcel/familienarchiv#219