feat: show search result snippets with match highlighting #219
Problem
When searching documents (full-text via `q=`), the result list shows title, date, sender, receivers, and tags — but not why a document matched. Users can't tell whether the match came from the title, the transcription, a tag, or a sender alias. With ~1500 documents and growing, this means opening each result to check its relevance.

Proposal
Show a short text snippet (1–2 lines) below each search result that indicates where the match was found, with the matching term highlighted.
Open questions
Priority
Low — nice-to-have for later, captured here so we don't forget.
🎨 Leonie Voss — UI/UX Design Discussion
Pre-implementation design decisions agreed with @marcel.
1. Highlight style
Use `<mark>` (semantic HTML for search highlights). Visual treatment: `bg-brand-mint text-brand-navy font-bold`. Both background color and bold weight — color-blind users aren't relying on tint alone.

2. Snippet source & length
Backend supplies the snippet (not frontend extraction). Content is a full transcription line — a natural text unit, no mid-sentence truncation needed. Frontend applies a safety cap of ~200 characters with a trailing `…` as a defensive fallback for unusually long lines.

3. Field attribution — what gets highlighted where
Backend signals which fields matched. Frontend renders accordingly:
- `<mark>` on the matched term
- `<mark>` (backend handles this to correctly support stemmed matches)
- `bg-brand-mint` fill instead of the default border style — no character-level markup needed

4. Empty / no-snippet case
Nothing shown. The backend only returns match signals for fields displayed in the card — this case shouldn't arise by design. No fallback label.
5. Snippet placement & typography
Snippet line sits below the title, above the date/sender metadata row. Typography: `text-sm text-gray-600 font-serif italic` — subordinate to the title, italic signals quoted source text, serif aligns with document content conventions.

6. Single vs. multiple snippets
One transcription snippet per card — the highest-ranked line from the backend. Multiple hits are discoverable inside the document itself. Keeps card height consistent and result lists scannable.
The design avoids a dedicated snippet column for tag/sender matches (they're highlighted in place), which keeps the card layout clean. The only new visual element per card is a single italic line below the title — minimal footprint for a meaningful gain in search relevance clarity.
🏗️ Markus Keller — Architecture Discussion
Pre-implementation architectural decisions agreed with @marcel. Builds on the UI discussion above.
1. API response shape
Keep the existing two-step search flow (FTS UUIDs → JPA `Specification` entity load). No full query rewrite. `DocumentSearchResult` gains a `Map<UUID, SearchMatchData>` alongside the existing `List<Document>` — documents provide card fields, the map provides match overlay data keyed by document ID. `DocumentSearchResult.of()` factory updated accordingly.

2. No HTML in the API — structured offsets only
Backend returns typed match data, never pre-rendered `<mark>` HTML strings. `{@html ...}` in Svelte is an XSS vector and is avoided. The frontend applies a small utility function to split strings and insert `<mark>` elements at the specified offsets.

3. Snippet extraction — lateral join on `transcription_blocks`

The `search_vector` on `documents` aggregates all transcription block text into one blob — `ts_headline()` on it would not return individual lines. To return the best matching transcription line, a separate native SQL query with a `LEFT JOIN LATERAL` on `transcription_blocks` is needed.

No schema change required. If this becomes a bottleneck at scale, a `tsvector` column on `transcription_blocks` is the future fix.
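A sketch of what that lateral-join query could look like, using the `«»` delimiters discussed at this stage. Table and column names (`transcription_blocks.content`, `document_id`) are assumptions from this thread, not final code:

```sql
-- Hypothetical sketch: best-ranked transcription line per matched document.
SELECT d.id,
       ts_headline('german', best.content,
                   websearch_to_tsquery('german', :query),
                   'StartSel=«, StopSel=»') AS snippet
FROM documents d
LEFT JOIN LATERAL (
    SELECT tb.content
    FROM transcription_blocks tb
    WHERE tb.document_id = d.id
      AND to_tsvector('german', tb.content) @@ websearch_to_tsquery('german', :query)
    ORDER BY ts_rank(to_tsvector('german', tb.content),
                     websearch_to_tsquery('german', :query)) DESC
    LIMIT 1
) best ON TRUE
WHERE d.id IN (:documentIds);
```

Without the dedicated `tsvector` column, `to_tsvector()` is recomputed per block inside the lateral subquery — fine at the current document count, and exactly the part the future index would replace.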
4. `SearchMatchData` record shape

Explicit typed fields — no generic `Map<String, List<MatchOffset>>` that can't express per-entity granularity. Multi-value entity fields (receivers, tags) use matched ID lists so the frontend can highlight specific chips — `matchedFields: ["receivers"]` would be ambiguous with multiple receivers.
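A minimal sketch of the typed records this could produce. Field names follow the discussion in this thread; the exact types are assumptions until the code is written:

```java
import java.util.List;
import java.util.UUID;

// Hypothetical sketch of the typed match-data records discussed above.
public class SearchMatchSketch {

    /** One highlight span: character start and length within a field's text. */
    public record MatchOffset(int start, int length) {}

    /** Per-document match signals, keyed by document ID in the response map. */
    public record SearchMatchData(
            String transcriptionSnippet,     // best-ranked line, null if no transcription match
            List<MatchOffset> titleOffsets,  // highlight spans within the title
            boolean senderMatched,           // whether the sender chip gets the fill style
            List<UUID> matchedReceiverIds,   // receiver chips to highlight
            List<UUID> matchedTagIds) {}     // tag chips to highlight

    public static void main(String[] args) {
        SearchMatchData data = new SearchMatchData(
                null, List.of(new MatchOffset(0, 9)), true, List.of(), List.of());
        System.out.println(data.titleOffsets());
    }
}
```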
5. Service layer ownership

Snippet/highlight enrichment stays in `DocumentService`. It's a direct enrichment of the search pipeline, not a separate domain concern. No new service.

The existing `DocumentSearchResult` DTO and `searchDocuments()` signature need updates; the OpenAPI spec and frontend type generation will need a regeneration pass after this change.

👨‍💻 Felix Brandt — Implementation Discussion
Pre-implementation decisions on the concrete coding contract. Builds on the UI and architecture discussions above.
1. Title offset computation — `ts_headline()` with delimiter parsing

Simple Java `indexOf` is wrong for German FTS: inflected forms ("geschrieben") won't match a literal query ("schreiben"). Use `ts_headline()` with custom delimiters in the same native SQL query used for the transcription lateral join. Java parses the `«»` delimiters to produce `List<MatchOffset>`. Same pattern applies to sender/receiver names.
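A sketch of that delimiter parsing, under the `«»` convention proposed here (the collision risk with real guillemets in content is flagged in a later comment; this sketch ignores it). The class and method names are illustrative, not the final code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: convert a ts_headline() result like "Er hat «geschrieben»"
// into offsets relative to the original, delimiter-free title text.
public class HeadlineParser {

    public record MatchOffset(int start, int length) {}

    public static List<MatchOffset> parse(String headline) {
        List<MatchOffset> offsets = new ArrayList<>();
        StringBuilder plain = new StringBuilder();
        int markStart = -1;
        for (int i = 0; i < headline.length(); i++) {
            char c = headline.charAt(i);
            if (c == '«') {
                markStart = plain.length();        // highlight begins at the current plain-text position
            } else if (c == '»' && markStart >= 0) {
                offsets.add(new MatchOffset(markStart, plain.length() - markStart));
                markStart = -1;
            } else {
                plain.append(c);                   // delimiter chars are not part of the original text
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(parse("Er hat den Brief «geschrieben»"));
    }
}
```

A headline without delimiters (no match in that field) simply yields an empty list — no special casing needed.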
2. Enrichment location — private method inside `searchDocuments()`

`searchDocuments()` remains the single public entry point. A private `enrichWithMatchData(List<Document> docs, String query)` runs one batched native SQL query for all matched document IDs. The controller does no orchestration — it calls `searchDocuments()` and gets a fully-enriched `DocumentSearchResult` back.

3. Frontend `applyOffsets` utility — `TextSegment[]`, no `{@html}`

Lives in `$lib/search.ts`. The template iterates segments with `{#each}` and wraps highlighted ones in `<mark>`. No `{@html}`, no XSS surface. Empty offsets returns a single plain segment — safe no-op.
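A sketch of what `applyOffsets` could look like. The `TextSegment` shape and the sort/merge behavior are assumptions consistent with this thread; the actual `$lib/search.ts` implementation may differ:

```typescript
// Hypothetical sketch of the applyOffsets utility described above.
export interface MatchOffset { start: number; length: number; }
export interface TextSegment { text: string; highlighted: boolean; }

export function applyOffsets(text: string, offsets: MatchOffset[]): TextSegment[] {
  // Drop out-of-range offsets defensively, then sort by start position.
  const sorted = offsets
    .filter(o => o.start >= 0 && o.length > 0 && o.start + o.length <= text.length)
    .sort((a, b) => a.start - b.start);

  // Merge overlapping or adjacent spans so <mark> elements never nest.
  const merged: MatchOffset[] = [];
  for (const o of sorted) {
    const last = merged[merged.length - 1];
    if (last && o.start <= last.start + last.length) {
      last.length = Math.max(last.start + last.length, o.start + o.length) - last.start;
    } else {
      merged.push({ ...o });
    }
  }

  // Split the string into alternating plain / highlighted segments.
  const segments: TextSegment[] = [];
  let pos = 0;
  for (const o of merged) {
    if (o.start > pos) segments.push({ text: text.slice(pos, o.start), highlighted: false });
    segments.push({ text: text.slice(o.start, o.start + o.length), highlighted: true });
    pos = o.start + o.length;
  }
  if (pos < text.length || segments.length === 0) {
    segments.push({ text: text.slice(pos), highlighted: false });
  }
  return segments;
}
```

Empty offsets fall through to a single plain segment, so the same code path renders unmatched titles.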
4. OpenAPI `@Schema` annotations

`@Schema(requiredMode = REQUIRED)` on all `SearchMatchData` and `MatchOffset` fields except `transcriptionSnippet` (explicitly nullable — null when no transcription match). Collections are always non-null (empty list, never null). An `npm run generate:api` regeneration pass is required after backend changes, committed alongside them.

5. Test strategy
- `enrichWithMatchData` in isolation: mock the batched repository call, assert correct offset parsing from `«»` delimiters, correct `matchedTagIds`, null snippet when no block matched.
- `ts_headline()` against real PostgreSQL 16 (Testcontainers): insert a document with known transcription blocks, assert correct snippet line and title offsets. This is the test that catches stemming and delimiter parsing bugs.
- `applyOffsets` (Vitest): empty offsets, single offset, multiple offsets, offset at string boundaries.

The key implementation detail that emerged: `ts_headline()` with delimiter parsing is the only correct approach for title offsets given German FTS stemming. A naive `indexOf` would silently produce wrong highlights for inflected forms — which is the norm, not the edge case, in a 19th-century German family archive.

👨‍💻 Felix Brandt — Senior Fullstack Developer
My pre-implementation design contract is already captured above. A few remaining implementation observations before we start coding:
Edge cases worth a second look
- `«»` delimiter collision: `ts_headline()` with `StartSel=«`, `StopSel=»` is elegant, but `«` and `»` (`\u00AB`/`\u00BB`) are valid German typographic quotation marks and do appear in historical archive documents. If a title contains `«Mein Leben»`, the `«»` parser will produce wrong offsets silently. Safer delimiters: `\x01`/`\x02` (ASCII control characters — cannot appear in text content), or a sequence like `|||HL_START|||`/`|||HL_END|||`. Worth confirming before the native query is written.
- Empty `docs` list in `enrichWithMatchData`: an `IN ()` clause with zero elements is invalid SQL in most dialects. The private method should short-circuit with an empty map when `docs.isEmpty()`. This edge is easy to miss and would only surface on a search returning zero results.
- Zero-length query: when `q=` is empty or absent, `enrichWithMatchData` should return an all-empty/null `SearchMatchData` per document and not call into `ts_headline()` or the lateral join at all. The `websearch_to_tsquery('german', '')` call raises an error in PostgreSQL.
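Both short-circuits above amount to two guard clauses at the top of the method. A stubbed sketch (the repository call and return type are simplified; only the guard logic is the point):

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of the guard clauses discussed above.
public class EnrichmentGuard {

    public static Map<UUID, String> enrichWithMatchData(List<UUID> docIds, String query) {
        // Short-circuit: an empty ID list would produce an invalid "IN ()" clause,
        // and a blank query would feed an empty string into websearch_to_tsquery.
        if (docIds.isEmpty() || query == null || query.isBlank()) {
            return Map.of();
        }
        // ... the real implementation would run the batched native query here.
        return Map.of();
    }
}
```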
Test fixture detail

The Testcontainers integration test needs at least three transcription blocks per document with varying FTS rank — one block with the query term twice, one with it once, one without it — to actually verify that the highest-ranked block is selected and not just the first or last.
Question
Is `MatchOffset(int start, int length)` measured in Java `String` characters (UTF-16 code units) or Unicode code points? `ts_headline()` operates on PostgreSQL's UTF-8 character positions. For ASCII-only Latin text this is irrelevant, but German historical documents may include `ß`, `ü`, `ö`, `ä` (all BMP, safe in UTF-16), and occasionally older ligatures. Worth verifying the offset semantics are consistent end-to-end before the parsing utility is written.

🏗️ Markus Keller — Application Architect
The architectural decisions in my pre-implementation comment hold. A few structural questions that emerged when reviewing the full picture:
Questions & Observations
- Short-circuit when no text query is present: the search endpoint accepts filter-only requests (by date, tag, sender) with no `q=` parameter. In that case, the enrichment path — `ts_headline()`, the lateral join, offset parsing — must not run. The `searchDocuments()` signature should make this explicit: pass a nullable `String query` to `enrichWithMatchData`, and return an empty `Map` immediately when it's null. Letting the enrichment run with a null query risks a `NullPointerException` inside the native SQL call.
- `DocumentSearchResult` API change scope: adding `Map<UUID, SearchMatchData>` to the response is a backward-incompatible change. Any existing API test (`backend/api_tests/`) that asserts on the search response shape will fail. Worth identifying those files before starting implementation so the test-update scope is known.
- Race condition on document deletion: the two-step flow (FTS UUIDs → JPA entity load) already exists. The lateral join in `enrichWithMatchData` runs after the JPA load, scoped to `WHERE d.id IN (...)`. If a document is deleted between the JPA load and the enrichment query, the lateral join just returns no rows for that ID — the map entry is absent, which is safe. But this path should be explicitly covered: absent map entry → frontend treats it as no match data, not as an error.
- `SearchMatchData` as a Java record vs. a class: the record declaration is the right call — it's immutable, structurally equal, and readable. Confirm it carries the `@Schema` annotations correctly for OpenAPI generation. Java records support `@Schema` on canonical constructor parameters, but the OpenAPI annotation placement differs from fields on a class. Worth verifying with a quick `./mvnw clean package -DskipTests` + spec inspection before going further.

Suggestion
The `DocumentSearchResult.of()` factory is the one place that wires documents and match data together. Given that this method is now doing more work (keying a map, handling absent entries), it's worth naming clearly: `DocumentSearchResult.withMatchData(List<Document>, Map<UUID, SearchMatchData>)` would communicate intent better than an overloaded `of()`.

🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist
The design decisions in my earlier comment cover the main cases well. A few accessibility and mobile details that need attention during implementation:
Accessibility of `<mark>` in context

`<mark>` is semantically correct for search highlights and most screen readers (NVDA 2024+, VoiceOver on macOS) announce it as "highlighted" or similar. However, JAWS in some configurations reads it without any special announcement, meaning a screen reader user won't know why a term is visually distinct. Worth adding a visually-hidden context hint — either a `<span class="sr-only">` near the search results heading (e.g. "Matches highlighted") or an `aria-label` on the results list. Not a blocker, but something to address before launch given the senior audience.

Contrast check for tag chip highlight state
The default tag chip uses a border style (brand-sand border, white background). The matched-tag state switches to `bg-brand-mint` fill. Need to confirm the tag label text color on that background:

- `text-brand-navy` (#002850 on #A6DAD8): contrast ≈ 9.6:1 — AAA ✓
- `text-gray-600` or similar: verify before coding

When in doubt, set matched-tag text explicitly to `text-brand-navy` so the contrast is guaranteed.

Mobile: snippet line on small screens
On 320px viewports, the snippet line `text-sm text-gray-600 font-serif italic` sits between the title and the metadata row. A 200-character snippet will wrap to 4–5 lines on mobile, which can make the card feel dense. Recommend adding `line-clamp-2` (Tailwind's `[@supports(display:-webkit-box)]:line-clamp-2`) to limit the visible snippet to two lines with a trailing ellipsis. This preserves card scannability on mobile while keeping the full snippet reachable via the document detail view.

Spacing token between snippet and metadata row

The design comment doesn't specify the vertical gap between the snippet line and the date/sender row. Suggest `mt-1.5` (6px) — enough to visually separate them without adding significant height to the card. Should be specified so implementation is consistent across all result cards.

🧪 Sara Holt — QA Engineer & Test Strategist
Felix's test strategy covers the main cases. Here's what I'd want filled in before implementation starts:
Missing acceptance criteria
The issue describes the feature and open questions but has no AC. Before picking this up, define the minimum to call it done:
"schreiben"highlightgeschriebenin a title? (Stemming)q=) show no highlights? (No false positives)Without AC, "done" is subjective and scope creep risk is real.
Edge cases Felix's test plan doesn't explicitly cover
"Brandt Brand"could produce two offsets that overlap ([0,5]and[0,8]on "Brandt"). What doesapplyOffsetsdo? Merge to longest? Take first? Silently produce garbled output? This needs a Vitest test and a defined behavior before implementation.q=): What does the API return when called withq=? The enrichment should produce all-emptySearchMatchData, not an error. Integration test needed.transcriptionSnippetis null, no 500.ts_headline()returns the original title text unmodified (no delimiters). The Java parser must produce an emptyList<MatchOffset>, not throw on the absence of delimiters. Unit test needed.Test naming reminder
Felix mentioned test names but didn't list them — worth writing them out in the issue so they can be checked off during code review.
CI time impact
The Testcontainers integration test spins up a real PostgreSQL container. If the search integration tests share a container with existing tests (via a `static @Container` at the class level), the overhead is just the test time itself. If it needs a fresh container, add ~10–15 seconds. Worth verifying the total integration test suite stays under the 2-minute target.

🔐 Nora Steiner — Application Security Engineer
Overall the design is sound from a security perspective. The `{@html}`-free approach with `TextSegment[]` is exactly the right call. A few things worth verifying before and during implementation:

✅ What's already right
The decision to return structured offsets from the backend rather than pre-rendered HTML eliminates the XSS surface entirely. This is the correct architecture. A backend that returned `<mark>` strings would require `{@html}` on the frontend, which is an XSS vector — the design correctly avoids this. No concerns there.

Offset bounds validation in `applyOffsets`

The `applyOffsets` frontend utility receives offset data from the backend. In normal operation the backend is trusted, but defensive validation in the frontend utility is still good practice. A buggy backend (or future regression in the enrichment SQL) that returns out-of-range offsets should produce no highlight, not a runtime error. JavaScript's `slice()` handles out-of-bounds gracefully, but explicit filtering makes the intent clear and prevents potential downstream issues.
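A minimal sketch of that defensive filter, assuming a `MatchOffset` shape of `{ start, length }`:

```typescript
interface MatchOffset { start: number; length: number; }

// Drop offsets that fall outside the text: a bad offset yields no highlight
// rather than a runtime error or a garbled segment.
function sanitizeOffsets(text: string, offsets: MatchOffset[]): MatchOffset[] {
  return offsets.filter(
    o => Number.isInteger(o.start) && Number.isInteger(o.length) &&
         o.start >= 0 && o.length > 0 && o.start + o.length <= text.length
  );
}
```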
Authorization on the enriched search endpoint

Confirm `GET /api/documents?q=...` is covered by `@RequirePermission(READ_ALL)`. The search results now include transcription text snippets — actual content from the documents. An unauthenticated or under-privileged call leaking these snippets is a more impactful information disclosure than leaking just the document title. This permission check should already be in place, but it's worth verifying explicitly given the new content being exposed.

Delimiter injection via document content
Felix's comment flags the `«»` collision risk from a correctness angle. From a security angle: if document titles are user-controlled (they are — editors can set any title), a crafted title like `«normal text» fake highlight «real content»` could manipulate the offset parser into producing false highlights. This is low-severity (no code execution, just misleading UI), but it's another argument for choosing `\x01`/`\x02` delimiters that cannot appear in user-supplied content.

No concerns on the backend SQL

`ts_headline()` and the lateral join use parameterized queries (`:query` named parameter) — no SQL injection surface. The FTS input goes through `websearch_to_tsquery()`, which handles sanitization internally. Clean.

⚙️ Tobias Wendt — DevOps & Platform Engineer
No new infrastructure needed for this feature. That's good. A few observations from the platform side:
OpenAPI spec drift should be enforced in CI
The implementation plan notes that `npm run generate:api` must be committed alongside backend changes. Currently there's no CI gate that verifies the generated TypeScript types match the live spec. With `SearchMatchData` and `MatchOffset` being new types, a future refactor that changes the record shape without regenerating types would fail silently at runtime. Consider adding a CI step that regenerates the types and fails on uncommitted diffs. This catches spec drift at PR time, not in production.
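One possible shape for that gate, assuming the regeneration script writes the generated types into the working tree (the `src/lib/api/` path is an assumption):

```shell
# Regenerate API types from the current OpenAPI spec, then fail the job
# if the committed types differ from the freshly generated ones.
npm run generate:api
git diff --exit-code -- src/lib/api/
```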
Query performance: slow query logging
The lateral join is a new SQL pattern in this codebase. It runs on every text search. At 1,500 documents it should be fast; at 15,000 it may not be. Is slow query logging enabled on PostgreSQL? The default `log_min_duration_statement` in the Docker Compose setup is worth checking. Setting it to `500ms` in `application-dev.yaml` or the Compose environment would surface regressions during development before they reach production.

If the current compose file doesn't set this, a one-line addition covers it.
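For example, as a command override on the PostgreSQL service in the compose file (the service name `postgres` is an assumption; `log_min_duration_statement` takes milliseconds):

```yaml
services:
  postgres:
    command: ["postgres", "-c", "log_min_duration_statement=500"]
```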
Future index tracked?
Markus's comment notes: "If this becomes a bottleneck at scale, a `tsvector` column on `transcription_blocks` is the future fix." Recommend opening a follow-up issue to track this before closing this one, so it doesn't get lost as the archive grows. The ideal trigger: run `EXPLAIN ANALYZE` on the lateral join after go-live; if a seq scan on `transcription_blocks` appears, that's the signal.

No other infra changes needed
No new Docker services, no config changes, no MinIO interaction, no additional CI secrets. The OpenAPI regen is an existing script. CI impact is limited to the new Testcontainers integration test (within the 2-minute budget per Sara's note). All good here.
Implemented in PR #242
All acceptance criteria from the issue have been implemented on branch `feat/issue-219-search-snippets`.

Commits

- `47da0fa` — feat(search): add MatchOffset record for character-level highlight positions
- `8cbecd4` — feat(search): add SearchMatchData record for per-document match signals
- `003d68e` — feat(search): add DocumentSearchResult.withMatchData() factory with match overlay map
- `8526e6c` — test(search): add DocumentSearchEnrichmentTest for findEnrichmentData native query
- `8c7ce14` — feat(search): enrich searchDocuments with per-document match data
- `9673cef` — feat(search): add applyOffsets utility and regenerate API types with MatchOffset/SearchMatchData
- `93c7843` — feat(search): render title highlights and transcription snippets in DocumentList
- `ddb87db` — feat(search): pass matchData from server load to DocumentList

What was built
Backend:

- `MatchOffset` record (`start`, `length` as Java `char` positions, JS-compatible)
- `SearchMatchData` record: `transcriptionSnippet`, `titleOffsets`, `senderMatched`, `matchedReceiverIds`, `matchedTagIds`
- `DocumentRepository.findEnrichmentData()`: native SQL with lateral join + `ts_headline` using `chr(1)`/`chr(2)` sentinel delimiters to avoid regex ambiguity
- `DocumentService.enrichWithMatchData()`: parses headlines → `List<MatchOffset>`, short-circuits for empty queries
- `DocumentSearchResult` wraps documents + matchData; controller returns it directly

Frontend:

- `applyOffsets(text, offsets)` utility in `$lib/search.ts`: sorts, merges overlapping spans, splits string into `TextSegment[]`
- `DocumentList` renders `<mark>` elements for highlighted title terms (no `{@html}`)
- Snippet line rendered in the result card (`data-testid="search-snippet"`)

Tests added: 13 backend integration tests, 3 service tests, 10 `applyOffsets` unit tests, 4 component tests. All 832 frontend tests green.