familienarchiv/docs/adr/008-fts-sql-pagination.md
2026-05-09 16:35:01 +02:00

ADR-008: SQL-level pagination for full-text search via window-function CTE

Status

Accepted

Context

DocumentRepository.findAllMatchingIdsByFts (formerly findRankedIdsByFts) returns all matching document IDs for an FTS query. DocumentService.searchDocuments then paginates in memory on the RELEVANCE sort path.
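
To make the cost concrete, the in-memory step amounts to roughly the following. This is a minimal sketch, not the code in DocumentService; InMemoryPager and slice are hypothetical names. The point it illustrates is that the full ID list must already be materialised before a single page can be cut from it.

```java
import java.util.List;

// Hypothetical helper, not the project's actual code: cuts one page out of
// a fully materialised, already ranked list.
final class InMemoryPager {
    private InMemoryPager() {}

    /** Returns the page at zero-based index page, or an empty list past the end. */
    static <T> List<T> slice(List<T> ranked, int page, int size) {
        if (page < 0 || size <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and size > 0");
        }
        long from = (long) page * size; // long arithmetic avoids int overflow
        if (from >= ranked.size()) {
            return List.of();
        }
        int to = (int) Math.min(from + size, ranked.size());
        return ranked.subList((int) from, to);
    }
}
```

However small the requested page is, the caller has already paid for fetching and holding every element of ranked.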

A pre-production audit against 1,520 documents measured:

rows_per_call: 911 / call  (query: "walter")

At the current scale this is acceptable: 911 UUIDs at 16 bytes each is about 14 KB, and DB time stays in the millisecond range. At 100 K+ documents, two failure modes emerge:

  1. Memory: a broad query matching ~60% of documents, as "walter" does today (911 of 1,520), returns ~60 K UUIDs ≈ 1 MB per request, multiplied by concurrent users.
  2. Latency: the LATERAL join does work proportional to match-set size; at 60 K matches the FTS step alone exceeds 100 ms per query.
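
The memory figures can be reproduced with simple arithmetic, sketched below under two assumptions that the audit does not state: each ID is counted as its raw 16 bytes (JVM object overhead comes on top), and the "walter" hit rate carries over unchanged to a larger archive. FtsMemoryEstimate and its methods are hypothetical names.

```java
// Hypothetical back-of-the-envelope helper; not part of the project.
final class FtsMemoryEstimate {
    static final int UUID_BYTES = 16; // a UUID is 128 bits

    private FtsMemoryEstimate() {}

    /** Raw payload size of matchCount UUIDs, ignoring JVM object overhead. */
    static long rawIdBytes(long matchCount) {
        return matchCount * UUID_BYTES;
    }

    /** Scales an observed match count linearly to a larger corpus (assumption). */
    static long projectedMatches(long matches, long corpus, long targetCorpus) {
        return Math.round((double) matches / corpus * targetCorpus);
    }
}
```

rawIdBytes(911) gives 14,576 bytes, about 14 KB. projectedMatches(911, 1520, 100000) gives 59,934 matches, and rawIdBytes(59934) gives 958,944 bytes, just under 1 MB.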

Tracked as finding F-31 (High) in the pre-production architectural review.

Decision

Push pagination and rank ordering into SQL for the RELEVANCE sort path when no non-text filters are active (pure full-text search):

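-- q turns the user's input into a prefix-matching tsquery.
-- pq is NULL when the input yields no lexemes, so the @@ filter below matches nothing.
WITH q AS (
  SELECT CASE WHEN websearch_to_tsquery('german', :query)::text <> ''
              THEN to_tsquery('simple', regexp_replace(
                       websearch_to_tsquery('german', :query)::text,
                       -- \1 is the captured lexeme; write \\1 inside a Java string literal
                       '''([^'']+)''', '''\1'':*', 'g'))
         END AS pq
), matches AS (
  -- rank every matching document
  SELECT d.id, ts_rank(d.search_vector, q.pq) AS rank
  FROM documents d, q
  WHERE d.search_vector @@ q.pq
)
-- total repeats the full match count on every returned row
SELECT id, rank, COUNT(*) OVER () AS total
FROM matches
ORDER BY rank DESC, id  -- id breaks rank ties so pages are stable
OFFSET :offset LIMIT :limit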

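COUNT(*) OVER () returns the full match count alongside each page row in a single round-trip, so no separate count query is needed. The window function is evaluated before OFFSET and LIMIT are applied, so total reflects all matches, not just the rows on the page.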

rows_per_call for the FTS query drops from match-set size (911) to page size (≤ 50).
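
On the Java side, each result row carries its own copy of the total, so the page can be assembled from the rows alone. The sketch below shows one way to do that with plain records; RankedRow, IdPage, and RankedPageMapper are hypothetical names, and the project's real projection and page types may differ. One edge case is worth handling explicitly: an empty result (no matches, or an offset past the last match) has no row to read the total from.

```java
import java.util.List;
import java.util.UUID;

// Hypothetical row shape mirroring the SELECT list (id, rank, total).
record RankedRow(UUID id, float rank, long total) {}

// Hypothetical page holder; a real service might use Spring Data's PageImpl instead.
record IdPage(List<UUID> ids, long totalElements) {}

final class RankedPageMapper {
    private RankedPageMapper() {}

    static IdPage toPage(List<RankedRow> rows) {
        if (rows.isEmpty()) {
            // No rows means the window count is unavailable; report 0 (assumption).
            return new IdPage(List.of(), 0L);
        }
        long total = rows.get(0).total(); // identical on every row
        List<UUID> ids = rows.stream().map(RankedRow::id).toList();
        return new IdPage(ids, total);
    }
}
```

Reporting 0 for an empty page is a simplification: if callers can request offsets beyond the last match, a fallback count query would be needed to recover the true total.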

When non-text filters (date range, sender, receiver, tags, status) are also active, the existing path is preserved: findAllMatchingIdsByFts returns all ranked IDs, which are passed as an IN clause to the JPA Specification, and totalElements comes from the JPA Page.getTotalElements(). This keeps the count accurate across the combined filter set.
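
The routing rule described above can be summarised as a small decision function. This is an illustrative sketch only: SearchPathSelector, its enums, and the boolean parameters are hypothetical and do not reflect how DocumentService is actually structured.

```java
// Hypothetical sketch of the routing rule; names are illustrative only.
final class SearchPathSelector {
    enum Sort { RELEVANCE, SENDER, RECEIVER }
    enum Path { SQL_PAGINATED, EXISTING }

    private SearchPathSelector() {}

    static Path choose(Sort sort, boolean hasTextQuery, boolean hasNonTextFilters) {
        // Only pure full-text search sorted by relevance takes the new SQL path.
        boolean pureTextRelevance =
                sort == Sort.RELEVANCE && hasTextQuery && !hasNonTextFilters;
        return pureTextRelevance ? Path.SQL_PAGINATED : Path.EXISTING;
    }
}
```

Every other combination, including SENDER or RECEIVER sorts and any text search with additional filters, falls through to the existing two-phase path.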

Alternatives Considered

1. Two-query approach (separate COUNT + paged SELECT)

Correct, but doubles round-trips. The window function achieves the same result in one query.

2. Capped result set with a user-visible warning

Return at most N results (e.g. 500) and show "showing top 500 of many results". Simpler, but degrades UX for broad queries and doesn't reduce latency proportionally (still scans N rows).

3. Full SQL rewrite combining FTS + JPA Specification filters

Possible via a native query that embeds all filter predicates. Eliminates the in-memory SENDER/RECEIVER sort paths and the two-phase approach. High complexity, tight coupling to schema details, loses type-safe JPA Specification composition. Deferred to a future refactor if scale demands it.

Consequences

  • rows_per_call for pure-text FTS searches drops to ≤ page size — the primary metric.
  • SENDER and RECEIVER sort paths stay in-memory for combined text+filter queries. For pure-text queries with SENDER/RECEIVER sort, the current approach (fetch all matched IDs, build spec, load all matched entities, sort in-memory) still runs. This is acceptable while the archive stays under ~10 K documents.
  • RELEVANCE sort with text+filters still loads the full filtered entity set in-memory. The filtered set is typically much smaller than the raw FTS match set, so the cost is bounded by filter selectivity, not total match count.
  • findAllMatchingIdsByFts is retained for: (a) the bulk-edit "select all" fast path (findIdsForFilter), (b) the document density chart (getDensity), and (c) the SENDER/RECEIVER in-memory sort paths.