docs(adr): ADR-008 SQL-level FTS pagination via window-function CTE
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
68
docs/adr/008-fts-sql-pagination.md
Normal file
@@ -0,0 +1,68 @@
# ADR-008: SQL-level pagination for full-text search via window-function CTE

## Status

Accepted

## Context

`DocumentRepository.findAllMatchingIdsByFts` (formerly `findRankedIdsByFts`) returns all matching document IDs for an FTS query. `DocumentService.searchDocuments` then paginates in memory on the RELEVANCE sort path.

A pre-production audit against 1,520 documents measured:

```
rows_per_call: 911 / call (query: "walter")
```
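The in-memory pagination described above amounts to slicing the full ranked ID list on every request. A minimal sketch of that step, assuming a zero-based page index; `InMemoryPaging` and `pageOf` are hypothetical names for illustration, not the actual code in `DocumentService.searchDocuments`:

```java
import java.util.List;
import java.util.UUID;

/** Hypothetical helper illustrating in-memory pagination over a ranked ID list. */
final class InMemoryPaging {
    private InMemoryPaging() {}

    /** Returns the requested page (zero-based) of an already-ranked ID list. */
    static List<UUID> pageOf(List<UUID> rankedIds, int page, int size) {
        if (page < 0 || size <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and size > 0");
        }
        long from = (long) page * size; // long arithmetic avoids int overflow
        if (from >= rankedIds.size()) {
            return List.of(); // page lies past the last match
        }
        int to = (int) Math.min(from + size, rankedIds.size());
        return rankedIds.subList((int) from, to);
    }
}
```

With the audited query, all 911 IDs are fetched from the database even though a page of 50 uses only 50 of them.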

At current scale, fetching the full match set is acceptable: 911 UUIDs ≈ 14 KB, with millisecond-level database time. At 100 K+ documents two failure modes emerge:

1. **Memory**: a broad query returns ~60 K UUIDs ≈ 1 MB per request, multiplied by concurrent users.
2. **Latency**: the `LATERAL` join does work proportional to match-set size; at 60 K matches the FTS step alone exceeds 100 ms per query.

Tracked as finding **F-31 (High)** in the pre-production architectural review.
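The size figures above appear consistent with counting 16 bytes (128 bits) per UUID. A small sketch of that arithmetic, assuming raw UUID values only; JVM object headers, list references, and wire-protocol overhead are not included, so real heap usage would be higher. `UuidPayloadEstimate` is a hypothetical name:

```java
/** Hypothetical helper: raw payload size of a list of UUIDs. */
final class UuidPayloadEstimate {
    /** A UUID is a 128-bit value, i.e. 16 bytes, excluding any object overhead. */
    static final int UUID_BYTES = 16;

    private UuidPayloadEstimate() {}

    /** Raw bytes needed to hold {@code matchCount} UUID values. */
    static long rawBytes(long matchCount) {
        return matchCount * UUID_BYTES;
    }
}
```

`rawBytes(911)` gives 14,576 bytes (about 14 KB), and `rawBytes(60_000)` gives 960,000 bytes (about 1 MB) per request before multiplying by the number of concurrent requests.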

## Decision

Push pagination and rank ordering into SQL for the RELEVANCE sort path when no non-text filters are active (pure full-text search):

```sql
WITH q AS (
    -- Build a prefix-matching tsquery: every lexeme 'x' becomes 'x':*
    -- (\1 is the captured lexeme; inside a Java string literal it is written \\1)
    SELECT CASE WHEN websearch_to_tsquery('german', :query)::text <> ''
                THEN to_tsquery('simple', regexp_replace(
                         websearch_to_tsquery('german', :query)::text,
                         '''([^'']+)''', '''\1'':*', 'g'))
           END AS pq
), matches AS (
    SELECT d.id, ts_rank(d.search_vector, q.pq) AS rank
    FROM documents d, q
    WHERE d.search_vector @@ q.pq
)
SELECT id, rank, COUNT(*) OVER () AS total
FROM matches
ORDER BY rank DESC, id
OFFSET :offset LIMIT :limit
```

`COUNT(*) OVER ()` returns the full match count alongside each page row in a single round-trip, so no separate count query is needed. The window function is evaluated before `OFFSET`/`LIMIT` is applied, which is why `total` reflects all matches rather than the page size.
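On the application side, the repeated `total` column can be read from any row of the page. A minimal sketch of that mapping, assuming the three result columns `id`, `rank`, and `total` from the query above; the `Row` and `PageResult` types and the `toPage` method are hypothetical and not taken from the codebase:

```java
import java.util.List;
import java.util.UUID;

/** Hypothetical mapping from paged FTS rows to a page of IDs plus the total match count. */
final class FtsPageMapping {
    private FtsPageMapping() {}

    /** One result row: document id, rank, and the window-function total. */
    record Row(UUID id, float rank, long total) {}

    /** IDs of the current page together with the overall match count. */
    record PageResult(List<UUID> ids, long totalElements) {}

    static PageResult toPage(List<Row> rows) {
        // COUNT(*) OVER () repeats the same value on every row, so the first row suffices.
        // With no rows at all, this query carries no total; 0 is reported here.
        long total = rows.isEmpty() ? 0L : rows.get(0).total();
        List<UUID> ids = rows.stream().map(Row::id).toList();
        return new PageResult(ids, total);
    }
}
```

One caveat of this design: if `:offset` points past the last match, the query returns no rows and therefore no `total`, even though matches exist. A caller that needs to distinguish "no matches" from "page out of range" would need a fallback, for example re-querying the first page.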

`rows_per_call` for the FTS query drops from match-set size (911) to page size (≤ 50).

When non-text filters (date range, sender, receiver, tags, status) are also active, the existing path is preserved: `findAllMatchingIdsByFts` returns all ranked IDs, which are passed as an `IN` clause to the JPA Specification, and `totalElements` comes from the JPA `Page.getTotalElements()`. This keeps the count accurate across the combined filter set.
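The routing rule described in this section can be summarised as a single predicate. A hedged sketch, assuming a text query is present and that the sort options are the three named in this ADR; `SearchPathSelector`, `Sort`, and `Path` are hypothetical names, and the real service may structure this differently:

```java
/** Hypothetical selector for the two search paths described in this ADR. */
final class SearchPathSelector {
    private SearchPathSelector() {}

    /** Sort options mentioned in the ADR; the real enum may contain more. */
    enum Sort { RELEVANCE, SENDER, RECEIVER }

    enum Path {
        /** Paged CTE query; rank ordering and total count come from SQL. */
        SQL_PAGINATED,
        /** Existing path: all ranked IDs, then an IN clause on the JPA Specification. */
        ID_LIST_WITH_SPECIFICATION
    }

    /** Picks the path for a search that contains a text query. */
    static Path select(Sort sort, boolean hasNonTextFilters) {
        // Only pure full-text search with RELEVANCE sort is pushed down to SQL.
        if (sort == Sort.RELEVANCE && !hasNonTextFilters) {
            return Path.SQL_PAGINATED;
        }
        return Path.ID_LIST_WITH_SPECIFICATION;
    }
}
```

Keeping the decision in one place makes it straightforward to widen the SQL path later, for instance if filters are ever folded into the native query.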

## Alternatives Considered

**1. Two-query approach (separate COUNT + paged SELECT)**
Correct, but doubles round-trips. The window function achieves the same result in one query.

**2. Capped result set with a user-visible warning**
Return at most N results (e.g. 500) and show "showing top 500 of many results". Simpler, but degrades UX for broad queries and doesn't reduce latency proportionally (still scans N rows).

**3. Full SQL rewrite combining FTS + JPA Specification filters**
Possible via a native query that embeds all filter predicates. It would eliminate the in-memory SENDER/RECEIVER sort paths and the two-phase approach, at the cost of high complexity, tight coupling to schema details, and the loss of type-safe JPA Specification composition. Deferred to a future refactor if scale demands it.

## Consequences

- **`rows_per_call` for pure-text FTS searches on the RELEVANCE sort path drops to ≤ page size.** This is the primary metric.
- **SENDER and RECEIVER sort paths stay in-memory** for combined text+filter queries. For pure-text queries with SENDER/RECEIVER sort, the current approach (fetch all matched IDs, build spec, load all matched entities, sort in-memory) still runs. This is acceptable while the archive stays under ~10 K documents.
- **RELEVANCE sort with text+filters still loads the full filtered entity set in-memory.** The filtered set is typically much smaller than the raw FTS match set, so the cost is bounded by filter selectivity, not total match count.
- **`findAllMatchingIdsByFts` is retained** for: (a) the bulk-edit "select all" fast path (`findIdsForFilter`), (b) the document density chart (`getDensity`), and (c) the SENDER/RECEIVER in-memory sort paths.