fix(documents): paginate FTS match-set in SQL instead of loading all matching IDs #468
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Context
backend/src/main/java/org/raddatz/familienarchiv/document/DocumentRepository.java::findRankedIdsByFtsreturnsList<UUID>of all matching document IDs for a FTS query. The controller (#315 added pagination) then paginates in memory.Pre-prod audit ran the query against the live stack:
So a search like "walter" returns ~60% of the document table as UUIDs, then the controller slices the first 50.
At current scale this is fine — 911 UUIDs is ~14 KB of data and ms-level DB time. But two scaling issues:
LATERAL JOINdoes work proportional to match-set size; at 100K rows with 60K matches that's 100 ms+ of CPU per query, not ms.Approach
Push pagination + sort into the SQL. Two sub-queries instead of one:
Query 1: paged ID + total count
The window-function
COUNT(*) OVER ()returns total-match-count alongside the page rows in a single round-trip — no separate count query.Query 2: detail enrichment for the page
Already in place —
DocumentService.searchMatchDataruns against the page IDs only. Keep it, but don't pass it the full ID list — pass the 50-ID page.Repository signature
Service refactor
DocumentService.searchDocumentscurrently callsfindRankedIdsByFts(query)then in-memory paginates. New flow:findFtsPage(query, page, size)→ returns 50 IDs + total.searchMatchData(50 ids)enrichment unchanged.buildResultunchanged.totalElements,totalPages,number,size) populated from the newtotal.Sort modes
Most sort modes (
RELEVANCE,DATE,TITLE,UPLOAD_DATE) move into the SQLORDER BY. The two in-memory ones (SENDER,RECEIVER) stay in-memory but now operate on a 50-row page, not a 911-row list.Critical files
backend/src/main/java/org/raddatz/familienarchiv/document/DocumentRepository.java— newfindFtsPage+ nested typesbackend/src/main/java/org/raddatz/familienarchiv/document/DocumentService.java::searchDocuments— switch to paged repobackend/src/main/java/org/raddatz/familienarchiv/document/DocumentSearchSpec.java— update Specification for non-FTS filter combinationbackend/src/test/java/.../document/DocumentSearchPagedIntegrationTest.java— extend with assert on rows-per-callDocumentSearchResultshape stays the same per #315.Verification
pg_stat_statements: hit/api/documents/search?q=walter&size=50&page=0. The oldSELECT d.id FROM documents drow should drop from 911 rows/call to 50 rows/call.Acceptance criteria
findRankedIdsByFtsremoved (or kept only for tests).OFFSET ... LIMIT ...pushed down to Postgres.pg_stat_statementsconfirmsrows_per_callfor the FTS query is ≤ page size.Effort
M — 1-2 days. The window-function approach is straightforward; the test surface area is the bigger time cost.
Risk if not addressed
At 100K+ documents, search response time grows linearly with match-set size. First user-visible 5xx is likely a search-induced GC pause or pool starvation.
Tracked in audit doc as F-31 (High) — new dynamic finding from
pg_stat_statementsanalysis.🏗️ Markus Keller — Application Architect
Observations
The current
findRankedIdsByFtsquery is structurally sound for the domain — it uses the GIN-indexedsearch_vectorcolumn from V34 and theCROSS JOIN LATERALpattern to avoid function calls per row. The problem is not the query itself but where the pagination boundary sits: in the application service rather than in SQL.The issue proposes moving to a CTE with a window-function
COUNT(*) OVER (). This is the right direction. The window function eliminates the classic "two-query" total-count problem and is well-supported in Postgres 16.However, two things need a closer look:
The
findRankedIdsByFtscaller infindIdsForFilter(line 419 ofDocumentService) is the bulk-edit "select all" fast path. That method intentionally loads all IDs unpaginated — it is not a bug there, it is the feature. The refactor must not change this method's semantics. The issue description focuses onsearchDocumentsonly, which is correct, but the implementation needs to cleanly separate the two call sites and avoid accidentally pushing pagination into the bulk-edit path.The
RELEVANCEsort path is currently the third in-memory sort (alongside SENDER and RECEIVER). The proposed window-function approach pushests_ranksorting into SQL, which eliminates the in-memory RELEVANCE sort entirely. That is a genuine improvement. However, SENDER and RECEIVER still fetch the full match set (they requireLEFT JOINto handle null senders/receivers and cannot be expressed with the simpleORDER BYthe fast path uses). The issue acknowledges this correctly — those two stay in-memory. The code comment at line 589–593 documents this design consciously.The
default FtsPage findFtsPage(...)default method on the repository interface is a valid pattern in Spring Data, but mixing record types (FtsHit,FtsPage) and rawObject[]parsing inside a repository default method puts business logic in the wrong layer. Prefer moving theObject[]→FtsHitmapping to the service, not the repository.No ADR is proposed. Pushing pagination into SQL for a native FTS query is a significant enough architectural decision to warrant a short ADR — especially because it constrains the SENDER/RECEIVER sort paths permanently (they must stay in-memory).
Recommendations
findRankedIdsByFtsin the repository untouched and letfindIdsForFiltercontinue to use it. Add a newfindFtsPageRawquery alongside it rather than replacing it, to avoid breaking the bulk-edit caller.Object[]→ record mapping out of the repository default method and into a private service method (toFtsHit(Object[] row)). Repositories should return data; services should interpret it.docs/adr/.DocumentSearchSpec.javareference in the issue body does not match any file in the codebase — the class isDocumentSpecifications.java. Clarify before implementation to avoid confusion.Open Decisions
FtsHitandFtsPage: placing them as nested types insideDocumentRepository(as the issue shows) couples the repository to value-object semantics. Placing them in thedocumentpackage root is slightly better. Either is defensible for a codebase this size — pick one and be consistent.findRankedIdsByFtspermanently or deprecate it: the method is exercised by 16 test assertions inDocumentFtsTest. If it is kept only forfindIdsForFilter, rename it tofindAllMatchingIdsByFtsto communicate intent. If it is deprecated, the test file needs a new approach for those assertions.👨💻 Felix Brandt — Senior Fullstack Developer
Observations
I read the current
searchDocumentsmethod carefully (lines 577–619 ofDocumentService). Three separate in-memory paths exist today:pageSlicerankMap, sorts, thenpageSliceAll three call
documentRepository.findAll(spec)with noPageable, loading the full entity set. The fast path at line 617 is the only one that pushesPageableinto the DB. The issue correctly identifies that RELEVANCE should join the fast path after this refactor.The proposed
findFtsPageRawreturnsList<Object[]>. ParsingObject[]by column index is fragile — a future column reorder or addition silently breaks the mapping with no compile-time signal. Thedefault FtsPage findFtsPage(...)approach in the issue would contain the parsing, but it still buries an index-brittle cast chain inside the repository.The
searchDocumentsmethod signature is already long (11 parameters). The refactor should not make it longer.The
DocumentSearchResult.of(List.of())early-return pattern at line 583 (empty FTS result) must be preserved exactly — the newfindFtsPagemust returnFtsPage(List.of(), 0)for an empty match set so the caller short-circuit still works cleanly.One clean-code concern: the issue's proposed CTE comment says
/* meta_date DESC NULLS LAST */as a commented-out secondary sort. If it is intended for production, it should be uncommented. If it is a note for the implementation phase, remove it before committing — commented-out code is not documentation.Recommendations
Object[]column indexes as named constants at the top of the parsing method, not inline casts:Object[]→FtsHitparsing inDocumentService.toFtsPage(List<Object[]>), not in the repository. The repository's job ends at returning rows.findIdsForFiltermethod (bulk-edit fast path, line 413) must NOT be touched — it intentionally loads all IDs. Guard against this with a comment on the newfindFtsPageRawmethod:// Paged path only — use findRankedIdsByFts for unpaged bulk-edit.RELEVANCEsort branch (lines 604–613) should be deleted entirely after the refactor, sincets_rank DESCin the SQL now handles it. Delete it, don't comment it out.SENDERandRECEIVERsort branches (lines 594–600) are now operating on a 50-row page instead of a 911-row list. Add a one-line comment to each branch noting that:// In-memory sort on page slice (≤ page size rows) — acceptable.🚀 Tobias Wendt — DevOps & Platform Engineer
Observations
The issue's verification step says: "hit
/api/documents/search?q=walter&size=50&page=0. The oldSELECT d.id FROM documents drow should drop from 911 rows/call to 50 rows/call." That requirespg_stat_statementsto be enabled in the running stack. Looking at the currentdocker-compose.ymland the audit finding F-24,pg_stat_statementsis NOT enabled by default — thedbservice has nocommand: postgres -c shared_preload_libraries=pg_stat_statements ...override.The audit (F-24) already tracked this as a Medium finding. This verification step in issue #468 therefore cannot be executed against the live stack today without first enabling
pg_stat_statements. That's a prerequisite that should be noted explicitly in the acceptance criteria.The Testcontainers integration tests (PostgreSQL 16-alpine) run with a stock Postgres config and will not have
pg_stat_statementsavailable either unless the container is started with--commandoverrides. The issue proposes a "rows_per_call" assertion inDocumentSearchPagedIntegrationTest— that assertion cannot be made viapg_stat_statementsin Testcontainers unless the test explicitly enables it via SQL (CREATE EXTENSION IF NOT EXISTS pg_stat_statements).The
DirtiesContext(AFTER_EACH_TEST_METHOD)annotation in the existingDocumentSearchPagedIntegrationTestis expensive — it rebuilds the Spring context after every test. That's inherited from the PR #316 setup. Adding 3–4 more large-fixture tests to this class will noticeably slow CI.Recommendations
dbservice indocker-compose.yml:EXPLAIN (ANALYZE, FORMAT JSON)instead ofpg_stat_statementsin Testcontainers. It does not require the extension and gives exact row counts from the query plan:result.items().size() == pageSizeandresult.totalElements() == 911— this is behavioral proof without requiring query-level instrumentation.DocumentFtsPagedIntegrationTestclass and annotating it@DirtiesContext(classMode = AFTER_CLASS)rather thanAFTER_EACH_TEST_METHOD, so the Spring context is rebuilt once per class, not after every test. This alone cuts CI overhead significantly for large Testcontainers tests.Open Decisions
pg_stat_statementsbe enabled in the production Compose file only (to avoid local-dev slowness), or in all environments? The extension adds ~5% query overhead. For a project at this scale it is negligible, but the choice has operational implications for thedocker-compose.ymlbase file vs. adocker-compose.prod.ymloverlay.📋 Elicit — Requirements Engineer
Observations
The issue is well-specified for a performance bug: it gives measured baseline data (911 rows/call at 1,520 documents), a concrete scaling model (linear growth at 100K), a proposed solution, and five acceptance criteria. The effort estimate (M, 1–2 days) is reasonable given the surface area described.
However, two requirements gaps need attention:
Gap 1: The
findIdsForFilterpath is unspecified.DocumentService.findIdsForFilter(line 413) also callsfindRankedIdsByFtsand intentionally returns all matching IDs unpaginated — that is the contract for the bulk-edit "select all" feature. The issue's acceptance criterion "findRankedIdsByFts removed" would break this path. The requirement needs to explicitly state what happens tofindIdsForFilter: does it keep callingfindRankedIdsByFts, does it get a different new query, or does it callfindFtsPageRawrepeatedly? This is not a minor detail — it is a functional regression risk.Gap 2: The empty-query behavior is under-specified.
The issue says "empty search match-set returns total=0 + no error" but the current
CROSS JOIN LATERALapproach handles a subtler case: whenwebsearch_to_tsquery('german', :query)returns an empty tsquery (e.g., for stopword-only input like "der die das"), the lateral subquery yieldsNULLforpq, and theWHERE d.search_vector @@ q.pqclause correctly returns zero rows. The new CTE must preserve this behavior. The acceptance criterion should be: "a query consisting entirely of German stopwords returns total=0 without an error."Gap 3: The
sort=RELEVANCEpath change is implicit.The current RELEVANCE sort fetches all matching documents and sorts them in-memory by their position in
rankedIds. After this refactor,ts_rank DESCin the SQL replaces this. This is a behavioral change: the current in-memory sort uses the full ranked list from a separate query (which includes prefix-match wildcards via the LATERAL hack); the new SQL sort ranks only the OFFSET/LIMIT subset. For pages beyond page 0, this means rank ordering is computed within the page, not globally. This is actually correct behavior (consistent with how pagination + relevance works everywhere), but it should be called out as a deliberate behavior change, not an invisible side effect.Recommendations
findIdsForFiltermethod (bulk-edit fast path) is unaffected — it continues to return all matching IDs without pagination."DocumentSearchSpec.javain the "Critical files" section does not exist — the actual file isDocumentSpecifications.java. Fix this before implementation to avoid confusion.🔒 Nora "NullX" Steiner — Security Engineer
Observations
The FTS query uses
websearch_to_tsquery('german', :query)with a named JPA parameter:query. This is correctly parameterized — no SQL injection risk from the query string. TheLATERALCTE and window function approach in the proposed query preserves this parameterization. No new injection surface is introduced.One security-adjacent concern: the current
findRankedIdsByFtsreturns a rawList<UUID>, which is then fed asIN :idstofindEnrichmentData. TheINclause with named parameters in JPA is safe — Hibernate generates a parameterizedIN (?, ?, ...)rather than string interpolation. The newfindFtsPageRawapproach bypasses thisINclause for the ID list by directly returning the page;findEnrichmentDatais then called with only the 50-ID page. This is equally safe and actually reduces theINclause size from potentially 911 items to 50, which also avoids the PostgresINclause overhead at scale.The
Object[]row mapping proposed in the issue casts column 0 toUUID. In Postgres + Hibernate native queries, UUID columns may return asjava.util.UUIDor asStringdepending on the JDBC driver and column type. If the mapping is(UUID) r[0]and the driver returns aString, this throws aClassCastExceptionat runtime — not a security issue but a reliability one that could surface as a 500 in production while silently exposing an error trace. UseUUID.fromString(r[0].toString())defensively.No
@RequirePermissionchange is involved — the search endpoint is read-only and the permission model is unchanged. The refactor is scope-limited to the repository/service internals.Recommendations
Object[]mapping:java.util.UUID(notString) — this has surprised teams before and is worth confirming once at the integration test layer.ts_rankcolumn (column 1) is afloat8in Postgres. Cast as((Number) r[1]).doubleValue()(as the issue shows) is correct —Numberis the safe supertype for bothFloatandDoubledriver returns.🧪 Sara Holt — QA Engineer & Test Strategist
Observations
Test coverage for the current FTS path is extensive:
DocumentFtsTest(16 tests against real Postgres via@DataJpaTest+ Testcontainers) coversfindRankedIdsByFtsdirectly.DocumentSearchPagedIntegrationTest(5 tests) covers the service-level slice/total arithmetic.DocumentServiceSortTestandDocumentServiceTestcover the service layer via Mockito mocks that stubfindRankedIdsByFts.The refactor touches three distinct layers, each requiring test updates:
Layer 1 — Repository (
findFtsPageRaw): The existingDocumentFtsTeststubsfindRankedIdsByFts. Once the newfindFtsPageRawexists alongside it, the test class needs new test methods asserting that:findFtsPageRawreturns ≤pageSizerowstotalin the first row equals the full match count, not the page sizeLayer 2 — Service unit tests (
DocumentServiceSortTest,DocumentServiceTest): Everywhen(documentRepository.findRankedIdsByFts(...))stub will need a companionwhen(documentRepository.findFtsPageRaw(...))stub for the RELEVANCE path. IffindRankedIdsByFtsis kept for the bulk-edit path butfindFtsPageRawhandles the search path, the mocks need to stay aligned. There are at least 5 places inDocumentServiceTestand 3 inDocumentServiceSortTestthat will break.Layer 3 — Integration (
DocumentSearchPagedIntegrationTest): The issue requests adding an assertion for "rows-per-call ≤ page size". As Tobias notes, this cannot be done viapg_stat_statementsin Testcontainers without enabling the extension. UseEXPLAIN ANALYZEin a dedicated test or accept behavioral assertion (result size == page size) as sufficient proof. The existing 5 tests in this class do not exercise FTS at all — they usenulltext queries. New tests need to seed a large fixture and use a real FTS query to prove the paged behavior.The
DirtiesContext(AFTER_EACH_TEST_METHOD)inDocumentSearchPagedIntegrationTestis a performance smell. With 120-document fixture seeding in@BeforeEachand context rebuild after each test, this class is already slower than necessary. Adding a 1,520-document FTS fixture would make it significantly worse. Recommend splitting into a separate class with@DirtiesContext(AFTER_CLASS).Recommendations
DocumentFtsPagedIntegrationTestwith:@DirtiesContext(classMode = AFTER_CLASS)(not AFTER_EACH)@BeforeAll(use@TestInstance(PER_CLASS))ftsPage_firstPage_returns50Items_and_totalEquals911()ftsPage_lastPage_returnsRemainder_andSameTotalEquals911()ftsPage_emptyMatchSet_returnsZeroTotal_noException()ftsPage_stopwordOnlyQuery_returnsZeroTotal_noException()ftsPage_pageSize50_resultSizeIsAtMost50()DocumentServiceSortTestto add a test proving RELEVANCE sort now usesfindFtsPageRaw, notfindRankedIdsByFts— mock both, assert onlyfindFtsPageRawis called on the RELEVANCE path.ftsPage_firstPage_returns50Items_and_totalEquals911()first, verify it fails with the old implementation (returns 911 items), then implementfindFtsPageRaw.Open Decisions
DocumentFtsTestassertions againstfindRankedIdsByFtsstill own the linguistic correctness (stemming, wildcard, stopword) tests after the refactor? Yes — those tests verify query semantics, not pagination. They should stay as-is regardless of whetherfindRankedIdsByFtsis retained or deprecated. If deprecated, migrate the assertions tofindFtsPageRawwithtotal > 0assertions instead of checking the ID list directly.🎨 Leonie Voss — UX Design Lead & Accessibility Strategist
Observations
This is a pure backend/database performance refactor with no planned frontend changes. The issue explicitly states: "No frontend change —
DocumentSearchResultshape stays the same per #315." From a UX perspective, this is the right scope decision.That said, there are two user-visible behaviors worth verifying:
1. Pagination total count accuracy: The current in-memory path returns
sorted.size()astotalElements(from the full fetched list). The new window-function approach returnsCOUNT(*) OVER ()from Postgres. These should agree for every filter combination, but there is a subtle case: if the user searches with both a text query AND filters (date range, sender, tag), the CTE computesCOUNT(*) OVER ()after theWHERE d.search_vector @@ q.pqbut before any JPA Specification filters (sender, date, tag) are applied. The Specification filters happen at thefindAll(spec, pageRequest)call, not inside the native SQL. If the total displayed to the user comes from the window function alone, it would show the FTS match count, not the combined filter + FTS match count. This could cause the UI to show "911 results" while only 23 are visible after applying a date filter — confusing for the user.2. Sort mode behavior for RELEVANCE on page > 0: As Elicit noted, relevance is now ranked within the page rather than globally. A user jumping to page 5 of a relevance-sorted search will see results ranked by local page relevance, which may not feel continuous with page 4. At current data volume (1,520 docs), users rarely reach page 5 for a broad query, but it is worth documenting.
Recommendations
totalElementsin the API response matches the actual number of results a user would receive across all pages when both text and filters are combined. Specifically, check:GET /api/documents/search?q=walter&sender=<uuid>&size=50&page=0— doestotalElementsmatch the count of documents that match both the FTS query AND the sender filter?COUNT(*) OVER ()is computed inside the native SQL (before JPA Specification filters are applied), the fix is to keep thePage.getTotalElements()from the JPAfindAll(spec, pageRequest)call for filtered queries, and only use the window-function total for pure FTS queries. The issue's proposed flow conflates these two; make the handling explicit in the implementation.DocumentSearchResultrecord's field types and names are stable — the TypeScript consumer will see no difference.Open Decisions
Decision Queue — Open Items for Human Judgment
Grouped from the persona reviews above. Each item is a genuine tradeoff needing a decision before or during implementation.
Theme 1: Scope of
findRankedIdsByFtsafter the refactorFrom: Markus, Felix, Elicit, Sara
findRankedIdsByFtshas two callers:searchDocuments— the paged search path (target of this fix)findIdsForFilter— the bulk-edit "select all" path (intentionally unpaginated, must not change)The acceptance criterion "findRankedIdsByFts removed" in the issue would break the bulk-edit path.
Decision needed: Keep
findRankedIdsByFtsforfindIdsForFilter(rename tofindAllMatchingIdsByFtsfor clarity) and addfindFtsPageRawas a second query? Or replace both with a unified query and rebuild the bulk-edit path to iterate pages? The simplest answer is keep+rename.Theme 2: Where the
totalElementscount comes from when filters AND text are combinedFrom: Leonie, Elicit
The proposed window function
COUNT(*) OVER ()runs inside the native SQL — it counts FTS matches only, before JPA Specification filters (date range, sender, tag) are applied. If a user searchesq=walter&sender=<uuid>, the window function returns 911 (all Walter matches) but the JPA spec filters reduce the visible results to, say, 23.The UI result count chip would show "911 results" while only 23 appear — a serious UX inconsistency.
Decision needed: For combined text+filter queries, use
Page.getTotalElements()fromdocumentRepository.findAll(spec, pageRequest)as the source of truth fortotalElements(as the existing fast path already does for DATE/TITLE/UPLOAD_DATE sorts). Only use the window-function total when there are no active filters. This is likely the correct answer — confirm before implementation.Theme 3:
pg_stat_statementsindocker-compose.yml— base file or prod overlay only?From: Tobias
Enabling
pg_stat_statementsin the basedocker-compose.ymlmeans it is always-on in dev. It adds ~5% query overhead. The alternative is to add it only to adocker-compose.prod.ymloverlay, but that means the verification step ("rows_per_call drops from 911 to 50") cannot be run locally.Decision needed: Enable in the base file (simplest, unblocks F-24 and this issue's verification in one change) or overlay only (lower dev overhead, more operational complexity).
Theme 4:
FtsHit/FtsPagerecord placementFrom: Markus
The issue shows these as nested types inside
DocumentRepository. This leaks value-object semantics into the repository layer.Decision needed: Define them in the
documentpackage root (alongsideDocumentSearchResult,DocumentSearchItem, etc.) so the repository just returnsList<Object[]>and the service owns the record types and the parsing. This is the cleaner split — recommend adopting it, but worth a quick confirmation.Theme 5: ADR requirement
From: Markus
Pushing FTS pagination into SQL is a lasting architectural constraint — it determines forever that SENDER/RECEIVER sorts stay in-memory. The project has an active ADR directory (
docs/adr/).Decision needed: Write a short ADR before implementing? Recommended yes — context + decision + alternatives (cap result set with warning, two-query count) + consequences (SENDER/RECEIVER permanent in-memory) takes 20 minutes and avoids re-litigating the trade-off on the PR.