fix(training): use KURRENT_RECOGNITION label for sender-based block queries #378
Summary
`findManualKurrentBlocksByPerson` and `countManualKurrentBlocksByPerson` previously filtered by `d.scriptType = 'HANDWRITING_KURRENT'`. That field is only populated after OCR runs, which cannot happen before a trained model exists. This creates a chicken-and-egg situation where no training data can ever be exported for a sender.

Both queries now filter on `'KURRENT_RECOGNITION' MEMBER OF d.trainingLabels`, consistent with `findEligibleKurrentBlocks` (the general export query).

Also adds the missing test coverage for the two sender-based queries (previously untested): includes/excludes by label, excludes OCR blocks, excludes wrong sender, and count parity.
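The semantics of the fix can be sketched with a toy model. This is a hypothetical stand-in, not the real entities; only the label and field values come from the PR:

```java
import java.util.List;
import java.util.Set;

// Toy stand-in for Document: before OCR runs, scriptType is null, while the
// human-set trainingLabels enrollment already exists.
record Doc(String id, String scriptType, Set<String> trainingLabels) {}

public class FilterDemo {
    // Old filter: depends on a field that only OCR populates.
    static List<Doc> oldFilter(List<Doc> docs) {
        return docs.stream()
                .filter(d -> "HANDWRITING_KURRENT".equals(d.scriptType()))
                .toList();
    }

    // New filter: depends only on the explicit enrollment label.
    static List<Doc> newFilter(List<Doc> docs) {
        return docs.stream()
                .filter(d -> d.trainingLabels().contains("KURRENT_RECOGNITION"))
                .toList();
    }

    public static void main(String[] args) {
        // One document enrolled for training, but OCR has not run yet.
        List<Doc> docs = List.of(new Doc("letter-1", null, Set.of("KURRENT_RECOGNITION")));
        System.out.println(oldFilter(docs).size()); // 0: the chicken-and-egg dead end
        System.out.println(newFilter(docs).size()); // 1: enrollment alone qualifies it
    }
}
```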
👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
The fix is clean and surgically targeted. Two-line production change, 89 lines of test coverage. That's the right ratio.
What I checked
Naming and intent:
`'KURRENT_RECOGNITION' MEMBER OF d.trainingLabels` reads exactly as intended: it makes the filter criterion explicit rather than inferring it from a runtime-only `scriptType` field. No ambiguity.

Function responsibility: Both `countManualKurrentBlocksByPerson` and `findManualKurrentBlocksByPerson` now match the general `findEligibleKurrentBlocks` filter logic. Consistency in the repository layer is good: one mental model for "what qualifies as a Kurrent document."

TDD evidence: The five new test methods cover the important cases for `findManualKurrentBlocksByPerson`: include with label, exclude without label, exclude OCR source, exclude wrong sender. And `countManualKurrentBlocksByPerson_matchesFindResult` verifies count/list parity. These tests were clearly written to describe real behaviors, not to pad coverage.

Minor suggestion (non-blocking)
The `HashSet` construction in every test body is slightly verbose. The `java.util.` qualification suggests the import is missing. If `HashSet` is available via the wildcard `import org.raddatz.familienarchiv.model.*`, it would be cleaner to either add an explicit import or use a factory helper. Not a blocker: this is a test file and the intent is clear.

Also, `countManualKurrentBlocksByPerson_matchesFindResult` verifies the count equals 2 (the total block count), but does not separately verify the `find` result has the same size. The name "matchesFindResult" implies a parity check: a one-liner `assertThat(count).isEqualTo(blockRepository.findManualKurrentBlocksByPerson(sender.getId()).size())` would make that contract explicit. Again, non-blocking.

Confidence
The production fix is correct and the test suite would have caught the original bug as a regression. No concerns.
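To illustrate the `java.util.HashSet` verbosity noted above, here is a sketch (not the actual test code; only the label value is taken from the PR):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LabelExamples {
    // Shape the review describes: fully qualified because the import is missing.
    static Set<String> fullyQualified() {
        return new java.util.HashSet<>(java.util.List.of("KURRENT_RECOGNITION"));
    }

    // Option 1: an explicit import shortens every construction site.
    static Set<String> withImport() {
        return new HashSet<>(List.of("KURRENT_RECOGNITION"));
    }

    // Option 2: a tiny factory helper keeps test bodies terse while still
    // returning a mutable set (which JPA element collections typically need).
    static Set<String> labels(String... values) {
        return new HashSet<>(List.of(values));
    }
}
```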
🏛️ Markus Keller — Senior Application Architect
Verdict: ✅ Approved
This is a clean bug fix that resolves a genuine chicken-and-egg problem in the training data pipeline. No architectural concerns.
What I examined
Layer boundaries: The fix lives entirely in `TranscriptionBlockRepository`, a JPQL query change. No service boundary is crossed, no new dependency is introduced. The fix does not reach outside the repository's domain.

Query consistency: The two sender-based queries (`countManualKurrentBlocksByPerson`, `findManualKurrentBlocksByPerson`) now align with `findEligibleKurrentBlocks`. All three use `'KURRENT_RECOGNITION' MEMBER OF d.trainingLabels` as the filter criterion. Previously, the sender-based queries used `d.scriptType = 'HANDWRITING_KURRENT'`, a field that is only populated after OCR runs, meaning those queries could never return training candidates for a sender before a model existed. This is the right fix.

Data model integrity: The `trainingLabels` collection (presumably an `@ElementCollection` on `Document`) correctly represents "this document has been enrolled for Kurrent training" as an explicit human-labelled decision, decoupled from the OCR pipeline output. Using it as the filter criterion is semantically superior.

One observation (non-blocking)
The `countManualKurrentBlocksByPerson` and `findManualKurrentBlocksByPerson` queries do not join `DocumentAnnotation` the way `findEligibleKurrentBlocks` does. The eligible-blocks query joins the annotation; the sender-based queries skip that join.
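Since the queries themselves are not quoted here, a hedged JPQL sketch of the asymmetry (join shapes and field names are assumed, not taken from the diff):

```sql
-- Illustrative JPQL only; the real queries may differ.
-- findEligibleKurrentBlocks: fetch-joins the annotation for its callers.
SELECT b FROM TranscriptionBlock b
  JOIN b.document d
  JOIN FETCH b.annotation a
 WHERE 'KURRENT_RECOGNITION' MEMBER OF d.trainingLabels

-- Sender-based queries: the Document join alone is sufficient.
SELECT b FROM TranscriptionBlock b
  JOIN b.document d
 WHERE d.sender.id = :personId
   AND b.source = 'MANUAL'
   AND 'KURRENT_RECOGNITION' MEMBER OF d.trainingLabels
```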
This is not a bug: the annotation join in `findEligibleKurrentBlocks` appears to be a fetch join for the annotation entity. The sender-based queries do not need the annotation data and the join to `Document` is sufficient. Just worth noting for a future reader of this code, since the asymmetry between the three queries might prompt questions.

🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
No security-relevant issues found. This is a pure backend repository query change with no authentication, authorization, or user-input surface.
What I checked
Injection surface: Both changed queries use JPQL with named parameters (`@Param("personId") UUID personId`). UUIDs cannot carry SQL injection payloads. The `MEMBER OF` clause operates on a literal string constant `'KURRENT_RECOGNITION'` that comes from the source code, not user input. Clean.

Data exposure: The queries return `TranscriptionBlock` entities scoped by sender. No new cross-domain data access is introduced. The fix narrows the filter criterion; if anything, this is more conservative than before.

Permission enforcement: The training data export endpoints that call these queries should already be protected by `@RequirePermission`. This PR does not touch the controller or service layer, so I cannot confirm the authorization posture here, but that is not a change introduced by this PR.

Observation (informational)
The `trainingLabels` enrollment mechanism (`KURRENT_RECOGNITION MEMBER OF d.trainingLabels`) is a better security posture than inferring training eligibility from a machine-generated `scriptType` field. The old behaviour could have exported blocks from documents where OCR ran but no human confirmed Kurrent enrollment, creating a training data quality (and potentially a data scope) issue. The fix makes enrollment intent explicit.

🧪 Sara Holt — Senior QA Engineer
Verdict: ✅ Approved
Strong test coverage for a previously untested pair of queries. The new tests are well-structured and the critical behavioural axes are covered.
What I checked
Test infrastructure:
`@DataJpaTest` + `@AutoConfigureTestDatabase(replace = NONE)` + `PostgresContainerConfig` is the correct setup: real PostgreSQL via Testcontainers, not H2. No concerns there.

Coverage of `findManualKurrentBlocksByPerson`:

- `KURRENT_RECOGNITION` label + MANUAL block

Coverage of `countManualKurrentBlocksByPerson`:

Suggestions (non-blocking)
`countManualKurrentBlocksByPerson` is under-tested for exclusion cases. There are four exclusion tests for `find` but only one count test (the happy path). The count test named `_matchesFindResult` would be more useful if it actually called both methods and compared their results, making the parity contract explicit. As written, it only checks that the count returns 2 (the total saved), which passes regardless of whether `find` returns the same 2.

Missing: `countManualKurrentBlocksByPerson` returns 0 when no blocks match. A zero-count test would ensure the query does not throw on empty results and round-trips correctly with the service layer. This is a low-effort gap to close.

`@BeforeEach` creates shared documents that new sender-based tests don't use. The shared setup creates `kurrentDocId` and `typewriterDocId` in `@BeforeEach`, but none of the five new tests reference those IDs. Each new test creates its own `Person` and `Document` inline. This is fine for isolation, but it means the shared setup is silent noise for the new tests. This is minor and common in growing test classes; worth noting if the class grows further.

Verdict
The important cases are covered. The tests use proper infrastructure (real Postgres, real Flyway). The suggestions above are improvements, not blockers.
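The two suggested tests could look roughly like this. This is a sketch that assumes the test class's existing repositories and fixtures; `sender`, `personRepository`, and the `Person` constructor are placeholders, not names from the diff:

```java
@Test
void countManualKurrentBlocksByPerson_matchesFindResult() {
    long count = blockRepository.countManualKurrentBlocksByPerson(sender.getId());
    // Explicit parity: compare against find, not against a literal 2.
    assertThat(count)
        .isEqualTo(blockRepository.findManualKurrentBlocksByPerson(sender.getId()).size());
}

@Test
void countManualKurrentBlocksByPerson_returnsZeroWhenNothingMatches() {
    // Hypothetical sender with no blocks at all.
    Person stranger = personRepository.save(new Person("Unrelated", "Sender"));
    assertThat(blockRepository.countManualKurrentBlocksByPerson(stranger.getId())).isZero();
}
```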
⚙️ Tobias Wendt — DevOps & Platform Engineer
Verdict: ✅ Approved
No infrastructure or CI changes in this PR. This is a backend Java fix with test coverage — nothing for me to flag from a platform perspective.
What I checked
- No changes to `docker-compose.yml`, CI workflow files, or any infrastructure configuration
- No changes to `pom.xml`
- The new tests use `@DataJpaTest` + `PostgresContainerConfig` (Testcontainers); these already run in CI and will pick up the new tests automatically with no pipeline changes needed
The new tests increase the `@DataJpaTest` suite by 5 methods. Testcontainers starts one PostgreSQL container per test class (via the shared container pattern), so the CI overhead is a few additional JPQL executions within an already-running container: negligible.

LGTM from an infrastructure standpoint.
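For readers unfamiliar with the shared container pattern, one common shape is sketched below. This assumes Spring Boot 3.1+ `@ServiceConnection`; the project's actual `PostgresContainerConfig` may be written differently:

```java
import org.springframework.boot.test.context.TestConfiguration;
import org.springframework.boot.testcontainers.service.connection.ServiceConnection;
import org.springframework.context.annotation.Bean;
import org.testcontainers.containers.PostgreSQLContainer;

@TestConfiguration(proxyBeanMethods = false)
class PostgresContainerConfig {

    // Spring starts this container and caches the test application context,
    // so test classes importing this config share one running database.
    @Bean
    @ServiceConnection
    PostgreSQLContainer<?> postgres() {
        return new PostgreSQLContainer<>("postgres:16-alpine");
    }
}
```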
🎨 Leonie Voss — UI/UX Design Lead
Verdict: ✅ Approved
This PR is a backend repository fix with no frontend, UI, or UX changes. No design review concerns.
What I checked
The fix resolves a data pipeline issue in the OCR training flow. The user-facing impact — if any — would be that the "sender-based training data export" now returns correct results, which is an invisible improvement to end users. The training admin UI (if one exists) would continue functioning as before, now with accurate data.
Nothing to flag from my review domain.
📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
The PR description accurately describes both the root cause and the fix. The requirements alignment is solid.
What I verified
Root cause documented: The PR body clearly states the chicken-and-egg problem:
`scriptType` is only populated after OCR runs, but OCR cannot run before a trained model exists. The consequence (no training data can ever be exported for a sender) is correctly identified.

Fix is traceable to a requirement: The stated requirement is that sender-based block queries should behave consistently with the general `findEligibleKurrentBlocks` export query. Both now use `KURRENT_RECOGNITION MEMBER OF d.trainingLabels`. Consistency is verifiable.

Acceptance criteria coverage: The five new tests map to the key behavioural requirements:
One open question (non-blocking)
The PR fixes the query filter for sender-based block retrieval, but it is not stated whether this bug caused silent data loss in production (i.e., were training exports silently empty for all senders?) or whether the affected endpoint was simply not yet exercised. Understanding the blast radius would be useful context in the PR description for future incident triage. This is documentation feedback, not a code concern.