fix(training): use KURRENT_RECOGNITION label for sender-based block queries #378

Merged
marcel merged 2 commits from fix/sender-training-use-kurrent-label into main 2026-05-04 15:26:05 +02:00
Owner

Summary

findManualKurrentBlocksByPerson and countManualKurrentBlocksByPerson previously filtered by d.scriptType = 'HANDWRITING_KURRENT'. That field is only populated after OCR runs — which cannot happen before a trained model exists. This creates a chicken-and-egg situation where no training data can ever be exported for a sender.

Both queries now filter on 'KURRENT_RECOGNITION' MEMBER OF d.trainingLabels, consistent with findEligibleKurrentBlocks (the general export query).

Also adds the missing test coverage for the two sender-based queries (previously untested): includes/excludes by label, excludes OCR blocks, excludes wrong sender, and count parity.

## Summary `findManualKurrentBlocksByPerson` and `countManualKurrentBlocksByPerson` previously filtered by `d.scriptType = 'HANDWRITING_KURRENT'`. That field is only populated after OCR runs — which cannot happen before a trained model exists. This creates a chicken-and-egg situation where no training data can ever be exported for a sender. Both queries now filter on `'KURRENT_RECOGNITION' MEMBER OF d.trainingLabels`, consistent with `findEligibleKurrentBlocks` (the general export query). Also adds the missing test coverage for the two sender-based queries (previously untested): includes/excludes by label, excludes OCR blocks, excludes wrong sender, and count parity.
marcel added 1 commit 2026-04-30 10:18:45 +02:00
fix(training): use KURRENT_RECOGNITION label for sender-based block queries
Some checks failed
CI / Unit & Component Tests (push) Failing after 3m32s
CI / OCR Service Tests (push) Successful in 1m2s
CI / Backend Unit Tests (push) Failing after 3m20s
CI / Unit & Component Tests (pull_request) Failing after 3m31s
CI / OCR Service Tests (pull_request) Successful in 38s
CI / Backend Unit Tests (pull_request) Failing after 3m4s
c905b81fd3
scriptType is only set after OCR runs, which can't happen before we have
a trained model. Both sender-based queries now filter on the training label
instead, consistent with findEligibleKurrentBlocks.

Also adds missing test coverage for findManualKurrentBlocksByPerson and
countManualKurrentBlocksByPerson (4 cases + count parity check).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: Approved

The fix is clean and surgically targeted. Two-line production change, 89 lines of test coverage. That's the right ratio.

What I checked

Naming and intent: 'KURRENT_RECOGNITION' MEMBER OF d.trainingLabels reads exactly as intended — it makes the filter criterion explicit rather than inferring it from a runtime-only scriptType field. No ambiguity.

Function responsibility: Both countManualKurrentBlocksByPerson and findManualKurrentBlocksByPerson now match the general findEligibleKurrentBlocks filter logic. Consistency in the repository layer is good — one mental model for "what qualifies as a Kurrent document."

TDD evidence: The five new test methods cover the important cases for findManualKurrentBlocksByPerson: include with label, exclude without label, exclude OCR source, exclude wrong sender. And countManualKurrentBlocksByPerson_matchesFindResult verifies count/list parity. These tests were clearly written to describe real behaviors, not to pad coverage.

Minor suggestion (non-blocking)

The HashSet construction in every test body is slightly verbose:

.trainingLabels(new java.util.HashSet<>(Set.of(TrainingLabel.KURRENT_RECOGNITION)))

The java.util. qualification suggests the import is missing. If HashSet is available via the wildcard import org.raddatz.familienarchiv.model.*, it would be cleaner to either add an explicit import or use a factory helper. Not a blocker — this is a test file and the intent is clear.

Also, countManualKurrentBlocksByPerson_matchesFindResult verifies the count equals 2 (the total block count), but does not separately verify the find result has the same size. The name "matchesFindResult" implies a parity check — a one-liner assertThat(count).isEqualTo(blockRepository.findManualKurrentBlocksByPerson(sender.getId()).size()) would make that contract explicit. Again, non-blocking.

Confidence

The production fix is correct and the test suite would have caught the original bug as a regression. No concerns.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ✅ Approved** The fix is clean and surgically targeted. Two-line production change, 89 lines of test coverage. That's the right ratio. ### What I checked **Naming and intent**: `'KURRENT_RECOGNITION' MEMBER OF d.trainingLabels` reads exactly as intended — it makes the filter criterion explicit rather than inferring it from a runtime-only `scriptType` field. No ambiguity. **Function responsibility**: Both `countManualKurrentBlocksByPerson` and `findManualKurrentBlocksByPerson` now match the general `findEligibleKurrentBlocks` filter logic. Consistency in the repository layer is good — one mental model for "what qualifies as a Kurrent document." **TDD evidence**: The five new test methods cover the important cases for `findManualKurrentBlocksByPerson`: include with label, exclude without label, exclude OCR source, exclude wrong sender. And `countManualKurrentBlocksByPerson_matchesFindResult` verifies count/list parity. These tests were clearly written to describe real behaviors, not to pad coverage. ### Minor suggestion (non-blocking) The `HashSet` construction in every test body is slightly verbose: ```java .trainingLabels(new java.util.HashSet<>(Set.of(TrainingLabel.KURRENT_RECOGNITION))) ``` The `java.util.` qualification suggests the import is missing. If `HashSet` is available via the wildcard `import org.raddatz.familienarchiv.model.*`, it would be cleaner to either add an explicit import or use a factory helper. Not a blocker — this is a test file and the intent is clear. Also, `countManualKurrentBlocksByPerson_matchesFindResult` verifies the count equals 2 (the total block count), but does not separately verify the `find` result has the same size. The name "matchesFindResult" implies a parity check — a one-liner `assertThat(count).isEqualTo(blockRepository.findManualKurrentBlocksByPerson(sender.getId()).size())` would make that contract explicit. Again, non-blocking. ### Confidence The production fix is correct and the test suite would have caught the original bug as a regression. No concerns.
Author
Owner

🏛️ Markus Keller — Senior Application Architect

Verdict: Approved

This is a clean bug fix that resolves a genuine chicken-and-egg problem in the training data pipeline. No architectural concerns.

What I examined

Layer boundaries: The fix lives entirely in TranscriptionBlockRepository — a JPQL query change. No service boundary is crossed, no new dependency is introduced. The fix does not reach outside the repository's domain.

Query consistency: The two sender-based queries (countManualKurrentBlocksByPerson, findManualKurrentBlocksByPerson) now align with findEligibleKurrentBlocks. All three use 'KURRENT_RECOGNITION' MEMBER OF d.trainingLabels as the filter criterion. Previously, the sender-based queries used d.scriptType = 'HANDWRITING_KURRENT' — a field that is only populated after OCR runs, meaning those queries could never return training candidates for a sender before a model existed. This is the right fix.

Data model integrity: The trainingLabels collection (presumably an @ElementCollection on Document) correctly represents "this document has been enrolled for Kurrent training" as an explicit human-labelled decision, decoupled from the OCR pipeline output. Using it as the filter criterion is semantically superior.

One observation (non-blocking)

The countManualKurrentBlocksByPerson and findManualKurrentBlocksByPerson queries do not join DocumentAnnotation the way findEligibleKurrentBlocks does. The eligible-blocks query has:

JOIN DocumentAnnotation a ON a.id = b.annotationId
JOIN Document d ON d.id = b.documentId

Whereas the sender-based queries skip the annotation join:

JOIN Document d ON d.id = b.documentId

This is not a bug — the annotation join in findEligibleKurrentBlocks appears to be a fetch join for the annotation entity. The sender-based queries do not need the annotation data and the join to Document is sufficient. Just worth noting for a future reader of this code, since the asymmetry between the three queries might prompt questions.

## 🏛️ Markus Keller — Senior Application Architect **Verdict: ✅ Approved** This is a clean bug fix that resolves a genuine chicken-and-egg problem in the training data pipeline. No architectural concerns. ### What I examined **Layer boundaries**: The fix lives entirely in `TranscriptionBlockRepository` — a JPQL query change. No service boundary is crossed, no new dependency is introduced. The fix does not reach outside the repository's domain. **Query consistency**: The two sender-based queries (`countManualKurrentBlocksByPerson`, `findManualKurrentBlocksByPerson`) now align with `findEligibleKurrentBlocks`. All three use `'KURRENT_RECOGNITION' MEMBER OF d.trainingLabels` as the filter criterion. Previously, the sender-based queries used `d.scriptType = 'HANDWRITING_KURRENT'` — a field that is only populated after OCR runs, meaning those queries could never return training candidates for a sender before a model existed. This is the right fix. **Data model integrity**: The `trainingLabels` collection (presumably an `@ElementCollection` on `Document`) correctly represents "this document has been enrolled for Kurrent training" as an explicit human-labelled decision, decoupled from the OCR pipeline output. Using it as the filter criterion is semantically superior. ### One observation (non-blocking) The `countManualKurrentBlocksByPerson` and `findManualKurrentBlocksByPerson` queries do not join `DocumentAnnotation` the way `findEligibleKurrentBlocks` does. The eligible-blocks query has: ```jpql JOIN DocumentAnnotation a ON a.id = b.annotationId JOIN Document d ON d.id = b.documentId ``` Whereas the sender-based queries skip the annotation join: ```jpql JOIN Document d ON d.id = b.documentId ``` This is not a bug — the annotation join in `findEligibleKurrentBlocks` appears to be a fetch join for the annotation entity. The sender-based queries do not need the annotation data and the join to `Document` is sufficient. Just worth noting for a future reader of this code, since the asymmetry between the three queries might prompt questions.
Author
Owner

🔒 Nora "NullX" Steiner — Application Security Engineer

Verdict: Approved

No security-relevant issues found. This is a pure backend repository query change with no authentication, authorization, or user-input surface.

What I checked

Injection surface: Both changed queries use JPQL with named parameters (@Param("personId") UUID personId). UUIDs cannot carry SQL injection payloads. The MEMBER OF clause operates on a literal string constant 'KURRENT_RECOGNITION' that comes from the source code, not user input. Clean.

Data exposure: The queries return TranscriptionBlock entities scoped by sender. No new data cross-domain access is introduced. The fix narrows the filter criterion — if anything, this is more conservative than before.

Permission enforcement: The training data export endpoints that call these queries should already be protected by @RequirePermission. This PR does not touch the controller or service layer, so I cannot confirm the authorization posture here — but that is not a change introduced by this PR.

Observation (informational)

The trainingLabels enrollment mechanism (KURRENT_RECOGNITION MEMBER OF d.trainingLabels) is a better security posture than inferring training eligibility from a machine-generated scriptType field. The old behaviour could have exported blocks from documents where OCR ran but no human confirmed Kurrent enrollment, creating a training data quality (and potentially a data scope) issue. The fix makes enrollment intent explicit.

## 🔒 Nora "NullX" Steiner — Application Security Engineer **Verdict: ✅ Approved** No security-relevant issues found. This is a pure backend repository query change with no authentication, authorization, or user-input surface. ### What I checked **Injection surface**: Both changed queries use JPQL with named parameters (`@Param("personId") UUID personId`). UUIDs cannot carry SQL injection payloads. The `MEMBER OF` clause operates on a literal string constant `'KURRENT_RECOGNITION'` that comes from the source code, not user input. Clean. **Data exposure**: The queries return `TranscriptionBlock` entities scoped by sender. No new data cross-domain access is introduced. The fix narrows the filter criterion — if anything, this is more conservative than before. **Permission enforcement**: The training data export endpoints that call these queries should already be protected by `@RequirePermission`. This PR does not touch the controller or service layer, so I cannot confirm the authorization posture here — but that is not a change introduced by this PR. ### Observation (informational) The `trainingLabels` enrollment mechanism (`KURRENT_RECOGNITION MEMBER OF d.trainingLabels`) is a better security posture than inferring training eligibility from a machine-generated `scriptType` field. The old behaviour could have exported blocks from documents where OCR ran but no human confirmed Kurrent enrollment, creating a training data quality (and potentially a data scope) issue. The fix makes enrollment intent explicit.
Author
Owner

🧪 Sara Holt — Senior QA Engineer

Verdict: Approved

Strong test coverage for a previously untested pair of queries. The new tests are well-structured and the critical behavioural axes are covered.

What I checked

Test infrastructure: @DataJpaTest + @AutoConfigureTestDatabase(replace = NONE) + PostgresContainerConfig is the correct setup — real PostgreSQL via Testcontainers, not H2. No concerns there.

Coverage of findManualKurrentBlocksByPerson:

  • Include: document with KURRENT_RECOGNITION label + MANUAL block
  • Exclude: document without label
  • Exclude: OCR block (source filter)
  • Exclude: different sender

Coverage of countManualKurrentBlocksByPerson:

  • Count parity with two blocks

Suggestions (non-blocking)

  1. countManualKurrentBlocksByPerson is under-tested for exclusion cases. There are four exclusion tests for find but only one count test (the happy path). The count test named _matchesFindResult would be more useful if it actually called both methods and compared their results, making the parity contract explicit:

    long count = blockRepository.countManualKurrentBlocksByPerson(sender.getId());
    List<TranscriptionBlock> found = blockRepository.findManualKurrentBlocksByPerson(sender.getId());
    assertThat(count).isEqualTo(found.size());
    

    As written, it only checks that the count returns 2 (the total saved), which passes regardless of whether find returns the same 2.

  2. Missing: countManualKurrentBlocksByPerson returns 0 when no blocks match. A zero-count test would ensure the query does not throw on empty results and round-trips correctly with the service layer. This is a low-effort gap to close.

  3. @BeforeEach creates shared documents that new sender-based tests don't use. The shared setup creates kurrentDocId and typewriterDocId in @BeforeEach, but none of the five new tests reference those IDs. Each new test creates its own Person and Document inline. This is fine for isolation, but it means the shared setup is silent noise for the new tests. This is minor and common in growing test classes — worth noting if the class grows further.

Verdict

The important cases are covered. The tests use proper infrastructure (real Postgres, real Flyway). The suggestions above are improvements, not blockers.

## 🧪 Sara Holt — Senior QA Engineer **Verdict: ✅ Approved** Strong test coverage for a previously untested pair of queries. The new tests are well-structured and the critical behavioural axes are covered. ### What I checked **Test infrastructure**: `@DataJpaTest` + `@AutoConfigureTestDatabase(replace = NONE)` + `PostgresContainerConfig` is the correct setup — real PostgreSQL via Testcontainers, not H2. No concerns there. **Coverage of `findManualKurrentBlocksByPerson`**: - ✅ Include: document with `KURRENT_RECOGNITION` label + MANUAL block - ✅ Exclude: document without label - ✅ Exclude: OCR block (source filter) - ✅ Exclude: different sender **Coverage of `countManualKurrentBlocksByPerson`**: - ✅ Count parity with two blocks ### Suggestions (non-blocking) 1. **`countManualKurrentBlocksByPerson` is under-tested for exclusion cases.** There are four exclusion tests for `find` but only one count test (the happy path). The count test named `_matchesFindResult` would be more useful if it actually called both methods and compared their results, making the parity contract explicit: ```java long count = blockRepository.countManualKurrentBlocksByPerson(sender.getId()); List<TranscriptionBlock> found = blockRepository.findManualKurrentBlocksByPerson(sender.getId()); assertThat(count).isEqualTo(found.size()); ``` As written, it only checks that the count returns 2 (the total saved), which passes regardless of whether `find` returns the same 2. 2. **Missing: `countManualKurrentBlocksByPerson` returns 0 when no blocks match.** A zero-count test would ensure the query does not throw on empty results and round-trips correctly with the service layer. This is a low-effort gap to close. 3. **`@BeforeEach` creates shared documents that new sender-based tests don't use.** The shared setup creates `kurrentDocId` and `typewriterDocId` in `@BeforeEach`, but none of the five new tests reference those IDs. Each new test creates its own `Person` and `Document` inline. This is fine for isolation, but it means the shared setup is silent noise for the new tests. This is minor and common in growing test classes — worth noting if the class grows further. ### Verdict The important cases are covered. The tests use proper infrastructure (real Postgres, real Flyway). The suggestions above are improvements, not blockers.
Author
Owner

⚙️ Tobias Wendt — DevOps & Platform Engineer

Verdict: Approved

No infrastructure or CI changes in this PR. This is a backend Java fix with test coverage — nothing for me to flag from a platform perspective.

What I checked

  • No changes to docker-compose.yml, CI workflow files, or any infrastructure configuration
  • No new dependencies added to pom.xml
  • The test setup uses @DataJpaTest + PostgresContainerConfig (Testcontainers) — these already run in CI and will pick up the new tests automatically with no pipeline changes needed
  • No new environment variables, secrets, or external service calls introduced

Observation

The new tests increase the @DataJpaTest suite by 5 methods. Testcontainers starts one PostgreSQL container per test class (via the shared container pattern), so the CI overhead is a few additional JPQL executions within an already-running container — negligible.

LGTM from an infrastructure standpoint.

## ⚙️ Tobias Wendt — DevOps & Platform Engineer **Verdict: ✅ Approved** No infrastructure or CI changes in this PR. This is a backend Java fix with test coverage — nothing for me to flag from a platform perspective. ### What I checked - No changes to `docker-compose.yml`, CI workflow files, or any infrastructure configuration - No new dependencies added to `pom.xml` - The test setup uses `@DataJpaTest` + `PostgresContainerConfig` (Testcontainers) — these already run in CI and will pick up the new tests automatically with no pipeline changes needed - No new environment variables, secrets, or external service calls introduced ### Observation The new tests increase the `@DataJpaTest` suite by 5 methods. Testcontainers starts one PostgreSQL container per test class (via the shared container pattern), so the CI overhead is a few additional JPQL executions within an already-running container — negligible. LGTM from an infrastructure standpoint.
Author
Owner

🎨 Leonie Voss — UI/UX Design Lead

Verdict: Approved

This PR is a backend repository fix with no frontend, UI, or UX changes. No design review concerns.

What I checked

  • No Svelte components changed
  • No routing, page layout, or interaction patterns modified
  • No i18n strings added or removed
  • No changes to Tailwind classes, brand tokens, or accessibility markup

The fix resolves a data pipeline issue in the OCR training flow. The user-facing impact — if any — would be that the "sender-based training data export" now returns correct results, which is an invisible improvement to end users. The training admin UI (if one exists) would continue functioning as before, now with accurate data.

Nothing to flag from my review domain.

## 🎨 Leonie Voss — UI/UX Design Lead **Verdict: ✅ Approved** This PR is a backend repository fix with no frontend, UI, or UX changes. No design review concerns. ### What I checked - No Svelte components changed - No routing, page layout, or interaction patterns modified - No i18n strings added or removed - No changes to Tailwind classes, brand tokens, or accessibility markup The fix resolves a data pipeline issue in the OCR training flow. The user-facing impact — if any — would be that the "sender-based training data export" now returns correct results, which is an invisible improvement to end users. The training admin UI (if one exists) would continue functioning as before, now with accurate data. Nothing to flag from my review domain.
Author
Owner

📋 Elicit — Requirements Engineer

Verdict: Approved

The PR description accurately describes both the root cause and the fix. The requirements alignment is solid.

What I verified

Root cause documented: The PR body clearly states the chicken-and-egg problem: scriptType is only populated after OCR runs, but OCR cannot run before a trained model exists. The consequence — no training data can ever be exported for a sender — is correctly identified.

Fix is traceable to a requirement: The stated requirement is that sender-based block queries should behave consistently with the general findEligibleKurrentBlocks export query. Both now use KURRENT_RECOGNITION MEMBER OF d.trainingLabels. Consistency is verifiable.

Acceptance criteria coverage: The five new tests map to the key behavioural requirements:

  • Include when enrolled →
  • Exclude when not enrolled →
  • Exclude non-manual sources →
  • Exclude wrong sender →
  • Count parity →

One open question (non-blocking)

The PR fixes the query filter for sender-based block retrieval, but it is not stated whether this bug caused silent data loss in production (i.e., were training exports silently empty for all senders?) or whether the affected endpoint was simply not yet exercised. Understanding the blast radius would be useful context in the PR description for future incident triage. This is documentation feedback, not a code concern.

## 📋 Elicit — Requirements Engineer **Verdict: ✅ Approved** The PR description accurately describes both the root cause and the fix. The requirements alignment is solid. ### What I verified **Root cause documented**: The PR body clearly states the chicken-and-egg problem: `scriptType` is only populated after OCR runs, but OCR cannot run before a trained model exists. The consequence — no training data can ever be exported for a sender — is correctly identified. **Fix is traceable to a requirement**: The stated requirement is that sender-based block queries should behave consistently with the general `findEligibleKurrentBlocks` export query. Both now use `KURRENT_RECOGNITION MEMBER OF d.trainingLabels`. Consistency is verifiable. **Acceptance criteria coverage**: The five new tests map to the key behavioural requirements: - Include when enrolled → ✅ - Exclude when not enrolled → ✅ - Exclude non-manual sources → ✅ - Exclude wrong sender → ✅ - Count parity → ✅ ### One open question (non-blocking) The PR fixes the query filter for sender-based block retrieval, but it is not stated whether this bug caused silent data loss in production (i.e., were training exports silently empty for all senders?) or whether the affected endpoint was simply not yet exercised. Understanding the blast radius would be useful context in the PR description for future incident triage. This is documentation feedback, not a code concern.
marcel added 1 commit 2026-05-04 15:13:21 +02:00
test(training): strengthen TrainingBlockQueryTest assertions
Some checks failed
CI / Backend Unit Tests (push) Failing after 3m5s
CI / Unit & Component Tests (pull_request) Failing after 3m22s
CI / OCR Service Tests (pull_request) Successful in 40s
CI / Backend Unit Tests (pull_request) Failing after 3m16s
CI / Unit & Component Tests (push) Failing after 4m2s
CI / OCR Service Tests (push) Successful in 37s
6399321d0e
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel merged commit 32e4e30e40 into main 2026-05-04 15:26:05 +02:00
marcel deleted branch fix/sender-training-use-kurrent-label 2026-05-04 15:26:07 +02:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#378