Drops the needsExpert / needs_expert flag end-to-end: DB migration
(V37, never applied), Document entity field, PATCH endpoint, service
method, DTO field, all three queue queries, ExpertBadge component,
i18n key, generated API types, and test fixture.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
V36 (add_index_transcription_blocks_document_id) was applied to the dev
database during a previous local session but never committed to git.
Flyway checksum mismatch prevented the backend from starting.
- V36__add_index_transcription_blocks_document_id.sql: restored from the
index that already exists in the database (idx_transcription_blocks_document_id)
- V36__add_needs_expert_to_documents.sql → V37__add_needs_expert_to_documents.sql
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a summary_snippet column to findEnrichmentData using ts_headline on
documents.summary, only when the summary's tsvector matches the query.
Expose it via SearchMatchData.summarySnippet / summaryOffsets and render
a "Zusammenfassung" / "Summary" / "Resumen" labelled row in the document
list — identical treatment to the transcription snippet row.
Fixes the case where a document appeared in search results with no
visible match explanation (e.g. searching "frucht" found a document
whose summary mentioned "Früchte").
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace websearch_to_tsquery with a CROSS JOIN LATERAL subquery that
appends :* to each lexeme so prefix matches work (e.g. "furchtb" finds
"furchtbar"). websearch_to_tsquery still handles the safe tokenisation
of user input (stop words, special chars, operators); regexp_replace
then adds :* before to_tsquery re-parses the result.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- SearchMatchData gains a 6th field snippetOffsets: List<MatchOffset> so the frontend
can render highlighted terms inside the transcription snippet without {#html}.
- DocumentRepository.findEnrichmentData now calls ts_headline() with chr(1)/chr(2)
sentinels instead of returning raw block text; parseHighlight() strips the sentinels
and produces clean text + MatchOffset list in one pass.
- DocumentService exposes ParsedHighlight and parseHighlight() as public so they can be
called from cross-package integration tests.
- All related tests updated to the new 6-argument SearchMatchData constructor and
to call parseHighlight() for asserting the snippet clean text and offsets.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DocumentService.searchDocuments now returns DocumentSearchResult with matchData
populated from findEnrichmentData. Title highlights are parsed from chr(1)/chr(2)
delimiters into MatchOffset lists; transcription snippet and sender/receiver/tag
match flags are extracted from the same native SQL row.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- should_find_document_by_sender_name — symmetric with existing receiver test
- fts_combined_with_status_filter_excludes_non_matching_status — verifies
hasIds(rankedIds).and(hasStatus(...)) two-phase search works together
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ids.indexOf() scans the full list for each document, giving O(n²) total.
Build a Map<UUID, Integer> once at O(n) and use getOrDefault at O(1) per
document. Behavior is identical; existing tests remain green.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a user explicitly selects DATE sort with a text query active, the
previous code treated it identically to RELEVANCE, silently discarding
the user's sort choice. Remove DATE from the useRankOrder condition so
that explicit DATE sort always goes through the standard JPA sort path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fires the BEFORE UPDATE trigger for every documents row, which recomputes
the tsvector from all currently-linked metadata, blocks, receivers, and tags.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- V34 migration: adds search_vector tsvector column with GIN index
- BEFORE INSERT/UPDATE trigger on documents rebuilds vector from title (A),
summary + transcription_blocks.text (B), sender/receiver names (C),
tag names + location (D) using german FTS config
- AFTER triggers on transcription_blocks, document_receivers, document_tags
touch the parent document row to re-fire the BEFORE UPDATE trigger
- DocumentRepository.findRankedIdsByFts() native query using websearch_to_tsquery
- DocumentFtsTest: 12 integration tests covering stemming, trigger sync,
ranking, stop words, malformed input, receiver and tag search
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Pin to eclipse-temurin:21.0.10_7-{jdk,jre}-noble for reproducible builds
- Switch -DskipTests to -Dmaven.test.skip=true: skips test compilation entirely,
not just execution — faster and avoids build failures from test-only missing classes
- Add comment on COPY *.jar explaining why the glob is safe (Spring Boot renames
the pre-repackage artifact to .jar.original, leaving only one .jar in target/)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents 111MB of compiled output from being sent to the BuildKit daemon
on cold builds. Only .mvn/, mvnw, pom.xml, and src/ are needed by the
three COPY instructions in the Dockerfile.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace runtime mvn spring-boot:run with a proper multi-stage build:
- Stage 1 (builder): compiles JAR with BuildKit cache mount for ~/.m2
- Stage 2 (runtime): eclipse-temurin:21-jre with only the JAR
Removes the backend source volume mount and maven_cache named volume.
Deploy with: docker compose up -d --build
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
setCer() was called for recognition training but not for segmentation.
The OCR service now returns cer = 1 - accuracy for segtrain; persist it
so the admin panel can display Fehlerrate for both training types.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removed duplicate import of org.mockito.ArgumentMatchers.eq from
DocumentControllerTest (lines 32+35). Added @ApiResponse(responseCode="204")
to patchTrainingLabel so the generated OpenAPI spec matches the actual
NoContent response the controller returns.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OcrTrainingService.triggerTraining() and triggerSegTraining() held a DB
connection open for the entire ketos training run (potentially minutes),
risking connection pool exhaustion. Replaced class-level @Transactional
with TransactionTemplate for narrow DB writes: guard+create and
result-record each run in their own short transaction; the HTTP call to
the OCR service runs between them with no open connection.
Also replaces blockRepository.findAll().size() with blockRepository.count()
in getTrainingInfo() to avoid loading every block into heap on each poll.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
V23 introduced a JSONB check constraint (chk_annotation_polygon_quad)
requiring polygon arrays to have exactly 4 points. V30 introduced a
partial unique index preventing two concurrent RUNNING training runs.
These are DB-level invariants that unit tests cannot verify — five
Testcontainers tests now assert they are correctly applied by Flyway
and enforced by PostgreSQL.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Training reloads the Kraken model in-process on the Python service.
The DB-level RUNNING constraint prevents concurrent API calls but
cannot protect against multi-replica deployments. Added explicit
comments in docker-compose.yml and OcrTrainingService to prevent
accidental horizontal scaling. See ADR-001.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A 100-page document at ~10 s/page takes ~17 min on CPU-only hardware,
which could cause the presigned URL to expire mid-OCR job. 1 hour gives
ample headroom for any realistic document size in this archive.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cascade-delete commit (5a5a8b6) added blockRepository.deleteByAnnotationId()
to AnnotationService.deleteAnnotation(), but the test class was not updated to
mock TranscriptionBlockRepository. Mockito injected null, causing deleteAnnotation_succeeds_whenOwner
to throw NPE. Adds the mock, verifies the cascade call, and adds an inOrder test
asserting the block is deleted before the annotation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The DELETE endpoint was returning 500 due to a FK constraint violation.
`deleteAnnotation` now calls `blockRepository.deleteByAnnotationId()`
before removing the annotation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kraken 7 removed support for the legacy `path` format (image + .gt.txt
pairs) in VGSLRecognitionDataModule despite the CLI still advertising it.
Switching to PAGE XML (-f page) format which is the supported standard.
- Java export now writes .xml alongside .png (PAGE XML with TextLine,
Baseline at 75% height, and Unicode transcription)
- XML special characters in transcription text are escaped (& < >)
- Python trainer globs *.xml and passes -f page to ketos train
- Regenerated frontend API types to include cer/loss/accuracy/epochs on
OcrTrainingRun (were missing, causing empty CER column in history)
- Updated and extended TrainingDataExportServiceTest
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After each training run, the Character Error Rate (CER = 1 - accuracy),
loss, accuracy, and epoch count are now stored on the OcrTrainingRun
record and shown in the training history table.
Also adds the missing POST /api/ocr/segtrain endpoint and the
triggerSegTraining service method so the segmentation training card
can actually trigger training.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously all MANUAL blocks counted as eligible training data, even ones
where text was filled in by guided OCR but never explicitly reviewed. This
caused segmentation and recognition counts to always match.
Now only reviewed=true blocks qualify for recognition training, so the
counts properly reflect: segments = all drawn annotation boxes,
checked text = only boxes where the user has verified the transcription.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The findSegmentationBlocks query was filtering out blocks with non-empty
text. Segmentation training only needs annotation geometry (polygon/bbox),
not transcription text — so any MANUAL block on a KURRENT_SEGMENTATION
document should count, regardless of whether it has text.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Historical letter lines often intersect, so the system must support
overlapping annotation regions. Removed the overlap guard from
createAnnotation(), deleted ErrorCode.ANNOTATION_OVERLAP, and cleaned
up all tests and frontend error mappings that referenced it.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>