Commit Graph

238 Commits

Author SHA1 Message Date
Marcel
1177de2f95 feat(#240): backend for Mission Control Strip — queue endpoints + expert flag
Adds the server-side foundation for the dashboard transcription widget:

- V36 migration: needs_expert BOOLEAN NOT NULL DEFAULT FALSE on documents
- Document entity: needsExpert field (@Schema required)
- DocumentRepository: 4 native queries — segmentation queue, transcription
  queue, ready-to-read queue (seeded weekly shuffle sort), weekly pulse stats
- TranscriptionQueueService: maps Object[] rows to typed DTOs, handles
  PostgreSQL type variations (UUID/String, Date/LocalDate, Number/BigDecimal)
- TranscriptionQueueController: GET /api/transcription/{segmentation-queue,
  transcription-queue, ready-to-read, weekly-stats} — all guarded by READ_ALL
- DocumentService + DocumentController: PATCH /api/documents/{id}/needs-expert
  toggles the expert flag (WRITE_ALL required)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 22:59:17 +02:00
Marcel
305f95a572 test(search): add sender name FTS coverage and combined filter test
Some checks failed
CI / Unit & Component Tests (push) Failing after 3s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1m57s
CI / Backend Unit Tests (pull_request) Failing after 3m0s
- should_find_document_by_sender_name — symmetric with existing receiver test
- fts_combined_with_status_filter_excludes_non_matching_status — verifies
  hasIds(rankedIds).and(hasStatus(...)) two-phase search works together

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 11:35:30 +02:00
Marcel
43595aeb8a refactor(search): replace O(n²) indexOf with HashMap for rank ordering
ids.indexOf() scans the full list for each document, giving O(n²) total.
Build a Map<UUID, Integer> once at O(n) and use getOrDefault at O(1) per
document. Behavior is identical; existing tests remain green.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 11:35:30 +02:00
Marcel
947d8aeb6c fix(search): respect DATE sort when text is present — do not override with relevance
When a user explicitly selects DATE sort with a text query active, the
previous code treated it identically to RELEVANCE, silently discarding
the user's sort choice. Remove DATE from the useRankOrder condition so
that explicit DATE sort always goes through the standard JPA sort path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 11:35:30 +02:00
Marcel
7ec3e6170d feat(fts): backfill search_vector for all existing documents (V35)
Fires the BEFORE UPDATE trigger for every documents row, which recomputes
the tsvector from all currently-linked metadata, blocks, receivers, and tags.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 11:35:30 +02:00
Marcel
7d456d8e8b feat(fts): replace ILIKE hasText with FTS two-phase search and RELEVANCE sort
- DocumentSort: add RELEVANCE enum value
- DocumentSpecifications: remove hasText() ILIKE, add hasIds(List<UUID>)
  for FTS-pre-filtered ID sets
- DocumentService.searchDocuments(): FTS two-phase path — findRankedIdsByFts()
  returns ranked UUIDs, hasIds() narrows subsequent Specification query,
  in-memory re-sort preserves rank order; RELEVANCE is the default when
  text is present and no explicit non-relevance sort is requested
- DocumentSpecificationsTest: remove hasText() tests (Specification removed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 11:35:30 +02:00
Marcel
24530cf85b feat(fts): add search_vector column, GIN index, DB triggers, and FTS repository method (V34)
- V34 migration: adds search_vector tsvector column with GIN index
- BEFORE INSERT/UPDATE trigger on documents rebuilds vector from title (A),
  summary + transcription_blocks.text (B), sender/receiver names (C),
  tag names + location (D) using german FTS config
- AFTER triggers on transcription_blocks, document_receivers, document_tags
  touch the parent document row to re-fire the BEFORE UPDATE trigger
- DocumentRepository.findRankedIdsByFts() native query using websearch_to_tsquery
- DocumentFtsTest: 12 integration tests covering stemming, trigger sync,
  ranking, stop words, malformed input, receiver and tag search

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 11:35:16 +02:00
Marcel
48223d5a3d devops(backend): pin eclipse-temurin tags, skip test compilation, document jar glob
- Pin to eclipse-temurin:21.0.10_7-{jdk,jre}-noble for reproducible builds
- Switch -DskipTests to -Dmaven.test.skip=true: skips test compilation entirely,
  not just execution — faster and avoids build failures from test-only missing classes
- Add comment on COPY *.jar explaining why the glob is safe (Spring Boot renames
  the pre-repackage artifact to .jar.original, leaving only one .jar in target/)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 11:33:03 +02:00
Marcel
04069c0286 devops(backend): add .dockerignore to exclude target/ from build context
Prevents 111MB of compiled output from being sent to the BuildKit daemon
on cold builds. Only .mvn/, mvnw, pom.xml, and src/ are needed by the
three COPY instructions in the Dockerfile.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 11:33:03 +02:00
Marcel
3c46d820ad devops(backend): switch to multi-stage Docker build
Replace runtime mvn spring-boot:run with a proper multi-stage build:
- Stage 1 (builder): compiles JAR with BuildKit cache mount for ~/.m2
- Stage 2 (runtime): eclipse-temurin:21-jre with only the JAR

Removes the backend source volume mount and maven_cache named volume.
Deploy with: docker compose up -d --build

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 11:33:03 +02:00
Marcel
81da127381 refactor(ocr): rename findTop5 to findTop10 for headroom as frontend shows 3 by default
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 21:17:53 +02:00
Marcel
f206c0b9e9 test(ocr): add unit tests for triggerSegTraining() — conflict, threshold, happy path, failure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 21:17:53 +02:00
Marcel
15e532eb96 refactor(ocr): extract assertNoRunningTraining() to eliminate duplicate guard
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 21:17:53 +02:00
Marcel
b83465020a fix(backend): store error rate for segmentation training runs
setCer() was called for recognition training but not for segmentation.
The OCR service now returns cer = 1 - accuracy for segtrain; persist it
so the admin panel can display Fehlerrate for both training types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 21:17:53 +02:00
Marcel
72700bd28f test(annotations): add Testcontainers integration tests for V33 chk_annotation_bounds [B1]
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 14:36:37 +02:00
Marcel
40c8f548db docs(annotations): fix ANNOTATION_UPDATE_FAILED Javadoc to reflect 400 status [M3]
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 14:34:55 +02:00
Marcel
a19faa3806 feat(annotations): add @Slf4j and DataIntegrityViolationException catch to updateAnnotation [M2]
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 14:34:03 +02:00
Marcel
f00b470928 test(annotations): add failing test for DataIntegrityViolationException defense [M2 red]
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 14:33:43 +02:00
Marcel
65d606d8bb test(annotations): add missing height and x boundary validation tests [M4]
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 14:31:07 +02:00
Marcel
4d3207fc27 test(annotations): verify save() is called in updateAnnotation test [M5]
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 14:30:50 +02:00
Marcel
ff231db671 feat(annotations): add PATCH endpoint for annotation resize/move
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 10:42:08 +02:00
Marcel
1558881c01 feat(annotations): add updateAnnotation service method with partial-update DTO
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 10:39:50 +02:00
Marcel
26c7181ba4 feat(annotations): add ANNOTATION_UPDATE_FAILED error code
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 10:38:33 +02:00
Marcel
f76a6c0ee5 migration(annotations): add chk_annotation_bounds CHECK constraint (V33)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 10:38:11 +02:00
Marcel
22ee3dce68 fix(api): remove duplicate import and align patchTrainingLabel OpenAPI response to 204
Removed duplicate import of org.mockito.ArgumentMatchers.eq from
DocumentControllerTest (lines 32+35). Added @ApiResponse(responseCode="204")
to patchTrainingLabel so the generated OpenAPI spec matches the actual
NoContent response the controller returns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 10:07:41 +02:00
Marcel
dc283ba271 fix(training): remove @Transactional from triggerTraining to avoid holding DB connection during OCR HTTP call
OcrTrainingService.triggerTraining() and triggerSegTraining() held a DB
connection open for the entire ketos training run (potentially minutes),
risking connection pool exhaustion. Replaced class-level @Transactional
with TransactionTemplate for narrow DB writes: guard+create and
result-record each run in their own short transaction; the HTTP call to
the OCR service runs between them with no open connection.

Also replaces blockRepository.findAll().size() with blockRepository.count()
in getTrainingInfo() to avoid loading every block into heap on each poll.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 09:59:12 +02:00
Marcel
7b79dc105b test(migrations): add Testcontainers integration tests for V23 + V30 constraints
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 0s
V23 introduced a JSONB check constraint (chk_annotation_polygon_quad)
requiring polygon arrays to have exactly 4 points. V30 introduced a
partial unique index preventing two concurrent RUNNING training runs.
These are DB-level invariants that unit tests cannot verify — five
Testcontainers tests now assert they are correctly applied by Flyway
and enforced by PostgreSQL.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 23:07:17 +02:00
Marcel
287920a982 docs(ocr): document single-node constraint for OCR training
Training reloads the Kraken model in-process on the Python service.
The DB-level RUNNING constraint prevents concurrent API calls but
cannot protect against multi-replica deployments. Added explicit
comments in docker-compose.yml and OcrTrainingService to prevent
accidental horizontal scaling. See ADR-001.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 23:01:45 +02:00
Marcel
2b355e748e fix(ocr): increase presigned URL TTL from 15 min to 1 hour
A 100-page document at ~10 s/page takes ~17 min on CPU-only hardware,
which could cause the presigned URL to expire mid-OCR job. 1 hour gives
ample headroom for any realistic document size in this archive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 23:00:52 +02:00
Marcel
2181fe0b50 test(annotations): fix AnnotationServiceTest — add missing TranscriptionBlockRepository mock
The cascade-delete commit (5a5a8b6) added blockRepository.deleteByAnnotationId()
to AnnotationService.deleteAnnotation(), but the test class was not updated to
mock TranscriptionBlockRepository. Mockito injected null, causing deleteAnnotation_succeeds_whenOwner
to throw NPE. Adds the mock, verifies the cascade call, and adds an inOrder test
asserting the block is deleted before the annotation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 23:00:09 +02:00
Marcel
5a5a8b6e5c fix(annotations): cascade-delete transcription block when annotation is deleted
Some checks failed
CI / Unit & Component Tests (push) Failing after 1s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
The DELETE endpoint was returning 500 due to a FK constraint violation.
`deleteAnnotation` now calls `blockRepository.deleteByAnnotationId()`
before removing the annotation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 22:31:02 +02:00
Marcel
49c9022285 fix(training): switch to PAGE XML format for kurrent recognition training
Kraken 7 removed support for the legacy `path` format (image + .gt.txt
pairs) in VGSLRecognitionDataModule despite the CLI still advertising it.
Switching to PAGE XML (-f page) format which is the supported standard.

- Java export now writes .xml alongside .png (PAGE XML with TextLine,
  Baseline at 75% height, and Unicode transcription)
- XML special characters in transcription text are escaped (&amp; &lt; &gt;)
- Python trainer globs *.xml and passes -f page to ketos train
- Regenerated frontend API types to include cer/loss/accuracy/epochs on
  OcrTrainingRun (were missing, causing empty CER column in history)
- Updated and extended TrainingDataExportServiceTest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 21:45:08 +02:00
Marcel
22954f348a feat(training): track and display CER per training run
After each training run, the Character Error Rate (CER = 1 - accuracy),
loss, accuracy, and epoch count are now stored on the OcrTrainingRun
record and shown in the training history table.

Also adds the missing POST /api/ocr/segtrain endpoint and the
triggerSegTraining service method so the segmentation training card
can actually trigger training.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 19:01:10 +02:00
Marcel
a99afef319 fix(training): only count reviewed blocks as checked text for recognition
Previously all MANUAL blocks counted as eligible training data, even ones
where text was filled in by guided OCR but never explicitly reviewed. This
caused segmentation and recognition counts to always match.

Now only reviewed=true blocks qualify for recognition training, so the
counts properly reflect: segments = all drawn annotation boxes,
checked text = only boxes where the user has verified the transcription.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 18:00:59 +02:00
Marcel
063095f58c fix(training): count segmentation blocks regardless of text content
The findSegmentationBlocks query was filtering out blocks with non-empty
text. Segmentation training only needs annotation geometry (polygon/bbox),
not transcription text — so any MANUAL block on a KURRENT_SEGMENTATION
document should count, regardless of whether it has text.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 17:14:40 +02:00
Marcel
b6f74fd6fc refactor(annotations): remove overlap check to allow intersecting regions
Historical letter lines often intersect, so the system must support
overlapping annotation regions. Removed the overlap guard from
createAnnotation(), deleted ErrorCode.ANNOTATION_OVERLAP, and cleaned
up all tests and frontend error mappings that referenced it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 16:48:18 +02:00
Marcel
8618e520b5 fix(ocr): fill empty MANUAL blocks in guided OCR mode
When a user draws annotation boxes to mark OCR regions, the blocks are
created with source=MANUAL and empty text. upsertGuidedBlock was
protecting all MANUAL blocks unconditionally, so guided OCR silently
produced no output for these drawn-but-empty blocks.

Changed the guard to only protect non-empty MANUAL blocks — empty ones
are treated like OCR blocks and get their text filled in.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 16:25:23 +02:00
Marcel
ee58b63517 feat(ocr): add guided OCR mode using existing annotation regions
When a document has manually drawn annotation boxes, the user can now
enable "Nur annotierte Bereiche" in the OCR trigger panel. The engine
skips layout detection entirely and runs recognition only within the
pre-drawn bounding boxes, preserving manual transcription blocks.

- Python: adds OcrRegion model, extend OcrRequest/OcrBlock; guided
  branch in /ocr/stream groups by page and crops each region
- Engines: add extract_region_text() to both Kraken and Surya
- Java: adds OcrBlockResult.annotationId, OcrClient.OcrRegion,
  TriggerOcrDTO.useExistingAnnotations; OcrAsyncRunner dispatches to
  upsertGuidedBlock when annotationId is present; OcrService threads
  the flag through to runSingleDocument
- TranscriptionService: adds upsertGuidedBlock (creates, updates OCR,
  or preserves MANUAL blocks)
- Frontend: guided OCR toggle in OcrTrigger shown when blocks exist;
  skips destructive-replace confirmation in guided mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 15:57:54 +02:00
Marcel
9b2f91ee59 feat(training): add segmentation training pipeline and complete Part 6
- Add /segtrain endpoint to OCR service (ZIP upload, ketos.segtrain,
  backup rotation, in-process model reload)
- Add segtrainModel() to OcrClient and RestClientOcrClient (10-min timeout,
  X-Training-Token header)
- Add SegmentationTrainingExportService: PAGE XML export with polygon
  de-normalization and per-page PNG rendering via PDFBox
- Add GET /api/ocr/segmentation-training-data/export endpoint
- Make TranscriptionBlock.text nullable for segmentation-only blocks
  (V31 migration)
- Add Paraglide i18n translation keys for all training UI strings (de/en/es)
- Pass source prop from TranscriptionEditView to TranscriptionBlock

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 15:15:17 +02:00
Marcel
86e9c05aaf feat(training): add Paraglide i18n to training UI components and wire SegmentationTrainingCard
- Convert TrainingHistory, OcrTrainingCard, SegmentationTrainingCard, and
  TranscriptionBlock "Nur Segmentierung" badge to use Paraglide message keys
- Add availableSegBlocks to TrainingInfoResponse to expose segmentation
  block count in the training info endpoint
- Wire SegmentationTrainingCard into admin/system page below OCR training card
- Update api.ts with availableSegBlocks field

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 15:14:27 +02:00
Marcel
88e005eb49 feat(ocr): add training history + POST /train + GET /training-info endpoints
- OcrTrainingRun entity + V30 migration (partial unique index prevents
  concurrent runs at DB level)
- OcrTrainingService: concurrent-run guard, 5-block threshold, MDC log
  correlation, orphan recovery on ApplicationReadyEvent
- POST /api/ocr/train (ADMIN) + GET /api/ocr/training-info (ADMIN)
- TRAINING_ALREADY_RUNNING ErrorCode
- 6 OcrTrainingServiceTest + 6 OcrControllerTest tests for the new endpoints

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 14:47:56 +02:00
Marcel
bc97a2dade feat(ocr): add /train endpoint to OCR service and OcrClient.trainModel()
- POST /train in ocr-service with ZIP Slip validation, TemporaryDirectory,
  ketos transfer learning, timestamped backups (keep last 3), in-process reload
- X-Training-Token auth (no-op in dev when TRAINING_TOKEN env is empty)
- trainModel() in OcrClient interface + RestClientOcrClient (10-min timeout,
  multipart upload, forwards X-Training-Token when configured)
- TRAINING_TOKEN env var wired in docker-compose; --workers 2 in Dockerfile
  so /health stays responsive during synchronous training

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 14:40:53 +02:00
Marcel
cfa3c4df67 feat(training): add recognition training data export
- TrainingDataExportService: PDFBox rendering at 300 DPI, crop by
  annotation coordinates, ZIP with <uuid>.png + <uuid>.gt.txt pairs
- Skips documents with missing S3 files (logs WARN, continues)
- GET /api/ocr/training-data/export (ADMIN); 204 when no enrolled blocks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 14:35:06 +02:00
Marcel
fdf1eb92ad feat(training): add document-level training enrollment
- V29 migration: document_training_labels join table
- TrainingLabel enum: KURRENT_RECOGNITION, KURRENT_SEGMENTATION
- Document.trainingLabels @ElementCollection
- DocumentService.addTrainingLabel / removeTrainingLabel
- PATCH /api/documents/{id}/training-labels (WRITE_ALL)
- Auto-enroll on Kurrent OCR trigger (OcrService.startOcr)
- TranscriptionEditView: enrollment chips in panel footer
- JPQL queries updated to use MEMBER OF trainingLabels

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 14:30:51 +02:00
Marcel
dd47a48d90 feat(ocr): add unique constraint on (job_id, document_id)
Prevents the same document from being added to an OCR job twice.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:28:18 +02:00
Marcel
08b1cd5dac fix(ocr): reduce async queue capacity from 100 to 10
Queue capacity of 100 is disproportionate for 2 worker threads — a
backed-up queue would represent hours of unprocessed OCR jobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:27:58 +02:00
Marcel
5a97316940 fix(ocr): log warning when user ID resolution fails
The resolveUserId() catch block was silently swallowing exceptions,
making auth failures invisible in logs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:27:39 +02:00
Marcel
9282e46a02 fix(ocr): handle unknown NDJSON fields with @JsonIgnoreProperties
Added @JsonIgnoreProperties(ignoreUnknown = true) to OcrBlockResult so
new fields from the Python OCR service don't crash the Java parser,
while keeping FAIL_ON_UNKNOWN_PROPERTIES strict globally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:27:20 +02:00
Marcel
caae2ead81 refactor(ocr): route block lifecycle through TranscriptionService
OcrAsyncRunner was bypassing TranscriptionService — building blocks
directly and calling blockRepository.save(), skipping sanitizeText()
and saveVersion(). Also replaced N individual deleteBlock() calls with
a single bulk deleteAllBlocksByDocument() for OCR re-runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:27:01 +02:00
Marcel
6a0fd25662 fix(ocr): persist scriptType override via DocumentService transaction
OcrService.startOcr() was setting scriptType on a detached entity,
silently losing the mutation. Added DocumentService.updateScriptType()
with @Transactional to persist the change properly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:26:37 +02:00