familienarchiv

Author	SHA1	Message	Date
Marcel	7ec3e6170d	feat(fts): backfill search_vector for all existing documents (V35) Fires the BEFORE UPDATE trigger for every documents row, which recomputes the tsvector from all currently-linked metadata, blocks, receivers, and tags. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 11:35:30 +02:00
Marcel	7d456d8e8b	feat(fts): replace ILIKE hasText with FTS two-phase search and RELEVANCE sort - DocumentSort: add RELEVANCE enum value - DocumentSpecifications: remove hasText() ILIKE, add hasIds(List<UUID>) for FTS-pre-filtered ID sets - DocumentService.searchDocuments(): FTS two-phase path — findRankedIdsByFts() returns ranked UUIDs, hasIds() narrows subsequent Specification query, in-memory re-sort preserves rank order; RELEVANCE is the default when text is present and no explicit non-relevance sort is requested - DocumentSpecificationsTest: remove hasText() tests (Specification removed) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 11:35:30 +02:00
Marcel	24530cf85b	feat(fts): add search_vector column, GIN index, DB triggers, and FTS repository method (V34) - V34 migration: adds search_vector tsvector column with GIN index - BEFORE INSERT/UPDATE trigger on documents rebuilds vector from title (A), summary + transcription_blocks.text (B), sender/receiver names (C), tag names + location (D) using german FTS config - AFTER triggers on transcription_blocks, document_receivers, document_tags touch the parent document row to re-fire the BEFORE UPDATE trigger - DocumentRepository.findRankedIdsByFts() native query using websearch_to_tsquery - DocumentFtsTest: 12 integration tests covering stemming, trigger sync, ranking, stop words, malformed input, receiver and tag search Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 11:35:16 +02:00
Marcel	48223d5a3d	devops(backend): pin eclipse-temurin tags, skip test compilation, document jar glob - Pin to eclipse-temurin:21.0.10_7-{jdk,jre}-noble for reproducible builds - Switch -DskipTests to -Dmaven.test.skip=true: skips test compilation entirely, not just execution — faster and avoids build failures from test-only missing classes - Add comment on COPY *.jar explaining why the glob is safe (Spring Boot renames the pre-repackage artifact to .jar.original, leaving only one .jar in target/) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 11:33:03 +02:00
Marcel	04069c0286	devops(backend): add .dockerignore to exclude target/ from build context Prevents 111MB of compiled output from being sent to the BuildKit daemon on cold builds. Only .mvn/, mvnw, pom.xml, and src/ are needed by the three COPY instructions in the Dockerfile. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 11:33:03 +02:00
Marcel	3c46d820ad	devops(backend): switch to multi-stage Docker build Replace runtime mvn spring-boot:run with a proper multi-stage build: - Stage 1 (builder): compiles JAR with BuildKit cache mount for ~/.m2 - Stage 2 (runtime): eclipse-temurin:21-jre with only the JAR Removes the backend source volume mount and maven_cache named volume. Deploy with: docker compose up -d --build Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 11:33:03 +02:00
Marcel	81da127381	refactor(ocr): rename findTop5 to findTop10 for headroom as frontend shows 3 by default Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	f206c0b9e9	test(ocr): add unit tests for triggerSegTraining() — conflict, threshold, happy path, failure Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	15e532eb96	refactor(ocr): extract assertNoRunningTraining() to eliminate duplicate guard Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	b83465020a	fix(backend): store error rate for segmentation training runs setCer() was called for recognition training but not for segmentation. The OCR service now returns cer = 1 - accuracy for segtrain; persist it so the admin panel can display Fehlerrate for both training types. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 21:17:53 +02:00
Marcel	72700bd28f	test(annotations): add Testcontainers integration tests for V33 chk_annotation_bounds [B1] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 14:36:37 +02:00
Marcel	40c8f548db	docs(annotations): fix ANNOTATION_UPDATE_FAILED Javadoc to reflect 400 status [M3] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 14:34:55 +02:00
Marcel	a19faa3806	feat(annotations): add @Slf4j and DataIntegrityViolationException catch to updateAnnotation [M2] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 14:34:03 +02:00
Marcel	f00b470928	test(annotations): add failing test for DataIntegrityViolationException defense [M2 red] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 14:33:43 +02:00
Marcel	65d606d8bb	test(annotations): add missing height and x boundary validation tests [M4] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 14:31:07 +02:00
Marcel	4d3207fc27	test(annotations): verify save() is called in updateAnnotation test [M5] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 14:30:50 +02:00
Marcel	ff231db671	feat(annotations): add PATCH endpoint for annotation resize/move Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 10:42:08 +02:00
Marcel	1558881c01	feat(annotations): add updateAnnotation service method with partial-update DTO Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 10:39:50 +02:00
Marcel	26c7181ba4	feat(annotations): add ANNOTATION_UPDATE_FAILED error code Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 10:38:33 +02:00
Marcel	f76a6c0ee5	migration(annotations): add chk_annotation_bounds CHECK constraint (V33) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 10:38:11 +02:00
Marcel	22ee3dce68	fix(api): remove duplicate import and align patchTrainingLabel OpenAPI response to 204 Removed duplicate import of org.mockito.ArgumentMatchers.eq from DocumentControllerTest (lines 32+35). Added @ApiResponse(responseCode="204") to patchTrainingLabel so the generated OpenAPI spec matches the actual NoContent response the controller returns. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 10:07:41 +02:00
Marcel	dc283ba271	fix(training): remove @Transactional from triggerTraining to avoid holding DB connection during OCR HTTP call OcrTrainingService.triggerTraining() and triggerSegTraining() held a DB connection open for the entire ketos training run (potentially minutes), risking connection pool exhaustion. Replaced class-level @Transactional with TransactionTemplate for narrow DB writes: guard+create and result-record each run in their own short transaction; the HTTP call to the OCR service runs between them with no open connection. Also replaces blockRepository.findAll().size() with blockRepository.count() in getTrainingInfo() to avoid loading every block into heap on each poll. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 09:59:12 +02:00
Marcel	7b79dc105b	test(migrations): add Testcontainers integration tests for V23 + V30 constraints Some checks failed CI / Unit & Component Tests (push) Failing after 1s Details CI / Backend Unit Tests (push) Failing after 1s Details CI / Unit & Component Tests (pull_request) Failing after 1s Details CI / Backend Unit Tests (pull_request) Failing after 0s Details V23 introduced a JSONB check constraint (chk_annotation_polygon_quad) requiring polygon arrays to have exactly 4 points. V30 introduced a partial unique index preventing two concurrent RUNNING training runs. These are DB-level invariants that unit tests cannot verify — five Testcontainers tests now assert they are correctly applied by Flyway and enforced by PostgreSQL. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 23:07:17 +02:00
Marcel	287920a982	docs(ocr): document single-node constraint for OCR training Training reloads the Kraken model in-process on the Python service. The DB-level RUNNING constraint prevents concurrent API calls but cannot protect against multi-replica deployments. Added explicit comments in docker-compose.yml and OcrTrainingService to prevent accidental horizontal scaling. See ADR-001. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 23:01:45 +02:00
Marcel	2b355e748e	fix(ocr): increase presigned URL TTL from 15 min to 1 hour A 100-page document at ~10 s/page takes ~17 min on CPU-only hardware, which could cause the presigned URL to expire mid-OCR job. 1 hour gives ample headroom for any realistic document size in this archive. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 23:00:52 +02:00
Marcel	2181fe0b50	test(annotations): fix AnnotationServiceTest — add missing TranscriptionBlockRepository mock The cascade-delete commit (`5a5a8b6`) added blockRepository.deleteByAnnotationId() to AnnotationService.deleteAnnotation(), but the test class was not updated to mock TranscriptionBlockRepository. Mockito injected null, causing deleteAnnotation_succeeds_whenOwner to throw NPE. Adds the mock, verifies the cascade call, and adds an inOrder test asserting the block is deleted before the annotation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 23:00:09 +02:00
Marcel	5a5a8b6e5c	fix(annotations): cascade-delete transcription block when annotation is deleted Some checks failed CI / Unit & Component Tests (push) Failing after 1s Details CI / Backend Unit Tests (push) Failing after 1s Details CI / Unit & Component Tests (pull_request) Failing after 1s Details CI / Backend Unit Tests (pull_request) Failing after 1s Details The DELETE endpoint was returning 500 due to a FK constraint violation. `deleteAnnotation` now calls `blockRepository.deleteByAnnotationId()` before removing the annotation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 22:31:02 +02:00
Marcel	49c9022285	fix(training): switch to PAGE XML format for kurrent recognition training Kraken 7 removed support for the legacy `path` format (image + .gt.txt pairs) in VGSLRecognitionDataModule despite the CLI still advertising it. Switching to PAGE XML (-f page) format which is the supported standard. - Java export now writes .xml alongside .png (PAGE XML with TextLine, Baseline at 75% height, and Unicode transcription) - XML special characters in transcription text are escaped (& < >) - Python trainer globs *.xml and passes -f page to ketos train - Regenerated frontend API types to include cer/loss/accuracy/epochs on OcrTrainingRun (were missing, causing empty CER column in history) - Updated and extended TrainingDataExportServiceTest Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:45:08 +02:00
Marcel	22954f348a	feat(training): track and display CER per training run After each training run, the Character Error Rate (CER = 1 - accuracy), loss, accuracy, and epoch count are now stored on the OcrTrainingRun record and shown in the training history table. Also adds the missing POST /api/ocr/segtrain endpoint and the triggerSegTraining service method so the segmentation training card can actually trigger training. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 19:01:10 +02:00
Marcel	a99afef319	fix(training): only count reviewed blocks as checked text for recognition Previously all MANUAL blocks counted as eligible training data, even ones where text was filled in by guided OCR but never explicitly reviewed. This caused segmentation and recognition counts to always match. Now only reviewed=true blocks qualify for recognition training, so the counts properly reflect: segments = all drawn annotation boxes, checked text = only boxes where the user has verified the transcription. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 18:00:59 +02:00
Marcel	063095f58c	fix(training): count segmentation blocks regardless of text content The findSegmentationBlocks query was filtering out blocks with non-empty text. Segmentation training only needs annotation geometry (polygon/bbox), not transcription text — so any MANUAL block on a KURRENT_SEGMENTATION document should count, regardless of whether it has text. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 17:14:40 +02:00
Marcel	b6f74fd6fc	refactor(annotations): remove overlap check to allow intersecting regions Historical letter lines often intersect, so the system must support overlapping annotation regions. Removed the overlap guard from createAnnotation(), deleted ErrorCode.ANNOTATION_OVERLAP, and cleaned up all tests and frontend error mappings that referenced it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 16:48:18 +02:00
Marcel	8618e520b5	fix(ocr): fill empty MANUAL blocks in guided OCR mode When a user draws annotation boxes to mark OCR regions, the blocks are created with source=MANUAL and empty text. upsertGuidedBlock was protecting all MANUAL blocks unconditionally, so guided OCR silently produced no output for these drawn-but-empty blocks. Changed the guard to only protect non-empty MANUAL blocks — empty ones are treated like OCR blocks and get their text filled in. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 16:25:23 +02:00
Marcel	ee58b63517	feat(ocr): add guided OCR mode using existing annotation regions When a document has manually drawn annotation boxes, the user can now enable "Nur annotierte Bereiche" in the OCR trigger panel. The engine skips layout detection entirely and runs recognition only within the pre-drawn bounding boxes, preserving manual transcription blocks. - Python: adds OcrRegion model, extend OcrRequest/OcrBlock; guided branch in /ocr/stream groups by page and crops each region - Engines: add extract_region_text() to both Kraken and Surya - Java: adds OcrBlockResult.annotationId, OcrClient.OcrRegion, TriggerOcrDTO.useExistingAnnotations; OcrAsyncRunner dispatches to upsertGuidedBlock when annotationId is present; OcrService threads the flag through to runSingleDocument - TranscriptionService: adds upsertGuidedBlock (creates, updates OCR, or preserves MANUAL blocks) - Frontend: guided OCR toggle in OcrTrigger shown when blocks exist; skips destructive-replace confirmation in guided mode Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 15:57:54 +02:00
Marcel	9b2f91ee59	feat(training): add segmentation training pipeline and complete Part 6 - Add /segtrain endpoint to OCR service (ZIP upload, ketos.segtrain, backup rotation, in-process model reload) - Add segtrainModel() to OcrClient and RestClientOcrClient (10-min timeout, X-Training-Token header) - Add SegmentationTrainingExportService: PAGE XML export with polygon de-normalization and per-page PNG rendering via PDFBox - Add GET /api/ocr/segmentation-training-data/export endpoint - Make TranscriptionBlock.text nullable for segmentation-only blocks (V31 migration) - Add Paraglide i18n translation keys for all training UI strings (de/en/es) - Pass source prop from TranscriptionEditView to TranscriptionBlock Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 15:15:17 +02:00
Marcel	86e9c05aaf	feat(training): add Paraglide i18n to training UI components and wire SegmentationTrainingCard - Convert TrainingHistory, OcrTrainingCard, SegmentationTrainingCard, and TranscriptionBlock "Nur Segmentierung" badge to use Paraglide message keys - Add availableSegBlocks to TrainingInfoResponse to expose segmentation block count in the training info endpoint - Wire SegmentationTrainingCard into admin/system page below OCR training card - Update api.ts with availableSegBlocks field Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 15:14:27 +02:00
Marcel	88e005eb49	feat(ocr): add training history + POST /train + GET /training-info endpoints - OcrTrainingRun entity + V30 migration (partial unique index prevents concurrent runs at DB level) - OcrTrainingService: concurrent-run guard, 5-block threshold, MDC log correlation, orphan recovery on ApplicationReadyEvent - POST /api/ocr/train (ADMIN) + GET /api/ocr/training-info (ADMIN) - TRAINING_ALREADY_RUNNING ErrorCode - 6 OcrTrainingServiceTest + 6 OcrControllerTest tests for the new endpoints Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 14:47:56 +02:00
Marcel	bc97a2dade	feat(ocr): add /train endpoint to OCR service and OcrClient.trainModel() - POST /train in ocr-service with ZIP Slip validation, TemporaryDirectory, ketos transfer learning, timestamped backups (keep last 3), in-process reload - X-Training-Token auth (no-op in dev when TRAINING_TOKEN env is empty) - trainModel() in OcrClient interface + RestClientOcrClient (10-min timeout, multipart upload, forwards X-Training-Token when configured) - TRAINING_TOKEN env var wired in docker-compose; --workers 2 in Dockerfile so /health stays responsive during synchronous training Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 14:40:53 +02:00
Marcel	cfa3c4df67	feat(training): add recognition training data export - TrainingDataExportService: PDFBox rendering at 300 DPI, crop by annotation coordinates, ZIP with <uuid>.png + <uuid>.gt.txt pairs - Skips documents with missing S3 files (logs WARN, continues) - GET /api/ocr/training-data/export (ADMIN); 204 when no enrolled blocks Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 14:35:06 +02:00
Marcel	fdf1eb92ad	feat(training): add document-level training enrollment - V29 migration: document_training_labels join table - TrainingLabel enum: KURRENT_RECOGNITION, KURRENT_SEGMENTATION - Document.trainingLabels @ElementCollection - DocumentService.addTrainingLabel / removeTrainingLabel - PATCH /api/documents/{id}/training-labels (WRITE_ALL) - Auto-enroll on Kurrent OCR trigger (OcrService.startOcr) - TranscriptionEditView: enrollment chips in panel footer - JPQL queries updated to use MEMBER OF trainingLabels Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 14:30:51 +02:00
Marcel	dd47a48d90	feat(ocr): add unique constraint on (job_id, document_id) Prevents the same document from being added to an OCR job twice. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:28:18 +02:00
Marcel	08b1cd5dac	fix(ocr): reduce async queue capacity from 100 to 10 Queue capacity of 100 is disproportionate for 2 worker threads — a backed-up queue would represent hours of unprocessed OCR jobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:27:58 +02:00
Marcel	5a97316940	fix(ocr): log warning when user ID resolution fails The resolveUserId() catch block was silently swallowing exceptions, making auth failures invisible in logs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:27:39 +02:00
Marcel	9282e46a02	fix(ocr): handle unknown NDJSON fields with @JsonIgnoreProperties Added @JsonIgnoreProperties(ignoreUnknown = true) to OcrBlockResult so new fields from the Python OCR service don't crash the Java parser, while keeping FAIL_ON_UNKNOWN_PROPERTIES strict globally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:27:20 +02:00
Marcel	caae2ead81	refactor(ocr): route block lifecycle through TranscriptionService OcrAsyncRunner was bypassing TranscriptionService — building blocks directly and calling blockRepository.save(), skipping sanitizeText() and saveVersion(). Also replaced N individual deleteBlock() calls with a single bulk deleteAllBlocksByDocument() for OCR re-runs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:27:01 +02:00
Marcel	6a0fd25662	fix(ocr): persist scriptType override via DocumentService transaction OcrService.startOcr() was setting scriptType on a detached entity, silently losing the mutation. Added DocumentService.updateScriptType() with @Transactional to persist the change properly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:26:37 +02:00
Marcel	2d43f09172	refactor(ocr): move repository access from OcrController into OcrService OcrController was injecting OcrJobRepository and OcrJobDocumentRepository directly, violating the Controller → Service → Repository layering rule. Moved getJob() and getDocumentOcrStatus() logic into OcrService. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:26:14 +02:00
Marcel	84aca240ea	fix(ocr): remove misleading ANALYZING progress before streaming starts The ANALYZING message appeared while the Python service was still downloading the PDF and loading models. Remove it so the LOADING message ("Lade Modell und Dokument…") stays visible until the first ANALYZING_PAGE event arrives from the stream. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:40:54 +02:00
Marcel	292dc66f3c	feat(ocr): rewrite runSingleDocument to use streamBlocks with per-page progress Replace the single extractBlocks() call with streamBlocks() that processes pages incrementally. Each page's blocks are persisted immediately via createSingleBlock(). Progress updates use the ANALYZING_PAGE:current:total:blocks format. Per-page errors are logged at WARN level without failing the entire job. The batch path (processDocument) remains on the old extractBlocks() path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:07:06 +02:00
Marcel	6823973429	refactor(ocr): extract createSingleBlock from createTranscriptionBlocks Enable per-page block creation during streaming by extracting the loop body into a package-private createSingleBlock() method with an explicit sortOrder parameter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:04:02 +02:00

1 2 3 4 5

234 Commits