familienarchiv

Author	SHA1	Message	Date
Marcel	89a18c430e	fix(training): limit CPU threads and epochs to prevent RAM exhaustion Force CPU-only training (--device cpu), cap OpenMP/BLAS thread pool at 2 (--threads 2), and reduce epochs from 50 to 10 (-N 10). 50 epochs on a laptop OOM-killed the container. 10 epochs is sufficient for incremental fine-tuning runs; more data is added over time and training re-run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 21:09:13 +02:00
Marcel	8dec5b5976	fix(training): disable DataLoader workers in subprocess training DataLoader worker subprocesses crash inside Docker due to multiprocessing fork restrictions. Pass --workers 0 to both ketos train and ketos segtrain so data loading runs in the main process. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 20:58:32 +02:00
Marcel	e33164c4aa	fix(training): use ketos CLI subprocess instead of missing Python API kraken.ketos has no .train or .segtrain attributes in Kraken 7 — both are only exposed as CLI commands. Rewrites both training functions to invoke `ketos train` / `ketos segtrain` via subprocess and parse the best val_metric from checkpoint filenames. Also fixes the OcrTrainingCard history so it only shows non-blla runs (recognition model), matching SegmentationTrainingCard which already filtered to blla-only. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 20:50:21 +02:00
Marcel	22954f348a	feat(training): track and display CER per training run After each training run, the Character Error Rate (CER = 1 - accuracy), loss, accuracy, and epoch count are now stored on the OcrTrainingRun record and shown in the training history table. Also adds the missing POST /api/ocr/segtrain endpoint and the triggerSegTraining service method so the segmentation training card can actually trigger training. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 19:01:10 +02:00
Marcel	a99afef319	fix(training): only count reviewed blocks as checked text for recognition Previously all MANUAL blocks counted as eligible training data, even ones where text was filled in by guided OCR but never explicitly reviewed. This caused segmentation and recognition counts to always match. Now only reviewed=true blocks qualify for recognition training, so the counts properly reflect: segments = all drawn annotation boxes, checked text = only boxes where the user has verified the transcription. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 18:00:59 +02:00
Marcel	1fd5c31fd1	fix(training): pass trainingInfo directly to SegmentationTrainingCard The parent was manually remapping availableSegBlocks → availableBlocks before passing props, which broke after the card was updated to read availableSegBlocks directly. Pass the full trainingInfo object instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 17:55:16 +02:00
Marcel	a514cbca18	fix(training): segmentation card reads availableSegBlocks not availableBlocks Both cards were reading the same availableBlocks field, so the segmentation box always showed the kurrent recognition count. Use the correct availableSegBlocks field from the training info response. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 17:54:20 +02:00
Marcel	063095f58c	fix(training): count segmentation blocks regardless of text content The findSegmentationBlocks query was filtering out blocks with non-empty text. Segmentation training only needs annotation geometry (polygon/bbox), not transcription text — so any MANUAL block on a KURRENT_SEGMENTATION document should count, regardless of whether it has text. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 17:14:40 +02:00
Marcel	b6f74fd6fc	refactor(annotations): remove overlap check to allow intersecting regions Historical letter lines often intersect, so the system must support overlapping annotation regions. Removed the overlap guard from createAnnotation(), deleted ErrorCode.ANNOTATION_OVERLAP, and cleaned up all tests and frontend error mappings that referenced it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 16:48:18 +02:00
Marcel	8618e520b5	fix(ocr): fill empty MANUAL blocks in guided OCR mode When a user draws annotation boxes to mark OCR regions, the blocks are created with source=MANUAL and empty text. upsertGuidedBlock was protecting all MANUAL blocks unconditionally, so guided OCR silently produced no output for these drawn-but-empty blocks. Changed the guard to only protect non-empty MANUAL blocks — empty ones are treated like OCR blocks and get their text filled in. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 16:25:23 +02:00
Marcel	3e34366702	fix(ocr): use cw-1/ch-1 for synthetic baseline bounds to pass Kraken's >= check Kraken's segmentation bounds check rejects coordinates where any point satisfies x >= im.width or y >= im.height (strictly >=, not >). Using (cw, ch) as the boundary corner was triggering this for every crop. Changed to (cw-1, ch-1) so all coordinates are strictly inside the image. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 16:21:00 +02:00
Marcel	051c43f088	fix(ocr): use synthetic baseline in guided OCR to avoid blla crash on small crops blla.segment() is a full-page layout detection model that kills the worker process when called on tiny annotation crops (e.g. 597x89 px). For guided OCR the annotation region IS already the text line, so segmentation is unnecessary. Replace the blla call with a single synthetic BaselineLine that spans the full crop width — rpred then runs recognition on the whole crop. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 16:09:35 +02:00
Marcel	ee58b63517	feat(ocr): add guided OCR mode using existing annotation regions When a document has manually drawn annotation boxes, the user can now enable "Nur annotierte Bereiche" in the OCR trigger panel. The engine skips layout detection entirely and runs recognition only within the pre-drawn bounding boxes, preserving manual transcription blocks. - Python: adds OcrRegion model, extend OcrRequest/OcrBlock; guided branch in /ocr/stream groups by page and crops each region - Engines: add extract_region_text() to both Kraken and Surya - Java: adds OcrBlockResult.annotationId, OcrClient.OcrRegion, TriggerOcrDTO.useExistingAnnotations; OcrAsyncRunner dispatches to upsertGuidedBlock when annotationId is present; OcrService threads the flag through to runSingleDocument - TranscriptionService: adds upsertGuidedBlock (creates, updates OCR, or preserves MANUAL blocks) - Frontend: guided OCR toggle in OcrTrigger shown when blocks exist; skips destructive-replace confirmation in guided mode Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 15:57:54 +02:00
Marcel	9b2f91ee59	feat(training): add segmentation training pipeline and complete Part 6 - Add /segtrain endpoint to OCR service (ZIP upload, ketos.segtrain, backup rotation, in-process model reload) - Add segtrainModel() to OcrClient and RestClientOcrClient (10-min timeout, X-Training-Token header) - Add SegmentationTrainingExportService: PAGE XML export with polygon de-normalization and per-page PNG rendering via PDFBox - Add GET /api/ocr/segmentation-training-data/export endpoint - Make TranscriptionBlock.text nullable for segmentation-only blocks (V31 migration) - Add Paraglide i18n translation keys for all training UI strings (de/en/es) - Pass source prop from TranscriptionEditView to TranscriptionBlock Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 15:15:17 +02:00
Marcel	86e9c05aaf	feat(training): add Paraglide i18n to training UI components and wire SegmentationTrainingCard - Convert TrainingHistory, OcrTrainingCard, SegmentationTrainingCard, and TranscriptionBlock "Nur Segmentierung" badge to use Paraglide message keys - Add availableSegBlocks to TrainingInfoResponse to expose segmentation block count in the training info endpoint - Wire SegmentationTrainingCard into admin/system page below OCR training card - Update api.ts with availableSegBlocks field Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 15:14:27 +02:00
Marcel	4e08d31e01	feat(admin): add OCR training card to admin/system page - TrainingHistory.svelte: responsive table with status badges (green/red/animated pulse), keyed iteration, empty-state row - OcrTrainingCard.svelte: shows available blocks/docs, disabled states (< 5 blocks, service down), in-flight "…" state, 5s success message, embeds TrainingHistory - Wired into admin/system/+page.svelte via fetchTrainingInfo() in $effect - Regenerated api.ts with OcrTrainingRun + TrainingInfoResponse types - TRAINING_ALREADY_RUNNING error code in errors.ts + de/en/es translations - 7 OcrTrainingCard Vitest tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 14:58:13 +02:00
Marcel	88e005eb49	feat(ocr): add training history + POST /train + GET /training-info endpoints - OcrTrainingRun entity + V30 migration (partial unique index prevents concurrent runs at DB level) - OcrTrainingService: concurrent-run guard, 5-block threshold, MDC log correlation, orphan recovery on ApplicationReadyEvent - POST /api/ocr/train (ADMIN) + GET /api/ocr/training-info (ADMIN) - TRAINING_ALREADY_RUNNING ErrorCode - 6 OcrTrainingServiceTest + 6 OcrControllerTest tests for the new endpoints Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 14:47:56 +02:00
Marcel	bc97a2dade	feat(ocr): add /train endpoint to OCR service and OcrClient.trainModel() - POST /train in ocr-service with ZIP Slip validation, TemporaryDirectory, ketos transfer learning, timestamped backups (keep last 3), in-process reload - X-Training-Token auth (no-op in dev when TRAINING_TOKEN env is empty) - trainModel() in OcrClient interface + RestClientOcrClient (10-min timeout, multipart upload, forwards X-Training-Token when configured) - TRAINING_TOKEN env var wired in docker-compose; --workers 2 in Dockerfile so /health stays responsive during synchronous training Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 14:40:53 +02:00
Marcel	cfa3c4df67	feat(training): add recognition training data export - TrainingDataExportService: PDFBox rendering at 300 DPI, crop by annotation coordinates, ZIP with <uuid>.png + <uuid>.gt.txt pairs - Skips documents with missing S3 files (logs WARN, continues) - GET /api/ocr/training-data/export (ADMIN); 204 when no enrolled blocks Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 14:35:06 +02:00
Marcel	fdf1eb92ad	feat(training): add document-level training enrollment - V29 migration: document_training_labels join table - TrainingLabel enum: KURRENT_RECOGNITION, KURRENT_SEGMENTATION - Document.trainingLabels @ElementCollection - DocumentService.addTrainingLabel / removeTrainingLabel - PATCH /api/documents/{id}/training-labels (WRITE_ALL) - Auto-enroll on Kurrent OCR trigger (OcrService.startOcr) - TranscriptionEditView: enrollment chips in panel footer - JPQL queries updated to use MEMBER OF trainingLabels Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 14:30:51 +02:00
Marcel	73229077be	feat(transcription): add sticky review progress counter to TranscriptionEditView Shows 'X / Y geprüft' with a brand-mint progress bar at the top of the transcription panel. Derived from the blocks prop — no extra state. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-13 13:59:35 +02:00
Marcel	33dc4654e5	fix(ocr): use correct Kraken record attributes for line geometry BaselineOCRRecord has 'baseline' and 'boundary' attributes, not 'line' and 'cuts'. The fallback used record.line which doesn't exist, causing AttributeError on every Kurrent OCR page. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 13:16:25 +02:00
Marcel	70689b8f7b	feat(ocr): add SSRF protection for PDF URL downloads Validates PDF download URLs against an ALLOWED_PDF_HOSTS allowlist (default: minio,localhost,127.0.0.1) and disables redirect following to prevent redirect-based SSRF. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:29:42 +02:00
Marcel	0beaf351f0	fix(docker): soften ocr-service dependency and clean up compose Changed ocr-service dependency from service_healthy to service_started since the backend already handles OCR unavailability gracefully. Removed unused APP_S3_INTERNAL_URL env var. Added expose directive and .dockerignore for ocr-service. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:29:21 +02:00
Marcel	b7fd4018c2	fix(frontend): normalize paraglide imports and improve accessibility Changed OcrTrigger and ScriptTypeSelect from 'import * as m' to 'import { m }' to match the rest of the codebase. Increased ScriptTypeSelect label to text-sm and annotation badge font to 12px for better readability. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:29:00 +02:00
Marcel	8c07779a91	fix(ocr): fix SSE retry to actually reconnect EventSource The retry button set status='running' but didn't re-trigger the $effect because jobId hadn't changed. Added retryCount state so the effect re-runs and creates a fresh EventSource on retry. Also added aria-label to the progress bar for accessibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:28:40 +02:00
Marcel	dd47a48d90	feat(ocr): add unique constraint on (job_id, document_id) Prevents the same document from being added to an OCR job twice. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:28:18 +02:00
Marcel	08b1cd5dac	fix(ocr): reduce async queue capacity from 100 to 10 Queue capacity of 100 is disproportionate for 2 worker threads — a backed-up queue would represent hours of unprocessed OCR jobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:27:58 +02:00
Marcel	5a97316940	fix(ocr): log warning when user ID resolution fails The resolveUserId() catch block was silently swallowing exceptions, making auth failures invisible in logs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:27:39 +02:00
Marcel	9282e46a02	fix(ocr): handle unknown NDJSON fields with @JsonIgnoreProperties Added @JsonIgnoreProperties(ignoreUnknown = true) to OcrBlockResult so new fields from the Python OCR service don't crash the Java parser, while keeping FAIL_ON_UNKNOWN_PROPERTIES strict globally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:27:20 +02:00
Marcel	caae2ead81	refactor(ocr): route block lifecycle through TranscriptionService OcrAsyncRunner was bypassing TranscriptionService — building blocks directly and calling blockRepository.save(), skipping sanitizeText() and saveVersion(). Also replaced N individual deleteBlock() calls with a single bulk deleteAllBlocksByDocument() for OCR re-runs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:27:01 +02:00
Marcel	6a0fd25662	fix(ocr): persist scriptType override via DocumentService transaction OcrService.startOcr() was setting scriptType on a detached entity, silently losing the mutation. Added DocumentService.updateScriptType() with @Transactional to persist the change properly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:26:37 +02:00
Marcel	2d43f09172	refactor(ocr): move repository access from OcrController into OcrService OcrController was injecting OcrJobRepository and OcrJobDocumentRepository directly, violating the Controller → Service → Repository layering rule. Moved getJob() and getDocumentOcrStatus() logic into OcrService. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:26:14 +02:00
Marcel	410ef88e1a	refactor(ocr): delete unused OcrProgressBar component The skipped-pages warning is inlined directly in +page.svelte. The component and its tests are no longer needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:53:10 +02:00
Marcel	6b94882409	fix(ocr): remove redundant page counter from progress display The progress message already says "Seite 3 von 7 wird analysiert…" so the separate "3 / 7" counter was redundant. Remove the OcrProgressBar from the page and inline only the skipped-pages warning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:50:05 +02:00
Marcel	b868da07cd	fix(ocr): remove progress bar, keep text-only page counter The thin bar without a border looked broken at low progress values. The text counter (e.g. "1 / 6") already communicates progress clearly so the bar is unnecessary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:46:29 +02:00
Marcel	84aca240ea	fix(ocr): remove misleading ANALYZING progress before streaming starts The ANALYZING message appeared while the Python service was still downloading the PDF and loading models. Remove it so the LOADING message ("Lade Modell und Dokument…") stays visible until the first ANALYZING_PAGE event arrives from the stream. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:40:54 +02:00
Marcel	3fe6eedffb	feat(ocr): allow re-running OCR when transcription blocks already exist Add a collapsible OCR trigger below the block list in edit mode. Uses a <details> element so it's unobtrusive — the primary workflow is editing existing blocks, but users can expand to re-run OCR with a confirmation dialog that warns about replacing existing blocks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:37:51 +02:00
Marcel	69768a104d	test(ocr): add business-logic tests for polygon extraction, Kraken routing, and confidence markers Cover Surya polygon/word-level extraction, health endpoint states, Kraken script-type routing, 503 when models not ready, 400 when Kraken unavailable for Kurrent, and confidence marker application during streaming. Production code coverage: 88%. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:34:23 +02:00
Marcel	97e5138934	fix(ocr): use 1-based page numbers to match frontend PDF viewer The PDF viewer uses 1-based currentPage (starting at 1) but the OCR engines produced 0-based pageNumber from enumerate(). Annotations created by OCR were assigned to page 0, which doesn't exist in the viewer. Change enumerate() to start=1 in both engines and the streaming endpoint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:32:08 +02:00
Marcel	bac67706b9	feat(ocr): integrate progress bar and streaming progress into document page Replace inline translateOcrProgress with the extracted module. Add OcrProgressBar below the spinner during OCR. Parse page numbers from ANALYZING_PAGE progress codes and feed them to the bar. On Done, fill bar to 100% briefly before clearing the overlay. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:15:55 +02:00
Marcel	035f9768bd	feat(ocr): add OcrProgressBar component with page-based ARIA semantics Progress bar shows brand-mint fill on brand-sand background with smooth transition. Displays page counter with tabular-nums and skipped-pages warning in amber when applicable. Only renders when totalPages > 0. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:13:57 +02:00
Marcel	ddec64fc79	feat(ocr): extract translateOcrProgress with ANALYZING_PAGE and DONE:skipped support Move translateOcrProgress from page.svelte to a testable module. Return structured result with currentPage/totalPages/skippedPages for the progress bar. Add ANALYZING_PAGE and DONE with skipped pages parsing. Add i18n keys for de/en/es. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:09:29 +02:00
Marcel	292dc66f3c	feat(ocr): rewrite runSingleDocument to use streamBlocks with per-page progress Replace the single extractBlocks() call with streamBlocks() that processes pages incrementally. Each page's blocks are persisted immediately via createSingleBlock(). Progress updates use the ANALYZING_PAGE:current:total:blocks format. Per-page errors are logged at WARN level without failing the entire job. The batch path (processDocument) remains on the old extractBlocks() path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:07:06 +02:00
Marcel	6823973429	refactor(ocr): extract createSingleBlock from createTranscriptionBlocks Enable per-page block creation during streaming by extracting the loop body into a package-private createSingleBlock() method with an explicit sortOrder parameter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:04:02 +02:00
Marcel	93c3154b3c	feat(ocr): implement NDJSON streaming in RestClientOcrClient Add streamBlocks() that POSTs to /ocr/stream and parses the NDJSON response line by line with a dedicated ObjectMapper. Falls back to the old /ocr endpoint via the default method when /ocr/stream returns 404. Uses a separate HttpClient with 5-minute request timeout for streaming. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:03:12 +02:00
Marcel	641e91d5a3	feat(ocr): add default streamBlocks method to OcrClient interface The default method synthesizes Start/Page/Done events from extractBlocks() results, providing backward compatibility for implementations that don't support streaming natively. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:01:26 +02:00
Marcel	e21d01e10b	feat(ocr): add OcrStreamEvent sealed interface with Start/Page/Error/Done records Defines the event types for NDJSON streaming OCR. Uses Java 21 sealed interface with record subtypes for exhaustive pattern matching in the consumer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:00:02 +02:00
Marcel	97c6cf6a65	feat(ocr): add NDJSON streaming endpoint POST /ocr/stream Streams one JSON line per completed page instead of buffering the entire result. Emits start/page/error/done events. On per-page failure, logs the traceback but yields a generic error message and continues with the next page. Adds X-Accel-Buffering: no and Cache-Control: no-cache headers for reverse-proxy compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 09:57:57 +02:00
Marcel	b7d5f71ef7	refactor(ocr): extract extract_page_blocks() from both OCR engines Enable per-page processing by extracting the inner loop body of extract_blocks() into extract_page_blocks(image, page_idx, language). The original extract_blocks() now delegates to the new function, preserving backward compatibility for the batch path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 09:56:34 +02:00
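
Commits 051c43f088 and 3e34366702 describe replacing the blla call with a synthetic line spanning the crop, and moving its corner from (cw, ch) to (cw-1, ch-1) so that no point satisfies x >= im.width or y >= im.height. The following is a minimal sketch of that geometry only, not the repository's code. The helper names `synthetic_line` and `out_of_bounds` are hypothetical, plain coordinate lists stand in for Kraken's line object, and placing the baseline at mid-height is an assumption.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def synthetic_line(cw: int, ch: int) -> Tuple[List[Point], List[Point]]:
    """Return (baseline, boundary) covering a whole cw x ch crop.

    Hypothetical helper: the commits build a Kraken line object; here
    plain coordinate lists stand in for it.
    """
    if cw < 2 or ch < 2:
        raise ValueError("crop too small for a synthetic line")
    # Largest valid pixel coordinates; (cw, ch) itself lies outside the image.
    max_x, max_y = cw - 1, ch - 1
    # Assumption: baseline drawn at mid-height across the full width.
    baseline = [(0, ch // 2), (max_x, ch // 2)]
    boundary = [(0, 0), (max_x, 0), (max_x, max_y), (0, max_y)]
    return baseline, boundary


def out_of_bounds(points: List[Point], width: int, height: int) -> bool:
    """Mirror of the check described in the commit message:
    a point is rejected when x >= width or y >= height."""
    return any(x >= width or y >= height for x, y in points)


if __name__ == "__main__":
    baseline, boundary = synthetic_line(597, 89)  # crop size from the commit message
    print(out_of_bounds(baseline + boundary, 597, 89))
    print(out_of_bounds([(597, 89)], 597, 89))
```

With the crop size quoted in the commit message (597x89 px), every point of the synthetic line passes the check, whereas the original corner (597, 89) would be rejected.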

805 Commits
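
Commit 70689b8f7b validates PDF download URLs against an ALLOWED_PDF_HOSTS allowlist (default: minio,localhost,127.0.0.1). A host allowlist check of that kind might look roughly like the sketch below. It is an illustration using only the Python standard library, not the service's actual code. The function names `allowed_hosts` and `is_allowed_pdf_url` are hypothetical, and restricting the scheme to http/https is an added assumption. Disabling redirect following, which the commit also mentions, depends on the HTTP client in use and is not shown.

```python
import os
from typing import Mapping, Optional, Set
from urllib.parse import urlparse

# Default taken from the commit message.
DEFAULT_ALLOWED = "minio,localhost,127.0.0.1"


def allowed_hosts(env: Optional[Mapping[str, str]] = None) -> Set[str]:
    """Parse the comma-separated ALLOWED_PDF_HOSTS value into a set of hosts."""
    env = os.environ if env is None else env
    raw = env.get("ALLOWED_PDF_HOSTS", DEFAULT_ALLOWED)
    return {h.strip().lower() for h in raw.split(",") if h.strip()}


def is_allowed_pdf_url(url: str, hosts: Set[str]) -> bool:
    """Return True only if the URL's host is on the allowlist."""
    parsed = urlparse(url)
    # Assumption: only plain HTTP(S) downloads are expected.
    if parsed.scheme not in ("http", "https"):
        return False
    # .hostname drops userinfo and port, so "http://localhost@evil.example/"
    # resolves to "evil.example" and is rejected.
    host = parsed.hostname
    return host is not None and host.lower() in hosts


if __name__ == "__main__":
    hosts = allowed_hosts({})
    print(is_allowed_pdf_url("http://minio:9000/bucket/a.pdf", hosts))
    print(is_allowed_pdf_url("http://localhost@evil.example/a.pdf", hosts))
```

Comparing the parsed hostname rather than searching the raw URL string avoids being fooled by an allowlisted name placed in the userinfo or path part of the URL.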