Force CPU-only training (--device cpu), cap OpenMP/BLAS thread pool at 2
(--threads 2), and reduce epochs from 50 to 10 (-N 10). 50 epochs on a
laptop OOM-killed the container. 10 epochs is sufficient for incremental
fine-tuning runs; more data is added over time and training re-run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DataLoader worker subprocesses crash inside Docker due to multiprocessing
fork restrictions. Pass --workers 0 to both ketos train and ketos segtrain
so data loading runs in the main process.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kraken.ketos has no .train or .segtrain attributes in Kraken 7 — both are
only exposed as CLI commands. Rewrites both training functions to invoke
`ketos train` / `ketos segtrain` via subprocess and parse the best
val_metric from checkpoint filenames.
Also fixes the OcrTrainingCard history so it only shows non-blla runs
(recognition model), matching SegmentationTrainingCard which already
filtered to blla-only.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After each training run, the Character Error Rate (CER = 1 - accuracy),
loss, accuracy, and epoch count are now stored on the OcrTrainingRun
record and shown in the training history table.
Also adds the missing POST /api/ocr/segtrain endpoint and the
triggerSegTraining service method so the segmentation training card
can actually trigger training.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously all MANUAL blocks counted as eligible training data, even ones
where text was filled in by guided OCR but never explicitly reviewed. This
caused segmentation and recognition counts to always match.
Now only reviewed=true blocks qualify for recognition training, so the
counts properly reflect: segments = all drawn annotation boxes,
checked text = only boxes where the user has verified the transcription.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The parent was manually remapping availableSegBlocks → availableBlocks
before passing props, which broke after the card was updated to read
availableSegBlocks directly. Pass the full trainingInfo object instead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both cards were reading the same availableBlocks field, so the segmentation
box always showed the kurrent recognition count. Use the correct
availableSegBlocks field from the training info response.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The findSegmentationBlocks query was filtering out blocks with non-empty
text. Segmentation training only needs annotation geometry (polygon/bbox),
not transcription text — so any MANUAL block on a KURRENT_SEGMENTATION
document should count, regardless of whether it has text.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Historical letter lines often intersect, so the system must support
overlapping annotation regions. Removed the overlap guard from
createAnnotation(), deleted ErrorCode.ANNOTATION_OVERLAP, and cleaned
up all tests and frontend error mappings that referenced it.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a user draws annotation boxes to mark OCR regions, the blocks are
created with source=MANUAL and empty text. upsertGuidedBlock was
protecting all MANUAL blocks unconditionally, so guided OCR silently
produced no output for these drawn-but-empty blocks.
Changed the guard to only protect non-empty MANUAL blocks — empty ones
are treated like OCR blocks and get their text filled in.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kraken's segmentation bounds check rejects coordinates where any point
satisfies x >= im.width or y >= im.height (strictly >=, not >). Using
(cw, ch) as the boundary corner was triggering this for every crop.
Changed to (cw-1, ch-1) so all coordinates are strictly inside the image.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
blla.segment() is a full-page layout detection model that kills the worker
process when called on tiny annotation crops (e.g. 597x89 px). For guided
OCR the annotation region IS already the text line, so segmentation is
unnecessary. Replace the blla call with a single synthetic BaselineLine that
spans the full crop width — rpred then runs recognition on the whole crop.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a document has manually drawn annotation boxes, the user can now
enable "Nur annotierte Bereiche" in the OCR trigger panel. The engine
skips layout detection entirely and runs recognition only within the
pre-drawn bounding boxes, preserving manual transcription blocks.
- Python: adds OcrRegion model, extend OcrRequest/OcrBlock; guided
branch in /ocr/stream groups by page and crops each region
- Engines: add extract_region_text() to both Kraken and Surya
- Java: adds OcrBlockResult.annotationId, OcrClient.OcrRegion,
TriggerOcrDTO.useExistingAnnotations; OcrAsyncRunner dispatches to
upsertGuidedBlock when annotationId is present; OcrService threads
the flag through to runSingleDocument
- TranscriptionService: adds upsertGuidedBlock (creates, updates OCR,
or preserves MANUAL blocks)
- Frontend: guided OCR toggle in OcrTrigger shown when blocks exist;
skips destructive-replace confirmation in guided mode
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add /segtrain endpoint to OCR service (ZIP upload, ketos.segtrain,
backup rotation, in-process model reload)
- Add segtrainModel() to OcrClient and RestClientOcrClient (10-min timeout,
X-Training-Token header)
- Add SegmentationTrainingExportService: PAGE XML export with polygon
de-normalization and per-page PNG rendering via PDFBox
- Add GET /api/ocr/segmentation-training-data/export endpoint
- Make TranscriptionBlock.text nullable for segmentation-only blocks
(V31 migration)
- Add Paraglide i18n translation keys for all training UI strings (de/en/es)
- Pass source prop from TranscriptionEditView to TranscriptionBlock
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Convert TrainingHistory, OcrTrainingCard, SegmentationTrainingCard, and
TranscriptionBlock "Nur Segmentierung" badge to use Paraglide message keys
- Add availableSegBlocks to TrainingInfoResponse to expose segmentation
block count in the training info endpoint
- Wire SegmentationTrainingCard into admin/system page below OCR training card
- Update api.ts with availableSegBlocks field
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- POST /train in ocr-service with ZIP Slip validation, TemporaryDirectory,
ketos transfer learning, timestamped backups (keep last 3), in-process reload
- X-Training-Token auth (no-op in dev when TRAINING_TOKEN env is empty)
- trainModel() in OcrClient interface + RestClientOcrClient (10-min timeout,
multipart upload, forwards X-Training-Token when configured)
- TRAINING_TOKEN env var wired in docker-compose; --workers 2 in Dockerfile
so /health stays responsive during synchronous training
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- TrainingDataExportService: PDFBox rendering at 300 DPI, crop by
annotation coordinates, ZIP with <uuid>.png + <uuid>.gt.txt pairs
- Skips documents with missing S3 files (logs WARN, continues)
- GET /api/ocr/training-data/export (ADMIN); 204 when no enrolled blocks
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Shows 'X / Y geprüft' with a brand-mint progress bar at the top of the
transcription panel. Derived from the blocks prop — no extra state.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BaselineOCRRecord has 'baseline' and 'boundary' attributes, not 'line'
and 'cuts'. The fallback used record.line which doesn't exist, causing
AttributeError on every Kurrent OCR page.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validates PDF download URLs against an ALLOWED_PDF_HOSTS allowlist
(default: minio,localhost,127.0.0.1) and disables redirect following
to prevent redirect-based SSRF.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changed ocr-service dependency from service_healthy to service_started
since the backend already handles OCR unavailability gracefully. Removed
unused APP_S3_INTERNAL_URL env var. Added expose directive and
.dockerignore for ocr-service.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changed OcrTrigger and ScriptTypeSelect from 'import * as m' to
'import { m }' to match the rest of the codebase. Increased
ScriptTypeSelect label to text-sm and annotation badge font to 12px
for better readability.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The retry button set status='running' but didn't re-trigger the $effect
because jobId hadn't changed. Added retryCount state so the effect
re-runs and creates a fresh EventSource on retry. Also added aria-label
to the progress bar for accessibility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Queue capacity of 100 is disproportionate for 2 worker threads — a
backed-up queue would represent hours of unprocessed OCR jobs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The resolveUserId() catch block was silently swallowing exceptions,
making auth failures invisible in logs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added @JsonIgnoreProperties(ignoreUnknown = true) to OcrBlockResult so
new fields from the Python OCR service don't crash the Java parser,
while keeping FAIL_ON_UNKNOWN_PROPERTIES strict globally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrAsyncRunner was bypassing TranscriptionService — building blocks
directly and calling blockRepository.save(), skipping sanitizeText()
and saveVersion(). Also replaced N individual deleteBlock() calls with
a single bulk deleteAllBlocksByDocument() for OCR re-runs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrService.startOcr() was setting scriptType on a detached entity,
silently losing the mutation. Added DocumentService.updateScriptType()
with @Transactional to persist the change properly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrController was injecting OcrJobRepository and OcrJobDocumentRepository
directly, violating the Controller → Service → Repository layering rule.
Moved getJob() and getDocumentOcrStatus() logic into OcrService.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The skipped-pages warning is inlined directly in +page.svelte.
The component and its tests are no longer needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The progress message already says "Seite 3 von 7 wird analysiert…"
so the separate "3 / 7" counter was redundant. Remove the
OcrProgressBar from the page and inline only the skipped-pages
warning.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The thin bar without a border looked broken at low progress values.
The text counter (e.g. "1 / 6") already communicates progress clearly
so the bar is unnecessary.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ANALYZING message appeared while the Python service was still
downloading the PDF and loading models. Remove it so the LOADING
message ("Lade Modell und Dokument…") stays visible until the first
ANALYZING_PAGE event arrives from the stream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a collapsible OCR trigger below the block list in edit mode.
Uses a <details> element so it's unobtrusive — the primary workflow
is editing existing blocks, but users can expand to re-run OCR with
a confirmation dialog that warns about replacing existing blocks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cover Surya polygon/word-level extraction, health endpoint states,
Kraken script-type routing, 503 when models not ready, 400 when
Kraken unavailable for Kurrent, and confidence marker application
during streaming. Production code coverage: 88%.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The PDF viewer uses 1-based currentPage (starting at 1) but the OCR
engines produced 0-based pageNumber from enumerate(). Annotations
created by OCR were assigned to page 0, which doesn't exist in the
viewer. Change enumerate() to start=1 in both engines and the
streaming endpoint.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace inline translateOcrProgress with the extracted module. Add
OcrProgressBar below the spinner during OCR. Parse page numbers from
ANALYZING_PAGE progress codes and feed them to the bar. On Done, fill
bar to 100% briefly before clearing the overlay.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Progress bar shows brand-mint fill on brand-sand background with
smooth transition. Displays page counter with tabular-nums and
skipped-pages warning in amber when applicable. Only renders when
totalPages > 0.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move translateOcrProgress from page.svelte to a testable module.
Return structured result with currentPage/totalPages/skippedPages
for the progress bar. Add ANALYZING_PAGE and DONE with skipped pages
parsing. Add i18n keys for de/en/es.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the single extractBlocks() call with streamBlocks() that
processes pages incrementally. Each page's blocks are persisted
immediately via createSingleBlock(). Progress updates use the
ANALYZING_PAGE:current:total:blocks format. Per-page errors are
logged at WARN level without failing the entire job. The batch path
(processDocument) remains on the old extractBlocks() path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enable per-page block creation during streaming by extracting the
loop body into a package-private createSingleBlock() method with an
explicit sortOrder parameter.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add streamBlocks() that POSTs to /ocr/stream and parses the NDJSON
response line by line with a dedicated ObjectMapper. Falls back to
the old /ocr endpoint via the default method when /ocr/stream returns
404. Uses a separate HttpClient with 5-minute request timeout for
streaming.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The default method synthesizes Start/Page/Done events from
extractBlocks() results, providing backward compatibility for
implementations that don't support streaming natively.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Defines the event types for NDJSON streaming OCR. Uses Java 21 sealed
interface with record subtypes for exhaustive pattern matching in the
consumer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Streams one JSON line per completed page instead of buffering the
entire result. Emits start/page/error/done events. On per-page
failure, logs the traceback but yields a generic error message and
continues with the next page. Adds X-Accel-Buffering: no and
Cache-Control: no-cache headers for reverse-proxy compatibility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enable per-page processing by extracting the inner loop body of
extract_blocks() into extract_page_blocks(image, page_idx, language).
The original extract_blocks() now delegates to the new function,
preserving backward compatibility for the batch path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>