Removed a duplicate import of org.mockito.ArgumentMatchers.eq from
DocumentControllerTest (lines 32 and 35). Added @ApiResponse(responseCode = "204")
to patchTrainingLabel so the generated OpenAPI spec matches the 204 No Content
response the controller actually returns.
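A minimal sketch of the annotated handler, assuming springdoc-openapi with Swagger v3 annotations; the mapping, path, and DTO name are illustrative (only patchTrainingLabel and the 204 come from this change):

```java
@ApiResponse(responseCode = "204", description = "Training label updated")
@PatchMapping("/api/documents/{id}/training-label") // illustrative path
public ResponseEntity<Void> patchTrainingLabel(@PathVariable UUID id,
                                               @RequestBody TrainingLabelDto body) {
    documentService.updateTrainingLabel(id, body); // hypothetical service call
    return ResponseEntity.noContent().build();     // 204 No Content, now matching the spec
}
```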
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OcrTrainingService.triggerTraining() and triggerSegTraining() held a DB
connection open for the entire ketos training run (potentially minutes),
risking connection pool exhaustion. Replaced the class-level @Transactional
with TransactionTemplate so DB writes stay narrow: the guard-and-create step
and the result-recording step each run in their own short transaction, and
the HTTP call to the OCR service runs between them with no open connection.
Also replaced blockRepository.findAll().size() with blockRepository.count()
in getTrainingInfo() to avoid loading every block into the heap on each poll.
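A sketch of the resulting shape, assuming an injected TransactionTemplate; repository, client, and helper names beyond those in this message are illustrative:

```java
public void triggerTraining() {
    // Short transaction 1: guard against a concurrent run, create the RUNNING row.
    OcrTrainingRun run = transactionTemplate.execute(status ->
            createRunIfNoneRunning()); // throws if a RUNNING row already exists

    // No DB connection held here; the HTTP call may take minutes.
    TrainResult result = ocrClient.trainModel(exportTrainingZip());

    // Short transaction 2: record the result.
    transactionTemplate.executeWithoutResult(status -> {
        run.recordResult(result);
        trainingRunRepository.save(run);
    });
}
```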
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
V23 introduced a JSONB check constraint (chk_annotation_polygon_quad)
requiring polygon arrays to have exactly 4 points. V30 introduced a
partial unique index preventing two concurrent RUNNING training runs.
These are DB-level invariants that unit tests cannot verify — five
Testcontainers tests now assert they are correctly applied by Flyway
and enforced by PostgreSQL.
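One of the five, sketched: it assumes a PostgreSQL Testcontainer with the Flyway migrations applied, a JdbcTemplate named jdbc, and illustrative column names; only the constraint name comes from V23:

```java
@Test
void rejectsPolygonWithThreePoints() {
    // 3 points instead of 4 must violate chk_annotation_polygon_quad
    assertThatThrownBy(() -> jdbc.update(
            "INSERT INTO annotations (id, document_id, polygon) VALUES (?, ?, ?::jsonb)",
            UUID.randomUUID(), existingDocumentId, "[[0,0],[1,0],[1,1]]"))
            .hasMessageContaining("chk_annotation_polygon_quad");
}
```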
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Training reloads the Kraken model in-process on the Python service.
The DB-level RUNNING constraint prevents concurrent API calls but
cannot protect against multi-replica deployments. Added explicit
warning comments in docker-compose.yml and OcrTrainingService so the
service is not accidentally scaled horizontally. See ADR-001.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A 100-page document at ~10 s/page takes ~17 min on CPU-only hardware,
which could cause the presigned URL to expire mid-job. A 1-hour expiry
gives ample headroom for any realistic document size in this archive.
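A sketch of the setting, assuming the AWS SDK v2 presigner (this message doesn't name the client); bucket and key are illustrative:

```java
GetObjectPresignRequest presign = GetObjectPresignRequest.builder()
        .signatureDuration(Duration.ofHours(1)) // worst case ~17 min, so 1 h is comfortable
        .getObjectRequest(req -> req.bucket(bucket).key(objectKey))
        .build();
URL url = presigner.presignGetObject(presign).url();
```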
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cascade-delete commit (5a5a8b6) added blockRepository.deleteByAnnotationId()
to AnnotationService.deleteAnnotation(), but the test class was not updated to
mock TranscriptionBlockRepository. Mockito injected null, causing deleteAnnotation_succeeds_whenOwner
to throw NPE. Adds the mock, verifies the cascade call, and adds an inOrder test
asserting the block is deleted before the annotation.
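The ordering assertion, roughly; annotationRepository and its deleteById call are assumptions, the cascade call is the one named above:

```java
@Test
void deleteAnnotation_deletesBlocksBeforeAnnotation() {
    service.deleteAnnotation(annotationId, owner);

    InOrder inOrder = inOrder(blockRepository, annotationRepository);
    inOrder.verify(blockRepository).deleteByAnnotationId(annotationId); // cascade first
    inOrder.verify(annotationRepository).deleteById(annotationId);      // assumed delete call
}
```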
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The DELETE endpoint was returning 500 due to a FK constraint violation.
`deleteAnnotation` now calls `blockRepository.deleteByAnnotationId()`
before removing the annotation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kraken 7 removed support for the legacy `path` format (image + .gt.txt
pairs) in VGSLRecognitionDataModule despite the CLI still advertising it.
Switch to the PAGE XML format (-f page), which is the supported standard.
- Java export now writes .xml alongside .png (PAGE XML with TextLine,
Baseline at 75% height, and Unicode transcription)
- XML special characters in transcription text are escaped (& < >);
  see the sketch after this list
- Python trainer globs *.xml and passes -f page to ketos train
- Regenerated frontend API types to include cer/loss/accuracy/epochs on
OcrTrainingRun (were missing, causing empty CER column in history)
- Updated and extended TrainingDataExportServiceTest
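The escaping, isolated as a sketch (the helper name is illustrative); order matters, & must be replaced first or the other two entities get double-escaped:

```java
static String escapeXml(String text) {
    return text.replace("&", "&amp;") // first, or "<" -> "&lt;" -> "&amp;lt;"
               .replace("<", "&lt;")
               .replace(">", "&gt;");
}
```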
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After each training run, the Character Error Rate (CER = 1 - accuracy),
loss, accuracy, and epoch count are now stored on the OcrTrainingRun
record and shown in the training history table.
Also adds the missing POST /api/ocr/segtrain endpoint and the
triggerSegTraining service method so the segmentation training card
can actually trigger training.
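The mapping onto the run record, sketched; setter and accessor names are assumptions, the CER relation is the one above:

```java
run.setAccuracy(result.accuracy());
run.setCer(1.0 - result.accuracy()); // CER = 1 - accuracy, e.g. 0.97 -> 0.03
run.setLoss(result.loss());
run.setEpochs(result.epochs());
trainingRunRepository.save(run);
```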
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously all MANUAL blocks counted as eligible training data, even ones
where text was filled in by guided OCR but never explicitly reviewed. This
caused segmentation and recognition counts to always match.
Now only reviewed=true blocks qualify for recognition training, so the
counts properly reflect: segments = all drawn annotation boxes,
checked text = only boxes where the user has verified the transcription.
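A sketch of the tightened recognition query; the source and reviewed fields are named above, the JPQL itself is an assumption:

```java
@Query("""
        select b from TranscriptionBlock b
        where b.source = :source and b.reviewed = true
        """)
List<TranscriptionBlock> findRecognitionTrainingBlocks(@Param("source") BlockSource source);

// usage: findRecognitionTrainingBlocks(BlockSource.MANUAL)
```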
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The findSegmentationBlocks query was filtering out blocks with non-empty
text. Segmentation training only needs annotation geometry (polygon/bbox),
not transcription text — so any MANUAL block on a KURRENT_SEGMENTATION
document should count, regardless of whether it has text.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Historical letter lines often intersect, so the system must support
overlapping annotation regions. Removed the overlap guard from
createAnnotation(), deleted ErrorCode.ANNOTATION_OVERLAP, and cleaned
up all tests and frontend error mappings that referenced it.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a user draws annotation boxes to mark OCR regions, the blocks are
created with source=MANUAL and empty text. upsertGuidedBlock was
protecting all MANUAL blocks unconditionally, so guided OCR silently
produced no output for these drawn-but-empty blocks.
Changed the guard to only protect non-empty MANUAL blocks — empty ones
are treated like OCR blocks and get their text filled in.
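The relaxed guard, sketched; accessor names are assumptions:

```java
boolean protectedManual = existing.getSource() == BlockSource.MANUAL
        && existing.getText() != null
        && !existing.getText().isBlank();
if (protectedManual) {
    return existing;       // user-typed text stays untouched
}
existing.setText(ocrText); // drawn-but-empty MANUAL block: fill like an OCR block
```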
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a document has manually drawn annotation boxes, the user can now
enable "Nur annotierte Bereiche" in the OCR trigger panel. The engine
skips layout detection entirely and runs recognition only within the
pre-drawn bounding boxes, preserving manual transcription blocks.
- Python: adds OcrRegion model, extends OcrRequest/OcrBlock; the guided
  branch in /ocr/stream groups regions by page and crops each region
- Engines: add extract_region_text() to both Kraken and Surya
- Java: adds OcrBlockResult.annotationId, OcrClient.OcrRegion,
  TriggerOcrDTO.useExistingAnnotations; OcrAsyncRunner dispatches to
  upsertGuidedBlock when annotationId is present (sketched after this
  list); OcrService threads the flag through to runSingleDocument
- TranscriptionService: adds upsertGuidedBlock (creates, updates OCR,
or preserves MANUAL blocks)
- Frontend: guided OCR toggle in OcrTrigger shown when blocks exist;
skips destructive-replace confirmation in guided mode
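The dispatch, sketched; method shapes beyond the names in this message are assumptions:

```java
for (OcrBlockResult block : pageBlocks) {
    if (block.annotationId() != null) {
        // Guided mode: fill the pre-drawn box instead of creating a new block.
        transcriptionService.upsertGuidedBlock(document, block);
    } else {
        transcriptionService.createSingleBlock(document, block, sortOrder++);
    }
}
```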
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add /segtrain endpoint to OCR service (ZIP upload, ketos.segtrain,
backup rotation, in-process model reload)
- Add segtrainModel() to OcrClient and RestClientOcrClient (10-min timeout,
X-Training-Token header)
- Add SegmentationTrainingExportService: PAGE XML export with polygon
de-normalization and per-page PNG rendering via PDFBox
- Add GET /api/ocr/segmentation-training-data/export endpoint
- Make TranscriptionBlock.text nullable for segmentation-only blocks
(V31 migration)
- Add Paraglide i18n translation keys for all training UI strings (de/en/es)
- Pass source prop from TranscriptionEditView to TranscriptionBlock
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Convert TrainingHistory, OcrTrainingCard, SegmentationTrainingCard, and
TranscriptionBlock "Nur Segmentierung" badge to use Paraglide message keys
- Add availableSegBlocks to TrainingInfoResponse to expose segmentation
block count in the training info endpoint
- Wire SegmentationTrainingCard into admin/system page below OCR training card
- Update api.ts with availableSegBlocks field
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- POST /train in ocr-service with ZIP Slip validation, TemporaryDirectory,
ketos transfer learning, timestamped backups (keep last 3), in-process reload
- X-Training-Token auth (no-op in dev when TRAINING_TOKEN env is empty)
- trainModel() in OcrClient interface + RestClientOcrClient (10-min timeout,
  multipart upload, forwards X-Training-Token when configured; see the
  sketch after this list)
- TRAINING_TOKEN env var wired in docker-compose; --workers 2 in Dockerfile
so /health stays responsive during synchronous training
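The conditional header, sketched with Spring's RestClient; the multipart body and token wiring are illustrative:

```java
var spec = restClient.post()
        .uri("/train")
        .contentType(MediaType.MULTIPART_FORM_DATA)
        .body(multipartBody);
if (trainingToken != null && !trainingToken.isBlank()) {
    spec = spec.header("X-Training-Token", trainingToken); // omitted in dev when unset
}
return spec.retrieve().body(TrainResult.class);
```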
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- TrainingDataExportService: PDFBox rendering at 300 DPI, crop by
  annotation coordinates, ZIP with <uuid>.png + <uuid>.gt.txt pairs
  (rendering sketched after this list)
- Skips documents with missing S3 files (logs WARN, continues)
- GET /api/ocr/training-data/export (ADMIN); 204 when no enrolled blocks
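The rendering core, sketched with PDFBox 2.x calls; the crop arithmetic is illustrative:

```java
try (PDDocument pdf = PDDocument.load(pdfBytes)) {
    PDFRenderer renderer = new PDFRenderer(pdf);
    BufferedImage page = renderer.renderImageWithDPI(pageIndex, 300); // export DPI
    // De-normalize annotation coordinates to the 300-DPI raster, then crop.
    BufferedImage crop = page.getSubimage(x, y, width, height);
    ImageIO.write(crop, "png", zipEntryStream);
}
```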
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Queue capacity of 100 is disproportionate for 2 worker threads — a
backed-up queue would represent hours of unprocessed OCR jobs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The resolveUserId() catch block was silently swallowing exceptions,
making auth failures invisible in logs.
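The catch now logs before degrading, roughly (the handling around it is an assumption):

```java
} catch (Exception e) {
    log.warn("resolveUserId failed, treating request as unauthenticated", e); // was swallowed
    return null;
}
```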
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added @JsonIgnoreProperties(ignoreUnknown = true) to OcrBlockResult so
new fields from the Python OCR service don't crash the Java parser,
while keeping FAIL_ON_UNKNOWN_PROPERTIES strict globally.
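The shape of the local opt-out; the record's fields aren't listed here, so these are illustrative:

```java
@JsonIgnoreProperties(ignoreUnknown = true) // tolerate new fields from the Python service
public record OcrBlockResult(String text, int page, UUID annotationId) { }
```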
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrAsyncRunner was bypassing TranscriptionService — building blocks
directly and calling blockRepository.save(), skipping sanitizeText()
and saveVersion(). Also replaced N individual deleteBlock() calls with
a single bulk deleteAllBlocksByDocument() for OCR re-runs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrService.startOcr() was setting scriptType on a detached entity,
silently losing the mutation. Added DocumentService.updateScriptType()
with @Transactional to persist the change properly.
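The fix, roughly; repository and exception names are assumptions:

```java
@Transactional
public void updateScriptType(UUID documentId, ScriptType scriptType) {
    Document document = documentRepository.findById(documentId)
            .orElseThrow(() -> new DocumentNotFoundException(documentId));
    // Managed entity inside the transaction: the mutation is dirty-checked
    // and flushed at commit instead of being lost on a detached copy.
    document.setScriptType(scriptType);
}
```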
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrController was injecting OcrJobRepository and OcrJobDocumentRepository
directly, violating the Controller → Service → Repository layering rule.
Moved getJob() and getDocumentOcrStatus() logic into OcrService.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ANALYZING message appeared while the Python service was still
downloading the PDF and loading models. Remove it so the LOADING
message ("Lade Modell und Dokument…", i.e. "Loading model and document…")
stays visible until the first ANALYZING_PAGE event arrives from the stream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the single extractBlocks() call with streamBlocks() that
processes pages incrementally. Each page's blocks are persisted
immediately via createSingleBlock(). Progress updates use the
ANALYZING_PAGE:current:total:blocks format. Per-page errors are
logged at WARN level without failing the entire job. The batch path
(processDocument) remains on the old extractBlocks() path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enable per-page block creation during streaming by extracting the
loop body into a package-private createSingleBlock() method with an
explicit sortOrder parameter.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add streamBlocks() that POSTs to /ocr/stream and parses the NDJSON
response line by line with a dedicated ObjectMapper. Falls back to
the old /ocr endpoint via the default method when /ocr/stream returns
404. Uses a separate HttpClient with 5-minute request timeout for
streaming.
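The consumption loop, sketched inside a method that declares the checked exceptions; the event type and the fallback follow the neighboring commits, the rest is illustrative:

```java
HttpResponse<Stream<String>> response = streamingHttpClient.send(
        request, HttpResponse.BodyHandlers.ofLines());
if (response.statusCode() == 404) {
    OcrClient.super.streamBlocks(document, consumer); // default method -> old /ocr endpoint
    return;
}
try (Stream<String> lines = response.body()) {
    for (String line : (Iterable<String>) lines::iterator) {
        if (!line.isBlank()) {
            consumer.accept(ndjsonMapper.readValue(line, OcrStreamEvent.class));
        }
    }
}
```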
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The default method synthesizes Start/Page/Done events from
extractBlocks() results, providing backward compatibility for
implementations that don't support streaming natively.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Defines the event types for NDJSON streaming OCR. Uses Java 21 sealed
interface with record subtypes for exhaustive pattern matching in the
consumer.
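The hierarchy, sketched; the three event names come from the default-method commit above, the payload fields are assumptions:

```java
public sealed interface OcrStreamEvent {
    record Start(int totalPages) implements OcrStreamEvent {}
    record Page(int page, int totalPages, List<OcrBlockResult> blocks) implements OcrStreamEvent {}
    record Done(int totalBlocks) implements OcrStreamEvent {}
}

// Exhaustive pattern matching in the consumer, no default branch needed:
switch (event) {
    case OcrStreamEvent.Start s -> markLoading();
    case OcrStreamEvent.Page p  -> persistPageBlocks(p);
    case OcrStreamEvent.Done d  -> finishJob(d.totalBlocks());
}
```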
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend sends progress codes (PREPARING, LOADING, ANALYZING,
CREATING_BLOCKS:N, DONE:N, ERROR) via OcrJob.progressMessage.
Frontend translates them via Paraglide (de/en/es) and displays
below the spinner.
- V27 migration: adds progress_message column to ocr_jobs
- OcrAsyncRunner updates progress at each phase
- Poll interval reduced to 2s for snappier updates
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Single-document OCR now creates an OcrJobDocument row so
GET /api/documents/{id}/ocr-status can find running jobs.
OcrAsyncRunner updates the job document status (RUNNING → DONE/FAILED).
Frontend checks OCR status when entering transcription mode —
if a job is running, resumes polling and shows the spinner.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JDK HttpClient defaults to HTTP/2 with upgrade negotiation. Uvicorn
rejects the upgrade ('Unsupported upgrade request'), causing the
request body to be lost and a 422 'Field required' from FastAPI.
Force HTTP/1.1 since the OCR service is internal and doesn't need h2.
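In builder form (HttpClient.Version is the standard JDK API; the connect timeout shown is illustrative):

```java
HttpClient client = HttpClient.newBuilder()
        .version(HttpClient.Version.HTTP_1_1) // skip the h2c upgrade Uvicorn rejects
        .connectTimeout(Duration.ofSeconds(10))
        .build();
```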
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CallerRunsPolicy would cause the HTTP request to hang for minutes
if the queue is full. AbortPolicy with queue=100 is safe — the queue
will never realistically fill for a family archive. If it somehow
does, a clear error is better than a silent multi-minute hang.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Better to wait than to error. Queue capacity 100 holds plenty of
OCR jobs. CallerRunsPolicy means if the queue is somehow full,
the request blocks instead of getting rejected with an exception.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The old pool (1 thread, queue=1) meant OCR blocked all other async
tasks (imports). Now 2 concurrent async tasks with a queue of 10
— enough for OCR + import to run in parallel.
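The pool, sketched as a Spring ThreadPoolTaskExecutor bean; the wiring is illustrative, the sizes are this commit's (the rejection policy is what the commits above revisit):

```java
@Bean
public ThreadPoolTaskExecutor taskExecutor() {
    var executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(2);  // OCR + import can run in parallel
    executor.setMaxPoolSize(2);
    executor.setQueueCapacity(10);
    return executor;
}
```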
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Default RestClient timeout was 10 seconds — OCR on CPU takes minutes.
Set connect timeout to 10s, read timeout to 10 minutes.
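The wiring, sketched with Spring Boot's ClientHttpRequestFactorySettings; the base URL is illustrative:

```java
var settings = ClientHttpRequestFactorySettings.DEFAULTS
        .withConnectTimeout(Duration.ofSeconds(10))
        .withReadTimeout(Duration.ofMinutes(10)); // CPU OCR takes minutes
RestClient ocrRestClient = RestClient.builder()
        .requestFactory(ClientHttpRequestFactories.get(settings))
        .baseUrl("http://ocr-service:8000")
        .build();
```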
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>