Commit Graph

1057 Commits

Author SHA1 Message Date
Marcel
38a9719bdb fix(frontend): QUEUED badge test, touch target on dismiss button, focus ring on expand toggle
Add missing test coverage for the amber QUEUED status badge in TrainingHistory.
Fix WCAG 2.2 minimum touch target (24 × 24 px) on the success-message dismiss
button in OcrTrainingCard. Add focus-visible ring to the expand/collapse toggle
in TrainingHistory so keyboard users get a visible focus indicator.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
699d5e5759 refactor(ocr): mark _SenderModelRegistry.contains as private (_contains)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
afe84a6af7 fix(ocr): add partial unique index and align SenderModelServiceTest with suite style
Add V42 partial unique index on ocr_training_runs(person_id) WHERE status='QUEUED'
to enforce the per-person queued coalescing guarantee at the DB level. Also adds
@ExtendWith(MockitoExtension.class) to SenderModelServiceTest for consistency with
the rest of the service test suite, with lenient() on the shared txTemplate stub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
3ee4424556 perf(ocr): resolve person names in single batch query in getTrainingInfo
Replace the per-run getById loop with a single getAllById call on distinct
person IDs, eliminating the N+1 query when training history contains multiple
sender model runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
b23118268b refactor(ocr): return TrainingInfoResponse directly from getTrainingInfo endpoint
Remove the intermediate Map<String,Object> and return the typed record directly
so OpenAPI codegen produces a concrete TypeScript type. Fixes lastRun serializing
as {} (empty object) instead of null when no training run exists.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
4ddb095cb1 test(ocr): add /train-sender auth tests and run sender registry tests in CI
Add 503/403 auth tests for the /train-sender endpoint, matching the pattern
already used for /train and /segtrain. Also surface test_sender_registry.py
in CI (it needs no ML stack) and add pytest-asyncio to the install step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
af49bf5e7a docs(ocr): document tail-recursive queue drain design in promoteNextQueuedRun
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
3bbc64cfc6 refactor(ocr): rename _contains to contains in SenderModelRegistry
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
8860f17129 fix(frontend): show person name inline in mobile status cell in TrainingHistory
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
5e4f031537 fix(frontend): show error on training start failure, add aria-live and dismiss to success message
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
8acae8ea4d refactor(frontend): extract shared TrainingRun type to $lib/types/training.ts
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
7a500644a9 test(ocr): add failure path and DONE status assertions to SenderModelServiceTest
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
3b3f960a30 refactor(ocr): extract exportSenderData helper in triggerSenderTraining
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
8e844dd16e style(ocr): add Image type hints to extract_page_blocks and extract_region_text
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
12c4d433ba chore(ocr): lower OCR_MAX_CACHED_MODELS to 2 with memory budget comment
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
16787f2771 test(ocr): verify load failure does not cache broken entry in SenderModelRegistry
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
c3939e0f13 refactor(ocr): move person-name enrichment from OcrController into OcrTrainingService
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
4f86011ffb test(ocr): verify triggerSenderTraining upserts SenderModel with correct path and cer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
3ecda655c5 fix(ocr): eliminate race window in runOrQueueSenderTraining by creating RUNNING row atomically
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
68ec66002a fix(ocr): correct trainSenderModel URI from /train to /train-sender
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
a296ad527e feat(ocr): wire SenderModelService into OcrAsyncRunner; stage missing foundational files
OcrAsyncRunner now passes the per-sender model path to streamBlocks for
HANDWRITING_KURRENT documents. processDocument replaced extractBlocks
with streamBlocks + AtomicReference, removing the unchecked raw-array
pattern.

Also stages all previously uncommitted foundational files for this
feature: SenderModel entity, SenderModelRepository, Flyway migrations
V40/V41, updated OcrClient/RestClientOcrClient streaming API,
TrainingDataExportService.exportForSender, TranscriptionService Kurrent
hook, application.yaml OCR config, and frontend i18n/test additions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
a8bd2606a0 refactor(ocr): move sender training methods from OcrTrainingService to SenderModelService
Eliminates cross-domain repository access: OcrTrainingService no longer
holds SenderModelRepository. SenderModelService now owns the full sender
training lifecycle (runOrQueueSenderTraining, triggerSenderTraining,
promoteNextQueuedRun), removing the circular dependency risk.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
607a3567e6 refactor(ocr): delete buildTrainingInfoMap() dead code
The controller now builds the map inline (with personNames support).
This method had zero callers.

Fixes reviewer concerns from @felixbrandt and @mkeller.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
7cc90b8a90 fix(ocr): log debug instead of silently swallowing person name resolution errors
Replaces catch(Exception ignored){} with log.debug() in getTrainingInfo().
Adds controller test documenting the graceful degradation behavior
(response stays 200 when personService.getById() throws).

Fixes reviewer concerns from @felixbrandt and @nullx.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
cd31bf63c1 feat(frontend): wire personNames to TrainingHistory in OcrTrainingCard
Extends Run interface with personId and QUEUED status, TrainingInfo with
personNames map, and passes it through to TrainingHistory for per-sender
model column display.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
add799c57f chore: regenerate API types for per-sender model additions
OcrTrainingRun now includes personId (uuid, optional) and QUEUED status.
TrainingInfoResponse includes runs array with personId fields.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
a146a2ec3c feat(ocr): per-sender model registry and /train-sender endpoint
engines/kraken.py:
- Add _SenderModelRegistry with LRU eviction (max configurable via
  OCR_MAX_CACHED_MODELS env var), double-checked locking, invalidate(),
  and path whitelist (/app/models/ only)
- Add _load_sender_model() helper for testability
- extract_page_blocks() and extract_region_text() accept optional
  sender_model_path; route to sender registry when provided

models.py:
- OcrRequest gains senderModelPath: str | None = None field

main.py:
- /ocr and /ocr/stream pass request.senderModelPath to Kraken engine
- New /train-sender endpoint: validates output_model_path, runs ketos
  train with base model as starting point, invalidates sender cache

docker-compose.yml:
- Add OCR_MAX_CACHED_MODELS: "5" to ocr-service environment

test_sender_registry.py:
- 4 tests: cache hit, LRU eviction, invalidate, path traversal guard

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
548ad0fa68 test: add unit tests for SenderModelService, runOrQueueSenderTraining, and updateBlock hook
- SenderModelServiceTest: 6 tests covering activation threshold (99/100),
  retrain delta (149/150), runNow flag (queued vs triggered)
- OcrTrainingServiceTest: 3 tests for runOrQueueSenderTraining — idle returns
  true, running saves QUEUED, duplicate QUEUED coalesces
- TranscriptionServiceTest: 3 tests for updateBlock — sets source=MANUAL,
  triggers training for HANDWRITING_KURRENT with sender, skips when no sender

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
e3c8e1a067 test: fix broken tests after per-sender model integration
- OcrAsyncRunnerTest: switch from extractBlocks/4-arg streamBlocks stubs
  to 5-arg streamBlocks (senderModelPath param) via doAnswer
- TranscriptionServiceTest: stub documentService.getDocumentById in
  updateBlock tests so the new Kurrent training hook does not NPE
- OcrControllerTest: add @MockitoBean PersonService (now injected into
  OcrController for personNames assembly in getTrainingInfo)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:30:54 +02:00
Marcel
78eca8e9a1 docs(ocr): add Admin OCR overview & model-detail UI spec
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m36s
CI / OCR Service Tests (push) Successful in 27s
CI / Backend Unit Tests (push) Failing after 1m22s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 19:11:07 +02:00
Marcel
c5e6ed922b test(ocr): decouple correction tests from exact library dictionary state
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 3m35s
CI / OCR Service Tests (pull_request) Successful in 36s
CI / Backend Unit Tests (pull_request) Failing after 2m47s
CI / Unit & Component Tests (push) Failing after 2m33s
CI / OCR Service Tests (push) Successful in 34s
CI / Backend Unit Tests (push) Failing after 2m41s
Replace exact-string assertions in test_correctable_ocr_error_gets_corrected
and test_sentence_with_multiple_corrections with structural assertions that
verify behavior (correction attempted, marker present, expected stem) without
coupling to a specific pyspellchecker version's frequency weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 17:23:09 +02:00
Marcel
ec85f228c1 refactor(ocr): document > 50 frequency threshold rationale
Strict greater-than avoids non-determinism: if multiple candidates share
the minimum frequency value, pyspellchecker's ranking is undefined.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 17:21:37 +02:00
Marcel
fea24aee25 refactor(ocr): make collapse_adjacent_markers a public function
Drop underscore prefix — the helper is part of confidence.py's effective
public API since spell_check.py imports and calls it directly.

Fixes reviewer concern: importing a _-prefixed name across module boundaries
contradicts Python's private-by-convention signal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 17:20:31 +02:00
Marcel
68b57918eb ci: add ocr-tests job for spell_check and confidence unit tests
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m48s
CI / OCR Service Tests (push) Successful in 1m59s
CI / Backend Unit Tests (push) Failing after 2m53s
CI / Unit & Component Tests (pull_request) Failing after 2m52s
CI / OCR Service Tests (pull_request) Successful in 33s
CI / Backend Unit Tests (pull_request) Failing after 2m54s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:55:07 +02:00
Marcel
77100ab1e6 feat(ocr): integrate spell-check post-processing for handwriting script types
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:54:17 +02:00
Marcel
092131930c feat(ocr): add spell_check module with German spellchecker and historical wordlist
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:52:50 +02:00
Marcel
47f9a0bf73 test(ocr): add failing tests for spell_check module
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:51:38 +02:00
Marcel
30a6cbeb7f feat(ocr): add DTA-derived historical German wordlist and generation script
153K words from dtak+dtae 1800-1899 corpora (min_freq=20),
covering pre-reform spellings common in Kurrent/Süterlin documents.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:48:26 +02:00
Marcel
6faaa3b7d6 feat(ocr): add pyspellchecker dependency
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:41:24 +02:00
Marcel
77747aa556 refactor(ocr): extract _collapse_adjacent_markers helper and add CORRECTION_MARKER
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:40:39 +02:00
Marcel
9a64c0698f fix(pdf): make isLoaded reactive so nav buttons are enabled after load
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m50s
CI / Backend Unit Tests (push) Failing after 2m47s
pdfDoc was a plain variable (not \$state), so renderer.isLoaded had no
reactive dependencies in Svelte 5. PdfControls received isLoaded=false
permanently, keeping the next-page button disabled while zoom buttons
(which have no disabled attribute) still worked.

Fix: derive isLoaded from totalPages (\$state) via totalPages > 0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 15:55:42 +02:00
Marcel
4cb7c975f5 test(ocr): add resilience tests for tiny image and unexpected exception propagation
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 2m27s
CI / Backend Unit Tests (pull_request) Failing after 2m37s
CI / Unit & Component Tests (push) Failing after 3m14s
CI / Backend Unit Tests (push) Has been cancelled
Add test for 1×1 image (sub-tile-size) resilience and narrow preprocess_page
fallback from except Exception to (cv2.error, ValueError, MemoryError) so
programming errors propagate instead of being silently swallowed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 15:16:17 +02:00
Marcel
97c94c91f8 test(ocr): guard translateOcrProgress fallback for PREPROCESSING_PAGE with missing colon parts
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 15:13:52 +02:00
Marcel
eaefd4091e feat(ocr): add PREPROCESSING_PAGE progress translation and i18n strings
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m34s
CI / Backend Unit Tests (push) Failing after 2m57s
CI / Unit & Component Tests (pull_request) Failing after 2m36s
CI / Backend Unit Tests (pull_request) Failing after 2m43s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:27:42 +02:00
Marcel
ba36a88b65 feat(ocr): add Preprocessing NDJSON event to Java stream pipeline
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:21:00 +02:00
Marcel
b310caaeeb feat(ocr): integrate preprocessing into stream and batch endpoints
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:16:47 +02:00
Marcel
615d404ba9 chore(ocr): add opencv-python-headless, libglib2.0-0, and CLAHE env vars
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:14:47 +02:00
Marcel
7183fc4428 feat(ocr): add image preprocessing module with CLAHE + grayscale + blur
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:13:42 +02:00
Marcel
bf010a23c3 docs(tag-input): add clarifying comments for non-obvious design decisions
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 2m26s
CI / Backend Unit Tests (pull_request) Failing after 2m44s
CI / Unit & Component Tests (push) Failing after 2m28s
CI / Backend Unit Tests (push) Failing after 25s
- SvelteMap satisfies svelte/prefer-svelte-reactivity; $derived.by() handles reactivity
- ‹›› prefix only on depth=0 context ancestors; indentation serves deeper nodes
- fetchedForQuery set after suggestions causes harmless double $derived evaluation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 12:17:22 +02:00
Marcel
b01761d800 test(tag-input): add regression guard for allowCreation=false + Enter on suggestion
Confirms that Enter on a suggestion item adds the tag even when allowCreation is
false — the activeIndex guard in handleKeydown runs before the allowCreation check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 12:14:35 +02:00