feat(ocr): German spell-check post-processing to reduce handwriting gibberish #260

Merged
marcel merged 10 commits from feat/issue-254-german-spell-check into main 2026-04-17 17:28:41 +02:00

10 Commits

Author SHA1 Message Date
Marcel
c5e6ed922b test(ocr): decouple correction tests from exact library dictionary state
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 3m35s
CI / OCR Service Tests (pull_request) Successful in 36s
CI / Backend Unit Tests (pull_request) Failing after 2m47s
CI / Unit & Component Tests (push) Failing after 2m33s
CI / OCR Service Tests (push) Successful in 34s
CI / Backend Unit Tests (push) Failing after 2m41s
Replace exact-string assertions in test_correctable_ocr_error_gets_corrected
and test_sentence_with_multiple_corrections with structural assertions that
verify behavior (correction attempted, marker present, expected stem) without
coupling to a specific pyspellchecker version's frequency weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 17:23:09 +02:00
Marcel
ec85f228c1 refactor(ocr): document > 50 frequency threshold rationale
Strict greater-than avoids non-determinism: if multiple candidates share
the minimum frequency value, pyspellchecker's ranking is undefined.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 17:21:37 +02:00
Marcel
fea24aee25 refactor(ocr): make collapse_adjacent_markers a public function
Drop underscore prefix — the helper is part of confidence.py's effective
public API since spell_check.py imports and calls it directly.

Fixes reviewer concern: importing a _-prefixed name across module boundaries
contradicts Python's private-by-convention signal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 17:20:31 +02:00
Marcel
68b57918eb ci: add ocr-tests job for spell_check and confidence unit tests
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m48s
CI / OCR Service Tests (push) Successful in 1m59s
CI / Backend Unit Tests (push) Failing after 2m53s
CI / Unit & Component Tests (pull_request) Failing after 2m52s
CI / OCR Service Tests (pull_request) Successful in 33s
CI / Backend Unit Tests (pull_request) Failing after 2m54s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:55:07 +02:00
Marcel
77100ab1e6 feat(ocr): integrate spell-check post-processing for handwriting script types
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:54:17 +02:00
Marcel
092131930c feat(ocr): add spell_check module with German spellchecker and historical wordlist
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:52:50 +02:00
Marcel
47f9a0bf73 test(ocr): add failing tests for spell_check module
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:51:38 +02:00
Marcel
30a6cbeb7f feat(ocr): add DTA-derived historical German wordlist and generation script
153K words from dtak+dtae 1800-1899 corpora (min_freq=20),
covering pre-reform spellings common in Kurrent/Süterlin documents.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:48:26 +02:00
Marcel
6faaa3b7d6 feat(ocr): add pyspellchecker dependency
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:41:24 +02:00
Marcel
77747aa556 refactor(ocr): extract _collapse_adjacent_markers helper and add CORRECTION_MARKER
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:40:39 +02:00