feat(ocr): German spell-check post-processing to reduce handwriting gibberish #260
Reference in New Issue
Block a user
Delete Branch "feat/issue-254-german-spell-check"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #254
Summary
Adds a German spell-check post-processing step after OCR confidence filtering. For
HANDWRITING_KURRENTandHANDWRITING_LATINscript types, each token is looked up inpyspellchecker's German dictionary (augmented with a 153K-word DTA 19th-century historical wordlist). Plausible OCR errors are corrected and marked with[?](e.g.Hauus → Haus[?]); unresolvable gibberish that survived the confidence filter becomes[unleserlich]. Typewritten output is unaffected. A newocr-testsCI job runstest_spell_check.pyandtest_confidence.pywithout the full ML stack.🏗️ Markus Keller — Senior Application Architect
Verdict: ⚠️ Approved with concerns
Observations
1. Importing a private function across module boundaries (
spell_check.py:8)_collapse_adjacent_markerscarries an underscore prefix, signalling "internal toconfidence.py." Importing a_-prefixed name from another module contradicts that signal — future readers ofconfidence.pywon't know the function is part of the module's effective public API. This was a deliberate decision in the issue thread, and the function genuinely belongs inconfidence.py(it's about marker semantics). The fix is cheap: drop the underscore to make the contract explicit.This is the only blocker I'd flag. It's a naming inconsistency, not a correctness bug.
2.
de_historical.txtas a committed 1.5MB blob — acceptable for this project scale, but worth noting. Every update to the wordlist will add another 1.5MB delta to git history. If the wordlist is ever updated more than 2–3 times, consider usinggit lfsor regenerating the file externally and fetching it at container build time. Not a blocker now.3. Module structure is clean.
spell_check.pyhas a clear public API (load_spell_checker,correct_text), private helpers with underscore prefixes, and no reach into other modules except via the explicit import above. The_SPELL_CHECK_SCRIPT_TYPESconstant inmain.pyis well-placed.4. No ADR needed for this change — it extends an existing subsystem rather than introducing a new architectural boundary.
Recommendation
_collapse_adjacent_markers→collapse_adjacent_markersinconfidence.pyand update the import inspell_check.py. One line change, zero behavioral impact.👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
Blockers
_collapse_adjacent_markersimported across module boundary (spell_check.py:8)The underscore prefix is a Python convention for "private to this module." Importing it from
confidence.pyinspell_check.pyviolates that contract and will confuse future readers. Markus flagged the same issue from an architecture angle — the fix is identical:Suggestions
correct_textis doing three things — token classification, correction logic, AND collapse. At 50 lines it's approaching the extract-me signal. A private_check_token(token) -> strhelper that handles one token's decision tree would make the loop body a single-purpose orchestrator. Not a blocker; a refactor-phase suggestion.Threshold comment missing (
spell_check.py:110) —> 50is a non-obvious constant with a specific reason (avoids non-determinism when candidates tie at the floor). Per the codebase's comment policy, a one-liner here earns its place:TDD evidence — test file was committed before implementation, tests cover all 16 cases from the plan, and the
autousefixture prevents unloaded-checker surprises. Solid red/green discipline throughout. ✅🔐 Nora Steiner — Application Security Engineer
Verdict: ✅ Approved
What I checked
This change adds a pure in-process text transformation pipeline (no network calls, no file writes, no subprocess invocations). The threat surface is narrow.
Historical wordlist loading (
spell_check.py:29-37) — Path is constructed viaos.path.join(os.path.dirname(__file__), "dictionaries", "de_historical.txt"). This is a static path derived from the module location, not user-controlled input. No path traversal vector.Input processing —
correct_text()receives OCR output from trusted internal callers (after it has already been through Kraken/Surya). No external user input reaches this function directly. Even if malicious text were passed in, the function only splits on whitespace and replaces tokens — no eval, no subprocess, no template rendering.pyspellchecker==0.9.0supply chain — exact pin is good hygiene. The library is pure Python (no native extensions), reducing the binary supply chain risk surface.load_spell_checker()startup failure — the plan says "fail hard if it raises." Looking atmain.py, this runs inlifespanwhich will propagate exceptions and prevent the app from starting. Correct behavior — silent degradation would mean undetected data quality issues.Notes
The
_MIN_SPELL_CHECK_LEN = 4constant avoids processing very short tokens. This is good — short-word spell-check results are notoriously unstable and could in theory cause denial-of-service via high-cost correction lookups on pathological inputs. The exemption closes that avenue.Nothing security-relevant to block on.
🧪 Sara Holt — QA Engineer
Verdict: ⚠️ Approved with concerns
Blockers
None.
Suggestions
test_correctable_ocr_error_gets_correctedasserts an exact string (test_spell_check.py:58-59)This test pins the exact output of
pyspellchecker's correction algorithm for"Hauus". If a future library update changes the internal frequency weights or correction ranking, this test fails — not because the feature broke, but because the dictionary changed. Consider a softer assertion:This makes the intent explicit without coupling to a specific library version's dictionary state.
Missing main.py integration test — the spell-check is wired into three code paths in
main.py(block mode, stream mode, guided mode), but there are no tests verifying this wiring. I understand the ML stack makes this expensive; a minimal smoke test with theHANDWRITING_KURRENTscript type mocking out the OCR engine would give confidence that the routing logic is correct. Suggest creating a follow-up issue rather than blocking this PR.Positive notes:
autouse=Truefixture prevents the unloaded-checker edge case from slipping through ✅monkeypatch.setattrfor the not-loaded test is the right approach — no state leaks between tests ✅🛠 Tobias Wendt — DevOps & Platform Engineer
Verdict: ✅ Approved
What I checked
New
ocr-testsjob (.gitea/workflows/ci.yml) — exactly the spec from the issue. Two-step pip install + pytest invocation, correctworking-directory: ocr-service, runs bothtest_spell_check.pyandtest_confidence.pyin one command. Clean.No ML stack dependency — the job installs only
pyspellchecker==0.9.0andpytest. This keeps CI runtime to under 30 seconds for this job, no GPU runners needed. ✅Version pin consistency —
pip install "pyspellchecker==0.9.0"in CI matchesrequirements.txtexactly. No drift risk. ✅ubuntu-latest— this is the same runner used by all other jobs in this workflow. Consistent. If we ever want reproducibility across the workflow, pinning toubuntu-24.04would make sense, but that's a whole-workflow change, not something to block this PR on.Minor suggestion
The pip install command installs into the system Python on the runner, not a venv. That's fine for CI — runners are ephemeral. But if we ever add more jobs that need conflicting packages, a
python -m venv .venv && .venv/bin/pip install ...pattern would isolate them. Not needed now.de_historical.txtcommitted as a static artifact — 1.5MB static file in the repo is a pragmatic call. The alternative (downloading 243MB of DTA ZIPs on every CI run) would add 2+ minutes and an external dependency on the DTA server. Static artifact is the right choice for a family project. ✅🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
This PR is entirely within the OCR service Python backend — no frontend components, no Svelte files, no UI changes, no user-facing strings modified.
The markers
[unleserlich]and[?]that this feature produces are displayed in the transcription view, but that rendering was already in place before this PR. If a future PR exposes these markers more prominently in the UI (e.g. highlight[?]corrections in a different colour, offer inline correction acceptance), that would be the right moment to review the UX treatment — colour-alone cues, icon+text pairs, accessible labeling. Not in scope here.Nothing to flag.
Review feedback addressed
Three concerns from review cycle 1 have been resolved:
fea24ae—refactor(ocr): make collapse_adjacent_markers a public functionDrops the underscore prefix from
_collapse_adjacent_markersinconfidence.pyand updates the import inspell_check.py. Resolves the module-boundary concern raised by @Markus and @Felix.ec85f22—refactor(ocr): document > 50 frequency threshold rationaleAdds an inline comment explaining the strict
> 50threshold: it prevents non-deterministic corrections when multiple candidates tie at the frequency floor. Resolves @Felix's suggestion.c5e6ed9—test(ocr): decouple correction tests from exact library dictionary stateReplaces exact-string assertions in
test_correctable_ocr_error_gets_correctedandtest_sentence_with_multiple_correctionswith structural assertions (not illegible, marker present, expected stem). Resolves @Sara's suggestion.Deferred:
main.pyintegration test coverage for spell-check routing tracked as issue #262.All 37 tests green.
🏗️ Markus Keller — Senior Application Architect (Review Cycle 2)
Verdict: ✅ Approved
The cycle 1 blocker is resolved.
collapse_adjacent_markersis now a public function inconfidence.py(no underscore) and the import inspell_check.pycorrectly uses the public name. The module contract is now consistent: readers ofconfidence.pysee a public helper that is part of the effective API, and readers ofspell_check.pysee a normal cross-module import.Everything else from cycle 1 stands — module structure is clean,
main.pyintegration points are well-placed, and thede_historical.txtartifact decision is pragmatic and documented.Ready to merge.
👨💻 Felix Brandt — Senior Fullstack Developer (Review Cycle 2)
Verdict: ✅ Approved
All cycle 1 concerns addressed:
collapse_adjacent_markersis public — import inspell_check.py:8uses the bare name. ✅> 50 # strict > avoids non-determinism when candidates tie at the frequency floor— exactly the right one-liner. ✅test_correctable_ocr_error_gets_correctedandtest_sentence_with_multiple_correctionsnow assert behavior (marker present, stem matches, not illegible) without coupling to the library's exact dictionary state. ✅The
correct_textfunction at ~50 lines is still a refactor candidate (_check_tokenhelper would clean up the loop body), but that's a future-sprint note, not a merge blocker. The function is well-named, well-documented, and does nothing surprising.TDD discipline held through both cycles. Ship it.
🔐 Nora Steiner — Application Security Engineer (Review Cycle 2)
Verdict: ✅ Approved
No new security-relevant changes in this cycle. The three commits (public rename, threshold comment, test softening) are all cosmetic or test changes — no new code paths, no new network calls, no new file I/O.
The security posture from cycle 1 is unchanged and clean: pure in-process transformation, static wordlist path, exact pin on
pyspellchecker==0.9.0, fail-hard startup. Nothing to flag.🧪 Sara Holt — QA Engineer (Review Cycle 2)
Verdict: ✅ Approved
Both cycle 1 suggestions were addressed:
Test assertions decoupled —
test_correctable_ocr_error_gets_correctednow uses four structural assertions, andtest_sentence_with_multiple_correctionsuses token-indexed assertions. These tests verify behavior without binding to a specific version's dictionary weights. The intent of each assertion is self-documenting ("[?]" in result,result.startswith("Haus")). ✅Integration test deferred to issue #262 — the ticket is well-scoped: three parametrized tests (block / stream / guided), mock strategy spelled out, the three patch targets identified (
_download_and_convert_pdf,kraken_engine.extract_blocks,correct_text). That's actionable and properly separated from this PR. ✅37/37 passing. Ready to merge.
🛠 Tobias Wendt — DevOps & Platform Engineer (Review Cycle 2)
Verdict: ✅ Approved
No changes to CI or infrastructure files in this cycle. The
ocr-testsjob,requirements.txtpin, andde_historical.txtstatic artifact are all unchanged and were approved in cycle 1.Nothing new to flag. Good to go.
🎨 Leonie Voss — UX Designer & Accessibility Strategist (Review Cycle 2)
Verdict: ✅ Approved
Still entirely backend Python — no frontend, no Svelte, no UI strings. Nothing to flag. Same as cycle 1.