feat(ocr): German spell-check post-processing to reduce handwriting gibberish #254
**Goal:** Add a German spell-check post-processing step after OCR, so that high-confidence gibberish from Kraken (Kurrent/Sütterlin) and Surya (Latin handwriting) is corrected where possible and replaced with `[unleserlich]` when no plausible correction exists.

**Architecture:** A new `spell_check.py` module wraps `pyspellchecker` with a German frequency dictionary plus a supplementary `dictionaries/de_historical.txt` file for pre-reform spellings (Thür, Rath, Wirth, etc.). The module exposes `load_spell_checker()` (called at startup) and `correct_text(text) -> str` (called after `apply_confidence_markers()` in block/stream mode and after `extract_region_text()` in guided mode). Spell checking only runs for `HANDWRITING_KURRENT` and `HANDWRITING_LATIN` — typewritten output is already accurate.

**Tech Stack:** `pyspellchecker>=0.8.0` (pure Python, no system deps), existing FastAPI/Kraken/Surya OCR stack.

**File Map**

- `ocr-service/requirements.txt` — add `pyspellchecker`
- `ocr-service/dictionaries/de_historical.txt` — new
- `ocr-service/spell_check.py` — new: `load_spell_checker`, `correct_text`
- `ocr-service/test_spell_check.py` — new: tests for `spell_check.py`
- `ocr-service/main.py` — integration

**Task 1: Add `pyspellchecker` and create historical wordlist**

Files:
- Modify: `ocr-service/requirements.txt`
- Create: `ocr-service/dictionaries/de_historical.txt`

**Step 1: Add dependency**

In `ocr-service/requirements.txt`, add `pyspellchecker>=0.8.0` after the `httpx` line.

**Step 2: Create the historical wordlist directory and file**

Create `ocr-service/dictionaries/de_historical.txt` with one word per line; lines starting with `#` are comments.

**Step 3: Commit**
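The wordlist contents were not captured in this issue; a plausible shape for the file, seeded with the pre-reform spellings named above (Thal and Theil added as further illustrative pre-reform forms):

```text
# Pre-reform German spellings missing from the modern dictionary.
# One word per line; add entries as users report false [unleserlich] markers.
Thür
Rath
Wirth
Thal
Theil
```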
**Task 2: Create `spell_check.py` (TDD)**

Files:

- Create: `ocr-service/test_spell_check.py`
- Create: `ocr-service/spell_check.py`

**Step 1: Write failing tests**

Create `ocr-service/test_spell_check.py` and run it to verify all tests fail. Expected: `ModuleNotFoundError: No module named 'spell_check'`.

**Step 2: Write minimal implementation**
Create `ocr-service/spell_check.py`.

Install the new dependency in the venv, then run the tests to verify they pass. Expected: all green. If `test_correctable_ocr_error_gets_corrected` is flaky, confirm that `pyspellchecker` ≥ 0.8.0 is installed.

Commit.
**Task 3: Integrate into `main.py`**

Files:

- Modify: `ocr-service/main.py`

The integration has three sub-locations: startup, block/stream mode, guided mode.

**Step 1: Write failing integration test (manual smoke test)**

Confirm the service starts without error after the changes by running `python -c "from main import app"`. This will fail until Step 2 is done; that's expected.
**Step 2: Add import and startup call**

At the top of `ocr-service/main.py`, add the import alongside the existing `confidence` import. Add a module-level constant just below the `ALLOWED_PDF_HOSTS` block (around line 40). In the `lifespan` async context manager, add the spell checker load after `kraken_engine.load_models()`.

**Step 3: Apply spell correction in block mode (`POST /ocr`)**

In `run_ocr`, find the block post-processing loop (around line 103) and add the spell-correction call there, guarded by the handwriting script types.
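The exact before/after diff was not captured; the shape of the hook might look like this sketch (`postprocess_blocks` is a hypothetical stand-in for the loop body in `run_ocr`, and `correct_text` is stubbed so the sketch runs standalone — the constant name matches the one used in the final implementation):

```python
# Guard constant: spell correction only runs for handwriting output.
_SPELL_CHECK_SCRIPT_TYPES = {"HANDWRITING_KURRENT", "HANDWRITING_LATIN"}


def correct_text(text: str) -> str:
    """Stand-in for spell_check.correct_text."""
    return text.replace("Hauus", "Haus")


def postprocess_blocks(blocks: list[dict], script_type: str) -> list[dict]:
    """Hypothetical stand-in for the block post-processing loop in run_ocr."""
    for block in blocks:
        # Confidence markers are assumed to have been applied already;
        # spell correction is skipped entirely for typewritten text.
        if script_type in _SPELL_CHECK_SCRIPT_TYPES:
            block["text"] = correct_text(block["text"])
    return blocks
```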
**Step 4: Apply spell correction in stream mode (`POST /ocr/stream`, full-page path)**

In `generate()` inside `run_ocr_stream`, find the block loop (around line 221) and apply the same guarded correction call.
**Step 5: Apply spell correction in guided mode (`POST /ocr/stream`, regions path)**

In `generate_guided()` inside `run_ocr_stream`, find the region block construction (around line 163) and apply the guarded correction call to the region text.
**Step 6: Verify the import works**

Run the import check. Expected output: `import OK`.

**Step 7: Run the full test suite**

Run all OCR service tests. Expected: all existing tests still pass, and all new `test_spell_check.py` tests pass.

**Step 8: Commit**
**Done Criteria**

- `python -m pytest -v` passes in `ocr-service/` with no regressions
- `python -c "from main import app"` exits cleanly
- Startup logs "German spell checker ready" alongside Kraken model loading
- `POST /ocr` with `scriptType: HANDWRITING_KURRENT` applies spell correction
- `POST /ocr/stream` (both full-page and guided/regions mode) applies spell correction for `HANDWRITING_LATIN` and `HANDWRITING_KURRENT`
- `POST /ocr` with `scriptType: TYPEWRITER` is unaffected (no spell check applied)

🏗️ Markus Keller — Senior Application Architect
Observations

- `spell_check.py` follows the same pattern as `engines/kraken.py` and `engines/surya.py`: module-level lazy state, a `load_*()` function, and a focused public API. The integration hook in `main.py` is the right place.
- `ILLEGIBLE_MARKER` is defined twice: `confidence.py` defines it at line 8, and the proposed `spell_check.py` defines it again. These two modules now share a contract — the marker string — without expressing that dependency. If the marker changes in one file and not the other, `_collapse_adjacent_markers` in `spell_check.py` will fail to recognize markers written by `apply_confidence_markers()` and will silently produce doubled `[unleserlich] [unleserlich]` output. Fix: in `spell_check.py`, import the constant from `confidence.py`.
- The `_models_ready` gate does not cover spell checker failure. The `lifespan` function sets `_models_ready = True` immediately after `load_spell_checker()` returns. If `load_spell_checker()` raises (e.g., the bundled German dictionary is corrupted, or `pyspellchecker` is not installed), the exception propagates, `_models_ready` stays `False`, and the service never starts — this is correct and acceptable. However, if `load_spell_checker()` silently succeeds but leaves the spell checker in a bad state, `correct_text()` raises `RuntimeError` on first use. For robustness, wrap the load in a try/except at startup and either abort or log a warning and skip spell checking.
Recommendations

- Import `ILLEGIBLE_MARKER` from `confidence.py` — do not redefine it. One source of truth.
- Wrap `load_spell_checker()` so the service's behavior is predictable if the dependency is missing or broken.
- The supplementary `dictionaries/de_historical.txt` is a good pattern — it will grow over time as users identify missing words. Consider documenting this in a comment at the top of the file so future contributors know it is the intended extension point.

👨‍💻 Felix Brandt — Senior Fullstack Developer
Observations

- `ILLEGIBLE_MARKER` is duplicated. `confidence.py` already defines it. `spell_check.py` redefining it independently violates DRY and creates a silent correctness risk. Import it: `from confidence import ILLEGIBLE_MARKER`.
- `correct_text()` does two things. It applies word-level spell correction AND collapses adjacent `[unleserlich]` markers. The marker-collapsing logic is duplicated from `apply_confidence_markers()` in `confidence.py`: both functions walk a token list and collapse adjacent markers. Extract this into a private helper `_collapse_markers(tokens: list[str]) -> list[str]` in `confidence.py` (it belongs there — markers are confidence's concept) and call it from both.
- Constant naming is confusing. `_SHORT_WORD_MAX_LEN = 3` combined with `len(token) <= _SHORT_WORD_MAX_LEN` means words of length exactly 3 are exempt. The constant name implies "max" but the logic exempts words at or below it. `_MIN_SPELL_CHECK_LEN = 4` with `len(token) < _MIN_SPELL_CHECK_LEN` makes the same logic readable without mental arithmetic.
- Task 3 Step 1 is not a failing test. The plan labels `python -c "from main import app"` as "Write failing integration test." This command succeeds before any changes — it's not red. Either write a real pytest test that imports `correct_text` from `main` and asserts it's called, or relabel this step as a "smoke check" so it's not confused with TDD.
- `test_correctable_ocr_error_gets_corrected` will likely be brittle. It asserts `result == "Haus"`, which pins behavior to a specific version of pyspellchecker's German dictionary. If the package is updated and "hauus" → something other than "haus", the test fails with no code change. Safer assertion: `assert result != "[unleserlich]" and result != "Hauus"`.
Recommendations

- Import `ILLEGIBLE_MARKER` from `confidence.py`, don't redefine it.
- Extract `_collapse_markers(tokens)` into `confidence.py` and call it from both modules.
- Rename `_SHORT_WORD_MAX_LEN = 3` → `_MIN_SPELL_CHECK_LEN = 4` and flip the comparison to `<`.

🔍 Sara Holt — QA Engineer & Test Strategist
Observations

Test coverage has several gaps worth closing before merging:

- `test_correctable_ocr_error_gets_corrected` is brittle. It asserts `result == "Haus"` — this pins the test to a specific version of pyspellchecker's bundled German dictionary. A patch update to the package could change the top-ranked candidate and break this test without any code change. Safer approach: assert that the result differs from the input and is not `[unleserlich]`.
- `test_capitalization_preserved_on_correction` has a too-weak assertion. Checking only `result[0].isupper()` says nothing about the rest of the token (a bare `"G"` would pass) and interacts badly with marker output such as `"[unleserlich]"`, which starts with `[`, not an uppercase letter. Strengthen it by comparing the full expected token.
- Missing test: calling `correct_text()` before `load_spell_checker()` — the code raises `RuntimeError`, but no test covers this path.
- Missing test: tokens with punctuation attached. Historical OCR output often includes punctuation run together with words: `"Haus,"`, `"Garten."`, `"Herr."`. These are not in any dictionary and would become `[unleserlich]`. This is a real false-positive risk for the feature. Test this explicitly and decide: strip trailing punctuation before checking, or accept the behavior and document it.
- Missing test: numeric tokens. Common in historical documents: `"1870er"`, `"18."`, `"§5"`. The current 3-char exemption won't help for `"1870er"` (7 chars). Will these become `[unleserlich]`?
- No CI step runs OCR service tests. Looking at `.gitea/workflows/ci.yml`, there is no job that runs `pytest` in `ocr-service/`. All new tests will be invisible to CI. This is a pre-existing gap, but adding tests makes it urgent — see Tobias's comment for a proposed CI job.

Recommendations
🔐 Nora "NullX" Steiner — Application Security Engineer
Observations
This is a pure text-processing change with a very small attack surface. No new HTTP endpoints, no external network calls at runtime, no user-controlled input paths. Overall: low risk. Two items worth addressing:
1. Loose version pin is a supply chain concern (minor)
`pyspellchecker>=0.8.0` allows any future version to be silently pulled when the Docker image is rebuilt. Every other package in `requirements.txt` is exactly pinned (`fastapi[standard]==0.115.6`, `surya-ocr==0.17.1`, etc.). This one should be pinned too.
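The elided snippet presumably showed the pinned line, e.g. (`0.9.0` is the version later verified in the container):

```text
pyspellchecker==0.9.0
```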
2. `pyspellchecker`'s German dictionary is bundled in the package — no runtime network call

I verified this: pyspellchecker ships its language dictionaries as compressed JSON files inside the Python package itself. `load_spell_checker()` reads from the package installation, not from the network. No SSRF risk, no network dependency at runtime. ✅

3. Historical wordlist is not user-controlled — no injection risk
`de_historical.txt` is read from a repo-controlled path relative to `__file__`. Lines are stripped and filtered. The spell checker's `word_frequency.load_words()` accepts plain strings — no shell, no eval, no templating. ✅

4. Correctness note: `ILLEGIBLE_MARKER` duplication

If the marker string diverges between `confidence.py` and `spell_check.py`, the collapse logic in `correct_text()` would fail to recognize existing markers and could produce doubled markers in output. This is a correctness bug, not a security issue, but I flag it because silent data corruption in an archival system is a trust problem.

Recommendations

- Pin `pyspellchecker` to an exact version in `requirements.txt`.

🛠️ Tobias Wendt — DevOps & Platform Engineer
Observations
1. Loose version pin — breaks reproducible builds
`pyspellchecker>=0.8.0` is the only inexact pin in `requirements.txt`. Every rebuild of the Docker image could silently pull a different version. Other packages in the file are all exact-pinned. Fix this before merging.

2. No CI job runs OCR service tests
Checked `.gitea/workflows/ci.yml` — there is no step that runs `pytest` in `ocr-service/`. The new `test_spell_check.py` and existing `test_confidence.py` will never run in CI. This is a pre-existing gap that's now urgent to close.

Add an `ocr-tests` job. Note that the heavy ML dependencies (surya, kraken, torch) cannot be installed in a standard CI runner — but `test_spell_check.py` and `test_confidence.py` only need `pyspellchecker` and `pytest`. We can install just those.
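The elided job definition might look like this sketch, to be placed under `jobs:` (Gitea Actions is workflow-compatible with GitHub Actions; the runner label and Python version here are assumptions):

```yaml
ocr-tests:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
      with:
        python-version: "3.12"
    # Only the lightweight deps — the multi-GB Kraken/Surya/PyTorch stack
    # is deliberately NOT installed here.
    - run: pip install pyspellchecker pytest
    - run: pytest test_spell_check.py test_confidence.py -v
      working-directory: ocr-service
```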
test_engines.pyortest_stream.pyhere — they require the full Kraken/Surya/PyTorch stack which is multi-GB and cannot run in CI.3. Docker image impact — minimal
`pyspellchecker` is pure Python with no C extensions. The bundled German dictionary is ~500 KB. The existing Dockerfile correctly COPYs `requirements.txt` before the source code, so this dependency lands in a cached layer. No additional volumes, no build-time downloads, no size concern.

4. Startup time — acceptable
`load_spell_checker()` reads the compressed dictionary JSON from the package installation. Benchmarked locally: <1 second. The OCR service healthcheck has `start_period: 120s`, easily covering this.

Existing issues flagged (not introduced by this issue):

- `minio/minio:latest`, `axllent/mailpit:latest`, `minio/mc` — all `:latest` tags. Should be pinned.
- `actions/upload-artifact@v3` in CI — deprecated, should be `@v4`.

Recommendations

- Pin `pyspellchecker` to an exact version.
- Add the `ocr-tests` CI job shown above — without it, the new tests are never run automatically.

🎨 Leonie Voss — UX Design Lead
No concerns from my angle.
This issue is entirely contained within the OCR Python microservice — no Svelte components, no frontend routes, no UI changes, and no user-visible output format changes (the `[unleserlich]` marker was already present in transcriptions). I reviewed the file map to confirm: all changes are in `ocr-service/`. The frontend already renders `[unleserlich]` as part of transcription block text. Nothing to assess from a UX or accessibility perspective.

👨‍💻 Felix Brandt — Senior Fullstack Developer
Pre-implementation discussion — four open items worked through and resolved. The plan needs updating before implementation starts.
Resolved
1. Historical wordlist — DTA corpus instead of handcrafted list
The 50-word handcrafted `de_historical.txt` is too thin for real Kurrent documents. Replace it with a corpus-derived wordlist from the Deutsches Textarchiv (DTA):

- Add `scripts/prepare_historical_dict.py` as a one-time data preparation step
- Sources: https://www.deutschestextarchiv.de/media/download/dtak/2020-10-23/original/1800-1899.zip — 140 MB, 684 texts; https://www.deutschestextarchiv.de/media/download/dtae/2020-10-23/original/1800-1899.zip — 105 MB, 2015 texts
- Output: `ocr-service/dictionaries/de_historical.txt`, sorted by frequency descending
- Commit `de_historical.txt` as a static artifact; the script lives in `scripts/` for reproducibility

URLs verified live (HTTP 200). This is a one-time task — the script is not part of the normal dev workflow.
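A sketch of what the preparation script's core might look like (download/IO plumbing elided; the tokeniser regex and lowercasing are assumptions — the committed script settled on `min_freq=20`):

```python
import re
from collections import Counter

# Words including German umlauts and ß; everything else is a separator.
WORD_RE = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def build_wordlist(texts: list[str], min_freq: int = 20) -> list[str]:
    """Count word occurrences across all corpus texts and keep the
    frequent ones, most frequent first."""
    counts: Counter[str] = Counter()
    for text in texts:
        counts.update(w.lower() for w in WORD_RE.findall(text))
    return [w for w, n in counts.most_common() if n >= min_freq]


# Real script: download and unzip the two DTA 1800-1899 archives, read each
# text file into `all_texts`, then write build_wordlist(all_texts), one word
# per line, to ocr-service/dictionaries/de_historical.txt.
```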
2. Version pin + frequency threshold + correction marker
- Pin `pyspellchecker==0.9.0` (current installed version, verified in container).
- Accept a correction only when `freq_correction > 50`. The pyspellchecker German dictionary uses `50` as a floor value — 75%+ of words sit at exactly this value. Words at the floor are known but unranked; correcting toward them produces non-deterministic results when multiple candidates tie at `50`. Tested: `Gartten` has two candidates (`garten`, `gatten`), both at `50` — the first run returned `gatten`, subsequent runs `garten`. The threshold blocks this correctly.
- Append `[?]` to corrected tokens — e.g. `Hauus` → `Haus[?]`. Define `CORRECTION_MARKER = "[?]"` in `confidence.py` alongside `ILLEGIBLE_MARKER`. The frontend already mutes any `[…]` token in read mode — `[?]` is handled automatically with no frontend changes needed.
- Update `test_correctable_ocr_error_gets_corrected` to assert `result == "Haus[?]"` instead of `result == "Haus"`.

3. Punctuation-attached tokens
OCR output commonly produces `"Haus,"` or `„Garten"` — punctuation attached to the word. Neither form is in any dictionary, causing false `[unleserlich]` markers on perfectly readable words. Resolution:

- Add `_strip_punctuation(token: str) -> tuple[str, str, str]` returning `(leading_punct, word, trailing_punct)`
- Spell-check only the inner word and reattach the punctuation afterwards, so a known `Haus,` stays `Haus,` and a corrected word keeps its `[?]` marker
- When the inner word becomes `[unleserlich]`, drop the attached punctuation — punctuation on an illegible marker is noise
4. `_collapse_markers` duplication — required plan update

The adjacent-marker collapse loop in
`correct_text()` duplicates logic from `confidence.py`. With `CORRECTION_MARKER` now also living in `confidence.py`, that module is the right home for the shared helper:

- Extract `_collapse_adjacent_markers(tokens: list[str]) -> list[str]` into `confidence.py` (private)
- Call it from both `apply_confidence_markers()` and `correct_text()`

The implementation plan needs four updates before coding begins: add the DTA preparation script as a prerequisite, update the version pin, add the frequency threshold and `[?]` marker to the `spell_check.py` spec, and add the `_collapse_adjacent_markers` extraction to `confidence.py`. All decisions above are agreed with the project owner.

Implementation complete ✅
All 7 tasks implemented on branch `feat/issue-254-german-spell-check`.

What was implemented
Task 1 — `confidence.py` refactor (77747aa)

- `CORRECTION_MARKER = "[?]"` added alongside `ILLEGIBLE_MARKER`
- `_collapse_adjacent_markers(tokens)` helper extracted from `apply_confidence_markers`

Task 2 — dependency (6faaa3b)

- `pyspellchecker==0.9.0` added to `requirements.txt` (exact pin)

Task 3 — DTA historical wordlist (30a6cbe)

- `scripts/prepare_historical_dict.py`: downloads dtak+dtae 1800–1899 original-spelling corpora, tokenises, filters (min_freq=20), writes the wordlist
- `ocr-service/dictionaries/de_historical.txt`: 153,547 words committed as a static artifact

Task 4 — failing tests (47f9a0b)

- `test_spell_check.py` (all red before implementation)

Task 5 — `spell_check.py` (0921319)

- Imports `ILLEGIBLE_MARKER`, `CORRECTION_MARKER`, `_collapse_adjacent_markers` from `confidence.py` — no duplication
- `_MIN_SPELL_CHECK_LEN = 4` with `len(word) < 4` guard
- `_strip_punctuation()`: strips non-alphanumeric from token ends before lookup, reattaches after correction
- `_is_numeric()`: digit-containing tokens pass through unchanged (handles `1870er`, `18.`)
- Corrections accepted only when `spell.word_frequency[correction] > 50`
- Corrected tokens get `[?]` (e.g. `Hauus` → `Haus[?]`)

Task 6 — `main.py` integration (77100ab)

- `_SPELL_CHECK_SCRIPT_TYPES = {"HANDWRITING_KURRENT", "HANDWRITING_LATIN"}`
- `load_spell_checker()` called at startup after the Kraken model load

Task 7 — CI (68b5791)

- `ocr-tests` job in `.gitea/workflows/ci.yml`: installs only `pyspellchecker==0.9.0` + `pytest`, runs `test_spell_check.py` and `test_confidence.py` — no ML stack required

Final test run