feat(ocr): German spell-check post-processing to reduce handwriting gibberish #254

Closed
opened 2026-04-17 12:44:53 +02:00 by marcel · 8 comments
Owner

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Add a German spell-check post-processing step after OCR so that high-confidence gibberish from Kraken (Kurrent/Sütterlin) and Surya (Latin handwriting) is corrected where possible and replaced with [unleserlich] when no plausible correction exists.

Architecture: A new spell_check.py module wraps pyspellchecker with a German frequency dictionary plus a supplementary dictionaries/de_historical.txt file for pre-reform spellings (Thür, Rath, Wirth, etc.). The module exposes load_spell_checker() (called at startup) and correct_text(text) -> str (called after apply_confidence_markers() in block/stream mode and after extract_region_text() in guided mode). Spell checking only runs for HANDWRITING_KURRENT and HANDWRITING_LATIN — typewritten output is already accurate.

Tech Stack: pyspellchecker>=0.8.0 (pure Python, no system deps), existing FastAPI/Kraken/Surya OCR stack.


File Map

File                                         Action   Purpose
ocr-service/requirements.txt                 Modify   Add pyspellchecker
ocr-service/dictionaries/de_historical.txt   Create   Supplementary historical German wordlist
ocr-service/spell_check.py                   Create   Spell-check module: load_spell_checker, correct_text
ocr-service/test_spell_check.py              Create   Unit tests for spell_check.py
ocr-service/main.py                          Modify   Import + call at startup and both OCR integration points

Task 1: Add pyspellchecker and create historical wordlist

Files:

  • Modify: ocr-service/requirements.txt

  • Create: ocr-service/dictionaries/de_historical.txt

  • Step 1: Add dependency

    In ocr-service/requirements.txt, add after the httpx line:

    pyspellchecker>=0.8.0
    
  • Step 2: Create the historical wordlist directory and file

    Create ocr-service/dictionaries/de_historical.txt with the following content (one word per line, lines starting with # are comments):

    # Historical German spellings common in Kurrent/Sütterlin documents (19th–early 20th century)
    # Pre-1901 and pre-1996 reform spellings not found in modern German frequency dictionaries
    
    # Door / gate
    Thür
    Thüre
    Thüren
    
    # Animal
    Thier
    Thiere
    Thieren
    Thierisch
    
    # Place names (old spelling)
    Cöln
    Coeln
    Thüringen
    
    # Council / administration
    Rath
    Räthe
    Stadtrath
    Gemeinderath
    Bürgerrath
    Kirchenrath
    Schulrath
    Landesrath
    
    # Innkeeper / host
    Wirth
    Wirths
    Wirthschaft
    Wirthshaus
    Hauswirth
    
    # Widow / widower (old spellings)
    Wittwe
    Witthum
    Wittwer
    
    # Help (old spelling)
    Hülfe
    Gehülfe
    Gehülfin
    
    # Need / distress (old spelling)
    Noth
    noth
    
    # Godparent
    Gevatter
    Gevatterin
    Taufpate
    Taufpatin
    
    # Occupation / social standing
    Nahrungsstand
    Geselle
    Lehrling
    Knecht
    Magd
    Kürschner
    Sattler
    
    # Church / civil records
    Taufschein
    Geburtsschein
    Sterberegister
    Heiratsurkunde
    Kirchenbuch
    Ehelichkeit
    Ehegattin
    
    # Spelling variants
    Strasse
    Grossvater
    Grossmutter
    Aelteste
    
  • Step 3: Commit

    git add ocr-service/requirements.txt ocr-service/dictionaries/de_historical.txt
    git commit -m "feat(ocr): add pyspellchecker dependency and historical German wordlist"
    

Task 2: Create spell_check.py (TDD)

Files:

  • Create: ocr-service/test_spell_check.py
  • Create: ocr-service/spell_check.py

Step 1: Write failing tests

  • Create ocr-service/test_spell_check.py:

    """Tests for OCR spell-check post-processing."""
    
    import pytest
    from spell_check import correct_text, load_spell_checker
    
    
    @pytest.fixture(autouse=True)
    def ensure_loaded():
        load_spell_checker()
    
    
    def test_known_german_word_passes_through():
        assert correct_text("Haus") == "Haus"
    
    
    def test_obvious_gibberish_replaced_with_marker():
        assert correct_text("xqzwrpvmk") == "[unleserlich]"
    
    
    def test_short_word_exempt_from_check():
        # words of 3 chars or fewer are never checked — too many false positives
        assert correct_text("im") == "im"
        assert correct_text("der") == "der"
        assert correct_text("zu") == "zu"
    
    
    def test_unleserlich_marker_preserved():
        assert correct_text("[unleserlich]") == "[unleserlich]"
    
    
    def test_mixed_text_correct_and_gibberish():
        result = correct_text("Haus xqzwrpvmk Garten")
        assert result == "Haus [unleserlich] Garten"
    
    
    def test_adjacent_gibberish_words_collapsed_to_one_marker():
        # two consecutive unresolvable words → single [unleserlich]
        result = correct_text("[unleserlich] xqzwrpvmk Haus")
        assert result == "[unleserlich] Haus"
    
    
    def test_empty_string_returns_empty():
        assert correct_text("") == ""
    
    
    def test_whitespace_only_returns_unchanged():
        assert correct_text("   ") == "   "
    
    
    def test_existing_marker_not_doubled():
        result = correct_text("[unleserlich] Haus [unleserlich]")
        assert result == "[unleserlich] Haus [unleserlich]"
    
    
    def test_historical_word_passes_through():
        # "Thür" is pre-reform spelling of "Tür" (door) — in our historical wordlist
        assert correct_text("Thür") == "Thür"
    
    
    def test_correctable_ocr_error_gets_corrected():
        # "Hauus" is a plausible OCR duplication error for "Haus" (edit distance 1)
        result = correct_text("Hauus")
        assert result == "Haus"
    
    
    def test_sentence_with_multiple_corrections():
        result = correct_text("Thür Hauus xqzwrpvmk Garten")
        assert result == "Thür Haus [unleserlich] Garten"
    
    
    def test_capitalization_preserved_on_correction():
        # if the original token started with uppercase, the correction should too
        result = correct_text("Gartten")
        assert result[0].isupper()
    
  • Run to verify all fail:

    cd ocr-service && python -m pytest test_spell_check.py -v
    

    Expected: ModuleNotFoundError: No module named 'spell_check'

Step 2: Write minimal implementation

  • Create ocr-service/spell_check.py:

    """German spell-check post-processing for OCR output."""
    
    import logging
    import os
    
    from spellchecker import SpellChecker
    
    logger = logging.getLogger(__name__)
    
    ILLEGIBLE_MARKER = "[unleserlich]"
    _SHORT_WORD_MAX_LEN = 3
    
    _spell: SpellChecker | None = None
    
    
    def load_spell_checker() -> None:
        """Load German spell checker with supplementary historical wordlist.
    
        Safe to call multiple times — no-op if already loaded.
        """
        global _spell
        if _spell is not None:
            return
    
        logger.info("Loading German spell checker...")
        _spell = SpellChecker(language="de")
    
        historical_path = os.path.join(os.path.dirname(__file__), "dictionaries", "de_historical.txt")
        if os.path.exists(historical_path):
            with open(historical_path, encoding="utf-8") as f:
                words = [
                    line.strip()
                    for line in f
                    if line.strip() and not line.startswith("#")
                ]
            _spell.word_frequency.load_words(words)
            logger.info("Loaded %d historical German words", len(words))
        else:
            logger.warning("Historical German wordlist not found at %s", historical_path)
    
        logger.info("German spell checker ready")
    
    
    def correct_text(text: str) -> str:
        """Spell-check OCR text, correcting errors and marking gibberish as [unleserlich].
    
        Already-present [unleserlich] tokens are preserved unchanged.
        Words of 3 characters or fewer are exempt from checking (particles, abbreviations).
        Adjacent [unleserlich] markers are collapsed into one.
    
        Args:
            text: OCR output, possibly already containing [unleserlich] from confidence filtering.
    
        Returns:
            Corrected text with unresolvable words replaced by [unleserlich].
        """
        if _spell is None:
            raise RuntimeError("Spell checker not loaded — call load_spell_checker() first")
    
        if not text.strip():
            return text
    
        tokens = text.split()
        checked: list[str] = []
    
        for token in tokens:
            if token == ILLEGIBLE_MARKER:
                checked.append(token)
                continue
    
            if len(token) <= _SHORT_WORD_MAX_LEN:
                checked.append(token)
                continue
    
            if _spell.known([token]):
                checked.append(token)
                continue
    
            correction = _spell.correction(token)
            if correction:
                if token[0].isupper() and not correction[0].isupper():
                    correction = correction.capitalize()
                checked.append(correction)
            else:
                checked.append(ILLEGIBLE_MARKER)
    
        # Collapse adjacent [unleserlich] markers into one
        collapsed: list[str] = []
        prev_was_marker = False
        for token in checked:
            if token == ILLEGIBLE_MARKER:
                if not prev_was_marker:
                    collapsed.append(token)
                prev_was_marker = True
            else:
                collapsed.append(token)
                prev_was_marker = False
    
        return " ".join(collapsed)
    
  • Install the new dependency in the venv:

    cd ocr-service && .venv/bin/pip install pyspellchecker
    
  • Run tests to verify they pass:

    cd ocr-service && python -m pytest test_spell_check.py -v
    

    Expected: all green. If test_correctable_ocr_error_gets_corrected is flaky, confirm pyspellchecker version ≥ 0.8.0 is installed.

  • Commit:

    git add ocr-service/spell_check.py ocr-service/test_spell_check.py
    git commit -m "feat(ocr): add spell_check module with German dictionary and historical wordlist"
    

Task 3: Integrate into main.py

Files:

  • Modify: ocr-service/main.py

The integration has three sub-locations: startup, block/stream mode, guided mode.

Step 1: Write failing integration test (manual smoke test)

  • Confirm the service starts without error after changes by running:

    cd ocr-service && python -c "from main import app; print('import OK')"
    

    This will fail until Step 2 is done. That's expected.

Step 2: Add import and startup call

  • At the top of ocr-service/main.py, add the import alongside the existing confidence import:

    from spell_check import correct_text, load_spell_checker
    
  • Add a module-level constant just below the ALLOWED_PDF_HOSTS block (around line 40):

    _SPELL_CHECK_SCRIPT_TYPES = {"HANDWRITING_KURRENT", "HANDWRITING_LATIN"}
    
  • In the lifespan async context manager, add the spell checker load after kraken_engine.load_models():

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        global _models_ready
    
        logger.info("Loading Kraken model at startup (Surya loads lazily on first OCR request)...")
        kraken_engine.load_models()
        load_spell_checker()
        _models_ready = True
        logger.info("Startup complete — ready to accept requests")
    
        yield
    
        logger.info("Shutting down OCR service")
    

Step 3: Apply spell correction in block mode (POST /ocr)

  • In run_ocr, find the block post-processing loop (around line 103):

    Before:

    threshold = get_threshold(script_type)
    for block in blocks:
        if block.get("words"):
            block["text"] = apply_confidence_markers(block["words"], threshold)
        block.pop("words", None)
    

    After:

    threshold = get_threshold(script_type)
    for block in blocks:
        if block.get("words"):
            block["text"] = apply_confidence_markers(block["words"], threshold)
        block.pop("words", None)
        if script_type in _SPELL_CHECK_SCRIPT_TYPES:
            block["text"] = correct_text(block["text"])
    

Step 4: Apply spell correction in stream mode (POST /ocr/stream, full-page path)

  • In generate() inside run_ocr_stream, find the block loop (around line 221):

    Before:

    for block in blocks:
        if block.get("words"):
            block["text"] = apply_confidence_markers(block["words"], threshold)
        block.pop("words", None)
    

    After:

    for block in blocks:
        if block.get("words"):
            block["text"] = apply_confidence_markers(block["words"], threshold)
        block.pop("words", None)
        if script_type in _SPELL_CHECK_SCRIPT_TYPES:
            block["text"] = correct_text(block["text"])
    

Step 5: Apply spell correction in guided mode (POST /ocr/stream, regions path)

  • In generate_guided() inside run_ocr_stream, find the region block construction (around line 163):

    Before:

    text = await asyncio.to_thread(
        engine.extract_region_text, image,
        region.x, region.y, region.width, region.height,
    )
    blocks.append({
        "pageNumber": page_idx,
        "x": region.x,
        "y": region.y,
        "width": region.width,
        "height": region.height,
        "polygon": None,
        "text": text,
        "annotationId": region.annotationId,
    })
    

    After:

    text = await asyncio.to_thread(
        engine.extract_region_text, image,
        region.x, region.y, region.width, region.height,
    )
    if script_type in _SPELL_CHECK_SCRIPT_TYPES:
        text = correct_text(text)
    blocks.append({
        "pageNumber": page_idx,
        "x": region.x,
        "y": region.y,
        "width": region.width,
        "height": region.height,
        "polygon": None,
        "text": text,
        "annotationId": region.annotationId,
    })
    

Step 6: Verify import works

  • Run:

    cd ocr-service && python -c "from main import app; print('import OK')"
    

    Expected: import OK

Step 7: Run the full test suite

  • Run all OCR service tests:

    cd ocr-service && python -m pytest -v
    

    Expected: all existing tests still pass, all new test_spell_check.py tests pass.

Step 8: Commit

git add ocr-service/main.py
git commit -m "feat(ocr): integrate spell-check post-processing for handwriting script types"

Done Criteria

  • python -m pytest -v passes in ocr-service/ with no regressions
  • python -c "from main import app" exits cleanly
  • Startup logs show "German spell checker ready" alongside Kraken model loading
  • POST /ocr with scriptType: HANDWRITING_KURRENT applies spell correction
  • POST /ocr/stream (both full-page and guided/regions mode) applies spell correction for HANDWRITING_LATIN and HANDWRITING_KURRENT
  • POST /ocr with scriptType: TYPEWRITER is unaffected (no spell check applied)
marcel added the feature label 2026-04-17 12:45:00 +02:00
Author
Owner

🏗️ Markus Keller — Senior Application Architect

Observations

  • Module boundary is clean. spell_check.py follows the same pattern as engines/kraken.py and engines/surya.py: module-level lazy state, a load_*() function, and a focused public API. The integration hook in main.py is the right place.
  • ILLEGIBLE_MARKER is defined twice. confidence.py defines it at line 8 and the proposed spell_check.py defines it again. These two modules now share a contract — the marker string — without expressing that dependency. If the marker changes in one file and not the other, _collapse_adjacent_markers in spell_check.py will fail to recognize markers written by apply_confidence_markers() and will produce doubled [unleserlich] [unleserlich] output silently. Fix: in spell_check.py, import the constant from confidence.py:
    from confidence import ILLEGIBLE_MARKER
    
  • _models_ready gate does not cover spell checker failure. The lifespan function sets _models_ready = True immediately after load_spell_checker() returns. If load_spell_checker() raises (e.g., the bundled German dictionary is corrupted, or pyspellchecker is not installed), the exception propagates, _models_ready stays False, and the service never starts — this is correct and acceptable. However, if load_spell_checker() silently succeeds but the spell checker is in a bad state, correct_text() raises RuntimeError on first use. For robustness, wrap in a try/except at startup and either abort or log a warning and skip spell checking:
    try:
        load_spell_checker()
    except Exception:
        logger.warning("Spell checker failed to load — running without post-correction")
    
    Whether to fail hard or degrade gracefully is a judgment call; I lean toward fail hard for correctness.

Recommendations

  • Import ILLEGIBLE_MARKER from confidence.py — do not redefine it. One source of truth.
  • Add a graceful startup failure path for load_spell_checker() so the service's behavior is predictable if the dependency is missing or broken.
  • The historical wordlist in dictionaries/de_historical.txt is a good pattern — it will grow over time as users identify missing words. Consider documenting this in a comment at the top of the file so future contributors know it is the intended extension point.
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

  • ILLEGIBLE_MARKER is duplicated. confidence.py already defines it. spell_check.py redefining it independently violates DRY and creates a silent correctness risk. Import it: from confidence import ILLEGIBLE_MARKER.

  • correct_text() does two things. It applies word-level spell correction AND collapses adjacent [unleserlich] markers. The marker-collapsing logic is already duplicated from apply_confidence_markers() in confidence.py. Both functions walk a token list and collapse adjacent markers. Extract this into a private helper _collapse_markers(tokens: list[str]) -> list[str] in confidence.py (it belongs there — markers are confidence's concept) and call it from both.

  • Constant naming is confusing. _SHORT_WORD_MAX_LEN = 3 combined with len(token) <= _SHORT_WORD_MAX_LEN means words of length exactly 3 are exempt. The constant name implies "max" but the logic exempts words at or below it. _MIN_SPELL_CHECK_LEN = 4 with len(token) < _MIN_SPELL_CHECK_LEN makes the same logic readable without mental arithmetic.

  • Task 3 Step 1 is not a failing test. The plan labels python -c "from main import app" as "Write failing integration test." This command succeeds before any changes — it's not red. Either write a real pytest test that imports correct_text from main and asserts it's called, or relabel this step as a "smoke check" so it's not confused with TDD.

  • test_correctable_ocr_error_gets_corrected will likely be brittle. It asserts result == "Haus" which pins behavior to a specific version of pyspellchecker's German dictionary. If the package is updated and "hauus" → something other than "haus", the test fails with no code change. Safer assertion: assert result != "[unleserlich]" and result != "Hauus".

Recommendations

  • Import ILLEGIBLE_MARKER from confidence.py, don't redefine it.
  • Extract _collapse_markers(tokens) into confidence.py and call it from both modules.
  • Rename _SHORT_WORD_MAX_LEN = 3 → _MIN_SPELL_CHECK_LEN = 4, flip comparison to <.
  • Relabel Task 3 Step 1 or replace it with a genuine failing pytest.
Author
Owner

🔍 Sara Holt — QA Engineer & Test Strategist

Observations

Test coverage has several gaps worth closing before merging:

  1. test_correctable_ocr_error_gets_corrected is brittle. It asserts result == "Haus" — this pins the test to a specific version of pyspellchecker's bundled German dictionary. A patch update to the package could change the top-ranked candidate and break this test without any code change. Safer approach:

    def test_correctable_ocr_error_gets_corrected():
        result = correct_text("Hauus")
        assert result != "[unleserlich]"
        assert result != "Hauus"  # something was corrected
    
  2. test_capitalization_preserved_on_correction has a too-weak assertion. Checking result[0].isupper() alone passes even when no correction happened at all (the unchanged "Gartten" already starts with an uppercase letter) and fails without explanation when the word is suppressed to "[unleserlich]". Strengthen it:

    def test_capitalization_preserved_on_correction():
        result = correct_text("Gartten")
        assert result != "Gartten"      # something changed
        assert result != "[unleserlich]"  # was corrected, not suppressed
        assert result[0].isupper()       # capitalization preserved
    
  3. Missing test: correct_text() before load_spell_checker() — the code raises RuntimeError but no test covers this path. (A sketch for this gap and the two below follows after this list.)

  4. Missing test: tokens with punctuation attached. Historical OCR output often includes punctuation run together with words: "Haus,", "Garten.", "Herr.". These are not in any dictionary and would become [unleserlich]. This is a real false-positive risk for the feature. Test this explicitly and decide: strip trailing punctuation before checking, or accept the behavior and document it.

  5. Missing test: numeric tokens. Common in historical documents: "1870er", "18.", "§5". The current 3-char exemption won't help for "1870er" (6 chars). Will these become [unleserlich]?

  6. No CI step runs OCR service tests. Looking at .gitea/workflows/ci.yml, there is no job that runs pytest in ocr-service/. All new tests will be invisible to CI. This is a pre-existing gap but adding tests makes it urgent — see Tobias's comment for a proposed CI job.
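
Hedged sketches of the three missing tests (items 3–5), written for test_spell_check.py, which already imports pytest and correct_text. The punctuation and numeric assertions encode the desired behavior under the assumption that those token shapes are exempted rather than marked illegible; adjust them once the design decision is made:

    import spell_check


    def test_correct_text_before_load_raises(monkeypatch):
        # bypass the autouse fixture by forcing the unloaded state
        monkeypatch.setattr(spell_check, "_spell", None)
        with pytest.raises(RuntimeError):
            spell_check.correct_text("Haus")


    def test_punctuation_attached_word_not_marked_illegible():
        # assumes punctuation is stripped before dictionary lookup
        assert "[unleserlich]" not in correct_text("Haus, Garten.")


    def test_numeric_token_passes_through():
        # assumes digit-containing tokens are exempt from checking
        assert correct_text("1870er") == "1870er"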

Recommendations

  • Fix the two weak test assertions as shown above.
  • Add tests for: unloaded state (RuntimeError), punctuation-attached tokens, numeric tokens.
  • Coordinate with Tobias to add the OCR service test CI job — otherwise these tests are decoration.
Author
Owner

🔐 Nora "NullX" Steiner — Application Security Engineer

Observations

This is a pure text-processing change with a very small attack surface. No new HTTP endpoints, no external network calls at runtime, no user-controlled input paths. Overall: low risk. Of the four observations below, two need action:

1. Loose version pin is a supply chain concern (minor)

pyspellchecker>=0.8.0 allows any future version to be silently pulled when the Docker image is rebuilt. Every other package in requirements.txt is exactly pinned (fastapi[standard]==0.115.6, surya-ocr==0.17.1, etc.). This one should be too:

pyspellchecker==0.8.1

A range spec means two Docker builds on different days can produce different behavior. Exact pins make the build reproducible and make dependency audits meaningful.

2. pyspellchecker's German dictionary is bundled in the package — no runtime network call

I verified this: pyspellchecker ships its language dictionaries as compressed JSON files inside the Python package itself. load_spell_checker() reads from the package installation, not from the network. No SSRF risk, no network dependency at runtime.

3. Historical wordlist is not user-controlled — no injection risk

de_historical.txt is read from a repo-controlled path relative to __file__. Lines are stripped and filtered. The spell checker's word_frequency.load_words() accepts plain strings — no shell, no eval, no templating.

4. Correctness note: ILLEGIBLE_MARKER duplication

If the marker string diverges between confidence.py and spell_check.py, the collapse logic in correct_text() would fail to recognize existing markers and could produce doubled markers in output. This is a correctness bug, not a security issue, but I flag it because silent data corruption in an archival system is a trust problem.

Recommendations

  • Pin pyspellchecker to an exact version in requirements.txt.
  • No security blockers. Proceed after the pin fix.
Author
Owner

🛠️ Tobias Wendt — DevOps & Platform Engineer

Observations

1. Loose version pin — breaks reproducible builds

pyspellchecker>=0.8.0 is the only inexact pin in requirements.txt. Every rebuild of the Docker image could silently pull a different version. Other packages in the file are all exact-pinned. Fix this before merging:

pyspellchecker==0.8.1

2. No CI job runs OCR service tests

Checked .gitea/workflows/ci.yml — there is no step that runs pytest in ocr-service/. The new test_spell_check.py and existing test_confidence.py will never run in CI. This is a pre-existing gap that's now urgent to close.

Add an ocr-tests job. Note that the heavy ML dependencies (surya, kraken, torch) cannot be installed in a standard CI runner — but test_spell_check.py and test_confidence.py only need pyspellchecker and pytest. We can install just those:

ocr-tests:
  name: OCR Service Tests
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
      with:
        python-version: '3.11'
    - name: Install test dependencies
      run: pip install "pyspellchecker==0.8.1" pytest
      working-directory: ocr-service
    - name: Run OCR unit tests (no ML stack required)
      run: python -m pytest test_spell_check.py test_confidence.py -v
      working-directory: ocr-service

Do NOT add test_engines.py or test_stream.py here — they require the full Kraken/Surya/PyTorch stack which is multi-GB and cannot run in CI.

3. Docker image impact — minimal

pyspellchecker is pure Python with no C extensions. The bundled German dictionary is ~500KB. The existing Dockerfile correctly COPYs requirements.txt before source code, so this dependency lands in a cached layer. No additional volumes, no build-time downloads, no size concern.

4. Startup time — acceptable

load_spell_checker() reads the compressed dictionary JSON from the package installation. Benchmarked locally: <1 second. The OCR service healthcheck has start_period: 120s, easily covering this.

Existing issues flagged (not introduced by this issue):

  • minio/minio:latest, axllent/mailpit:latest, minio/mc — all :latest tags. Should be pinned.
  • actions/upload-artifact@v3 in CI — deprecated, should be @v4.

Recommendations

  • Pin pyspellchecker to an exact version.
  • Add the ocr-tests CI job shown above — without it, the new tests are never run automatically.
  • Both are quick wins; neither blocks implementation.
Author
Owner

🎨 Leonie Voss — UX Design Lead

No concerns from my angle.

This issue is entirely contained within the OCR Python microservice — no Svelte components, no frontend routes, no UI changes, and no user-visible output format changes (the [unleserlich] marker was already present in transcriptions). I reviewed the file map to confirm: all changes are in ocr-service/. The frontend already renders [unleserlich] as part of transcription block text. Nothing to assess from a UX or accessibility perspective.

Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Pre-implementation discussion — four open items worked through and resolved. The plan needs updating before implementation starts.


Resolved

1. Historical wordlist — DTA corpus instead of handcrafted list

The 50-word handcrafted de_historical.txt is too thin for real Kurrent documents. Replace it with a corpus-derived wordlist from the Deutsches Textarchiv (DTA).

  • Add scripts/prepare_historical_dict.py as a one-time data preparation step
  • Download both DTA 19th-century plain text corpora (original spelling, not normalised):
    • https://www.deutschestextarchiv.de/media/download/dtak/2020-10-23/original/1800-1899.zip — 140MB, 684 texts
    • https://www.deutschestextarchiv.de/media/download/dtae/2020-10-23/original/1800-1899.zip — 105MB, 2015 texts
  • Tokenise, count frequencies, filter (alphabetic, length > 3, min frequency threshold), write ocr-service/dictionaries/de_historical.txt sorted by frequency descending
  • Commit de_historical.txt as a static artifact; the script lives in scripts/ for reproducibility

URLs verified live (HTTP 200). This is a one-time task — the script is not part of the normal dev workflow.
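
A minimal sketch of what the preparation script could look like under these decisions; the tokeniser regex, the in-memory download, and all names inside main() are assumptions, not the final script:

    #!/usr/bin/env python3
    """Sketch of scripts/prepare_historical_dict.py (one-time data preparation)."""
    import io
    import re
    import urllib.request
    import zipfile
    from collections import Counter
    from pathlib import Path

    CORPUS_URLS = [
        "https://www.deutschestextarchiv.de/media/download/dtak/2020-10-23/original/1800-1899.zip",
        "https://www.deutschestextarchiv.de/media/download/dtae/2020-10-23/original/1800-1899.zip",
    ]
    OUTPUT = Path("ocr-service/dictionaries/de_historical.txt")
    MIN_FREQ = 20  # minimum corpus frequency to keep a word
    MIN_LEN = 4    # matches the "length > 3" filter above
    TOKEN_RE = re.compile(r"[A-Za-zÄÖÜäöüß]+")  # alphabetic tokens, umlauts and ß included

    def main() -> None:
        counts: Counter[str] = Counter()
        for url in CORPUS_URLS:
            print(f"Downloading {url} ...")
            with urllib.request.urlopen(url) as resp:
                archive = zipfile.ZipFile(io.BytesIO(resp.read()))  # ~140MB in memory, acceptable for a one-off
            for name in archive.namelist():
                if name.endswith(".txt"):
                    text = archive.read(name).decode("utf-8", errors="replace")
                    counts.update(t for t in TOKEN_RE.findall(text) if len(t) >= MIN_LEN)
        kept = sorted(
            ((w, n) for w, n in counts.items() if n >= MIN_FREQ),
            key=lambda item: item[1], reverse=True,  # frequency descending
        )
        OUTPUT.parent.mkdir(parents=True, exist_ok=True)
        OUTPUT.write_text("\n".join(w for w, _ in kept) + "\n", encoding="utf-8")
        print(f"Wrote {len(kept)} words to {OUTPUT}")

    if __name__ == "__main__":
        main()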


2. Version pin + frequency threshold + correction marker

  • Pin: pyspellchecker==0.9.0 (current installed version, verified in container)
  • Frequency threshold: only apply a correction when freq_correction > 50. The pyspellchecker German dictionary uses 50 as a floor value — 75%+ of words sit at exactly this value. Words at the floor are known but unranked; correcting toward them produces non-deterministic results when multiple candidates tie at 50. Tested: Gartten has two candidates (garten, gatten) both at 50 — first run returned gatten, subsequent runs garten. Threshold blocks this correctly.
  • Correction marker: append [?] to corrected tokens — e.g. Hauus → Haus[?]. Define CORRECTION_MARKER = "[?]" in confidence.py alongside ILLEGIBLE_MARKER. The frontend already mutes any […] token in read mode — [?] is handled automatically with no frontend changes needed.
  • Update test_correctable_ocr_error_gets_corrected to assert result == "Haus[?]" instead of result == "Haus".
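
A minimal sketch of the acceptance logic this implies. It assumes the module-level _spell checker from the plan and the new CORRECTION_MARKER constant from confidence.py; the function name is illustrative, not the committed API:

    _FREQ_FLOOR = 50  # pyspellchecker's German dictionary floor value

    def _accept_correction(token: str) -> str | None:
        """Return a marked correction, or None if no trustworthy candidate exists."""
        candidate = _spell.correction(token)
        if candidate is None or candidate == token.lower():
            return None
        # Reject candidates sitting at the dictionary floor: several candidates
        # can tie at 50 (Gartten -> garten/gatten), making the pick non-deterministic.
        if _spell.word_frequency[candidate] <= _FREQ_FLOOR:
            return None
        if token[0].isupper() and not candidate[0].isupper():
            candidate = candidate.capitalize()
        return candidate + CORRECTION_MARKER  # e.g. "Hauus" -> "Haus[?]"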

3. Punctuation-attached tokens

OCR output commonly produces "Haus," or „Garten" — punctuation attached to the word. Neither form is in any dictionary, causing false [unleserlich] markers on perfectly readable words.

  • Add private helper _strip_punctuation(token: str) -> tuple[str, str, str] returning (leading_punct, word, trailing_punct)
  • Strip both ends before lookup, reattach after a successful correction: Haus, → Haus,[?]
  • Drop punctuation when the result is [unleserlich] — attached punctuation on an illegible marker is noise
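
A sketch of the helper's shape; the isalnum() boundary test is an assumption (the committed version may use a different character class):

    def _strip_punctuation(token: str) -> tuple[str, str, str]:
        """Split a token into (leading_punct, word, trailing_punct)."""
        start, end = 0, len(token)
        while start < end and not token[start].isalnum():
            start += 1
        while end > start and not token[end - 1].isalnum():
            end -= 1
        return token[:start], token[start:end], token[end:]

    # _strip_punctuation('„Garten.') -> ('„', 'Garten', '.')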

4. _collapse_markers duplication — required plan update

The adjacent-marker collapse loop in correct_text() duplicates logic from confidence.py. With CORRECTION_MARKER now also living in confidence.py, that module is the right home for the shared helper.

  • Extract _collapse_adjacent_markers(tokens: list[str]) -> list[str] into confidence.py (private)
  • Call it from both apply_confidence_markers() and correct_text()
  • The issue plan must be updated to include this extraction before Task 2 implementation starts
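
The extraction is mechanical: the collapse loop already written in Task 2's correct_text() becomes a module-level helper in confidence.py, unchanged in behavior:

    def _collapse_adjacent_markers(tokens: list[str]) -> list[str]:
        """Collapse runs of adjacent ILLEGIBLE_MARKER tokens into a single marker."""
        collapsed: list[str] = []
        prev_was_marker = False
        for token in tokens:
            if token == ILLEGIBLE_MARKER:
                if not prev_was_marker:
                    collapsed.append(token)
                prev_was_marker = True
            else:
                collapsed.append(token)
                prev_was_marker = False
        return collapsed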

The implementation plan needs four updates before coding begins: add the DTA preparation script as a prerequisite, update the version pin, add the frequency threshold and [?] marker to the spell_check.py spec, and add the _collapse_adjacent_markers extraction to confidence.py. All decisions above are agreed with the project owner.

Author
Owner

Implementation complete

All 7 tasks implemented on branch feat/issue-254-german-spell-check.

What was implemented

Task 1 — confidence.py refactor (77747aa)

  • Added CORRECTION_MARKER = "[?]" alongside ILLEGIBLE_MARKER
  • Extracted private _collapse_adjacent_markers(tokens) helper from apply_confidence_markers
  • All 21 existing confidence tests stay green

Task 2 — dependency (6faaa3b)

  • pyspellchecker==0.9.0 added to requirements.txt (exact pin)

Task 3 — DTA historical wordlist (30a6cbe)

  • scripts/prepare_historical_dict.py: downloads dtak+dtae 1800–1899 original-spelling corpora, tokenises, filters (min_freq=20), writes wordlist
  • ocr-service/dictionaries/de_historical.txt: 153,547 words committed as static artifact

Task 4 — failing tests (47f9a0b)

  • 16 tests in test_spell_check.py (all red before implementation)

Task 5 — spell_check.py (0921319)

  • Imports ILLEGIBLE_MARKER, CORRECTION_MARKER, _collapse_adjacent_markers from confidence.py — no duplication
  • _MIN_SPELL_CHECK_LEN = 4 with len(word) < 4 guard
  • _strip_punctuation(): strips non-alphanumeric from token ends before lookup, reattaches after correction
  • _is_numeric(): digit-containing tokens pass through unchanged (handles 1870er, 18.)
  • Frequency threshold: correction only applied when spell.word_frequency[correction] > 50
  • Corrected tokens marked with [?] (e.g. Hauus → Haus[?])
  • Capitalization preserved on correction
  • All 16 tests green
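
The committed digit guard isn't quoted in this thread; a hypothetical sketch consistent with the bullet above (the actual code may differ):

    def _is_numeric(token: str) -> bool:
        """True for tokens containing any digit, e.g. "1870er", "18.", "§5"."""
        return any(ch.isdigit() for ch in token)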

Task 6 — main.py integration (77100ab)

  • Import + _SPELL_CHECK_SCRIPT_TYPES = {"HANDWRITING_KURRENT", "HANDWRITING_LATIN"}
  • load_spell_checker() called at startup after Kraken model load
  • Spell check applied in block mode, stream mode (full-page), and guided mode

Task 7 — CI (68b5791)

  • ocr-tests job in .gitea/workflows/ci.yml: installs only pyspellchecker==0.9.0 + pytest, runs test_spell_check.py and test_confidence.py — no ML stack required

Final test run

37 passed in 2.60s  (21 confidence + 16 spell_check)