feat(ocr): German spell-check post-processing to reduce handwriting gibberish #260

marcel · 2026-04-17T17:12:44+02:00

marcel commented

2026-04-17 17:12:44 +02:00

Closes #254

Summary

Adds a German spell-check post-processing step after OCR confidence filtering. For HANDWRITING_KURRENT and HANDWRITING_LATIN script types, each token is looked up in pyspellchecker's German dictionary (augmented with a 153K-word DTA 19th-century historical wordlist). Plausible OCR errors are corrected and marked with [?] (e.g. Hauus → Haus[?]); unresolvable gibberish that survived the confidence filter becomes [unleserlich]. Typewritten output is unaffected. A new ocr-tests CI job runs test_spell_check.py and test_confidence.py without the full ML stack.

Closes #254 ## Summary Adds a German spell-check post-processing step after OCR confidence filtering. For `HANDWRITING_KURRENT` and `HANDWRITING_LATIN` script types, each token is looked up in `pyspellchecker`'s German dictionary (augmented with a 153K-word DTA 19th-century historical wordlist). Plausible OCR errors are corrected and marked with `[?]` (e.g. `Hauus → Haus[?]`); unresolvable gibberish that survived the confidence filter becomes `[unleserlich]`. Typewritten output is unaffected. A new `ocr-tests` CI job runs `test_spell_check.py` and `test_confidence.py` without the full ML stack.

marcel added 7 commits 2026-04-17 17:12:44 +02:00

refactor(ocr): extract _collapse_adjacent_markers helper and add CORRECTION_MARKER 77747aa556

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(ocr): add pyspellchecker dependency 6faaa3b7d6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(ocr): add DTA-derived historical German wordlist and generation script 30a6cbeb7f

153K words from dtak+dtae 1800-1899 corpora (min_freq=20),
covering pre-reform spellings common in Kurrent/Süterlin documents.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(ocr): add failing tests for spell_check module 47f9a0bf73

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(ocr): add spell_check module with German spellchecker and historical wordlist 092131930c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(ocr): integrate spell-check post-processing for handwriting script types 77100ab1e6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci: add ocr-tests job for spell_check and confidence unit tests

CI / Unit & Component Tests (push) Failing after 2m48s

Details

CI / OCR Service Tests (push) Successful in 1m59s

Details

CI / Backend Unit Tests (push) Failing after 2m53s

Details

CI / Unit & Component Tests (pull_request) Failing after 2m52s

Details

CI / OCR Service Tests (pull_request) Successful in 33s

Details

CI / Backend Unit Tests (pull_request) Failing after 2m54s

Details

68b57918eb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-04-17 17:15:18 +02:00

🏗️ Markus Keller — Senior Application Architect

Verdict: ⚠️ Approved with concerns

Observations

1. Importing a private function across module boundaries (spell_check.py:8)

from confidence import CORRECTION_MARKER, ILLEGIBLE_MARKER, _collapse_adjacent_markers

_collapse_adjacent_markers carries an underscore prefix, signalling "internal to confidence.py." Importing a _-prefixed name from another module contradicts that signal — future readers of confidence.py won't know the function is part of the module's effective public API. This was a deliberate decision in the issue thread, and the function genuinely belongs in confidence.py (it's about marker semantics). The fix is cheap: drop the underscore to make the contract explicit.

# confidence.py — make the shared helper public
def collapse_adjacent_markers(tokens: list[str]) -> list[str]:
    ...

This is the only blocker I'd flag. It's a naming inconsistency, not a correctness bug.

2. de_historical.txt as a committed 1.5MB blob — acceptable for this project scale, but worth noting. Every update to the wordlist will add another 1.5MB delta to git history. If the wordlist is ever updated more than 2–3 times, consider using git lfs or regenerating the file externally and fetching it at container build time. Not a blocker now.

3. Module structure is clean. spell_check.py has a clear public API (load_spell_checker, correct_text), private helpers with underscore prefixes, and no reach into other modules except via the explicit import above. The _SPELL_CHECK_SCRIPT_TYPES constant in main.py is well-placed.

4. No ADR needed for this change — it extends an existing subsystem rather than introducing a new architectural boundary.

Recommendation

Rename _collapse_adjacent_markers → collapse_adjacent_markers in confidence.py and update the import in spell_check.py. One line change, zero behavioral impact.
Everything else is solid.

## 🏗️ Markus Keller — Senior Application Architect **Verdict: ⚠️ Approved with concerns** ### Observations **1. Importing a private function across module boundaries** (`spell_check.py:8`) ```python from confidence import CORRECTION_MARKER, ILLEGIBLE_MARKER, _collapse_adjacent_markers ``` `_collapse_adjacent_markers` carries an underscore prefix, signalling "internal to `confidence.py`." Importing a `_`-prefixed name from another module contradicts that signal — future readers of `confidence.py` won't know the function is part of the module's effective public API. This was a deliberate decision in the issue thread, and the function genuinely belongs in `confidence.py` (it's about marker semantics). The fix is cheap: drop the underscore to make the contract explicit. ```python # confidence.py — make the shared helper public def collapse_adjacent_markers(tokens: list[str]) -> list[str]: ... ``` This is the only blocker I'd flag. It's a naming inconsistency, not a correctness bug. **2. `de_historical.txt` as a committed 1.5MB blob** — acceptable for this project scale, but worth noting. Every update to the wordlist will add another 1.5MB delta to git history. If the wordlist is ever updated more than 2–3 times, consider using `git lfs` or regenerating the file externally and fetching it at container build time. Not a blocker now. **3. Module structure is clean.** `spell_check.py` has a clear public API (`load_spell_checker`, `correct_text`), private helpers with underscore prefixes, and no reach into other modules except via the explicit import above. The `_SPELL_CHECK_SCRIPT_TYPES` constant in `main.py` is well-placed. **4. No ADR needed** for this change — it extends an existing subsystem rather than introducing a new architectural boundary. ### Recommendation - Rename `_collapse_adjacent_markers` → `collapse_adjacent_markers` in `confidence.py` and update the import in `spell_check.py`. One line change, zero behavioral impact. - Everything else is solid.

marcel commented

2026-04-17 17:18:03 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns

Blockers

_collapse_adjacent_markers imported across module boundary (spell_check.py:8)

The underscore prefix is a Python convention for "private to this module." Importing it from confidence.py in spell_check.py violates that contract and will confuse future readers. Markus flagged the same issue from an architecture angle — the fix is identical:

# confidence.py — rename (drop the underscore)
def collapse_adjacent_markers(tokens: list[str]) -> list[str]:
    ...

# spell_check.py
from confidence import CORRECTION_MARKER, ILLEGIBLE_MARKER, collapse_adjacent_markers

Suggestions

correct_text is doing three things — token classification, correction logic, AND collapse. At 50 lines it's approaching the extract-me signal. A private _check_token(token) -> str helper that handles one token's decision tree would make the loop body a single-purpose orchestrator. Not a blocker; a refactor-phase suggestion.

Threshold comment missing (spell_check.py:110) — > 50 is a non-obvious constant with a specific reason (avoids non-determinism when candidates tie at the floor). Per the codebase's comment policy, a one-liner here earns its place:

if correction and _spell.word_frequency[correction] > 50:  # floor avoids ties at minimum frequency

TDD evidence — test file was committed before implementation, tests cover all 16 cases from the plan, and the autouse fixture prevents unloaded-checker surprises. Solid red/green discipline throughout. ✅

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ⚠️ Approved with concerns** ### Blockers **`_collapse_adjacent_markers` imported across module boundary** (`spell_check.py:8`) The underscore prefix is a Python convention for "private to this module." Importing it from `confidence.py` in `spell_check.py` violates that contract and will confuse future readers. Markus flagged the same issue from an architecture angle — the fix is identical: ```python # confidence.py — rename (drop the underscore) def collapse_adjacent_markers(tokens: list[str]) -> list[str]: ... # spell_check.py from confidence import CORRECTION_MARKER, ILLEGIBLE_MARKER, collapse_adjacent_markers ``` ### Suggestions **`correct_text` is doing three things** — token classification, correction logic, AND collapse. At 50 lines it's approaching the extract-me signal. A private `_check_token(token) -> str` helper that handles one token's decision tree would make the loop body a single-purpose orchestrator. Not a blocker; a refactor-phase suggestion. **Threshold comment missing** (`spell_check.py:110`) — `> 50` is a non-obvious constant with a specific reason (avoids non-determinism when candidates tie at the floor). Per the codebase's comment policy, a one-liner here earns its place: ```python if correction and _spell.word_frequency[correction] > 50: # floor avoids ties at minimum frequency ``` **TDD evidence** — test file was committed before implementation, tests cover all 16 cases from the plan, and the `autouse` fixture prevents unloaded-checker surprises. Solid red/green discipline throughout. ✅

marcel commented

2026-04-17 17:18:14 +02:00

🔐 Nora Steiner — Application Security Engineer

Verdict: ✅ Approved

What I checked

This change adds a pure in-process text transformation pipeline (no network calls, no file writes, no subprocess invocations). The threat surface is narrow.

Historical wordlist loading (spell_check.py:29-37) — Path is constructed via os.path.join(os.path.dirname(__file__), "dictionaries", "de_historical.txt"). This is a static path derived from the module location, not user-controlled input. No path traversal vector.

Input processing — correct_text() receives OCR output from trusted internal callers (after it has already been through Kraken/Surya). No external user input reaches this function directly. Even if malicious text were passed in, the function only splits on whitespace and replaces tokens — no eval, no subprocess, no template rendering.

pyspellchecker==0.9.0 supply chain — exact pin is good hygiene. The library is pure Python (no native extensions), reducing the binary supply chain risk surface.

load_spell_checker() startup failure — the plan says "fail hard if it raises." Looking at main.py, this runs in lifespan which will propagate exceptions and prevent the app from starting. Correct behavior — silent degradation would mean undetected data quality issues.

Notes

The _MIN_SPELL_CHECK_LEN = 4 constant avoids processing very short tokens. This is good — short-word spell-check results are notoriously unstable and could in theory cause denial-of-service via high-cost correction lookups on pathological inputs. The exemption closes that avenue.

Nothing security-relevant to block on.

## 🔐 Nora Steiner — Application Security Engineer **Verdict: ✅ Approved** ### What I checked This change adds a pure in-process text transformation pipeline (no network calls, no file writes, no subprocess invocations). The threat surface is narrow. **Historical wordlist loading** (`spell_check.py:29-37`) — Path is constructed via `os.path.join(os.path.dirname(__file__), "dictionaries", "de_historical.txt")`. This is a static path derived from the module location, not user-controlled input. No path traversal vector. **Input processing** — `correct_text()` receives OCR output from trusted internal callers (after it has already been through Kraken/Surya). No external user input reaches this function directly. Even if malicious text were passed in, the function only splits on whitespace and replaces tokens — no eval, no subprocess, no template rendering. **`pyspellchecker==0.9.0` supply chain** — exact pin is good hygiene. The library is pure Python (no native extensions), reducing the binary supply chain risk surface. **`load_spell_checker()` startup failure** — the plan says "fail hard if it raises." Looking at `main.py`, this runs in `lifespan` which will propagate exceptions and prevent the app from starting. Correct behavior — silent degradation would mean undetected data quality issues. ### Notes The `_MIN_SPELL_CHECK_LEN = 4` constant avoids processing very short tokens. This is good — short-word spell-check results are notoriously unstable and could in theory cause denial-of-service via high-cost correction lookups on pathological inputs. The exemption closes that avenue. Nothing security-relevant to block on.

marcel commented

2026-04-17 17:18:22 +02:00

🧪 Sara Holt — QA Engineer

Verdict: ⚠️ Approved with concerns

Blockers

None.

Suggestions

test_correctable_ocr_error_gets_corrected asserts an exact string (test_spell_check.py:58-59)

def test_correctable_ocr_error_gets_corrected():
    result = correct_text("Hauus")
    assert result == "Haus[?]"

This test pins the exact output of pyspellchecker's correction algorithm for "Hauus". If a future library update changes the internal frequency weights or correction ranking, this test fails — not because the feature broke, but because the dictionary changed. Consider a softer assertion:

assert result != "Hauus"           # correction was attempted
assert result != "[unleserlich]"   # not marked illegible
assert "[?]" in result             # correction marker present
assert result.startswith("Haus")   # expected stem

This makes the intent explicit without coupling to a specific library version's dictionary state.

Missing main.py integration test — the spell-check is wired into three code paths in main.py (block mode, stream mode, guided mode), but there are no tests verifying this wiring. I understand the ML stack makes this expensive; a minimal smoke test with the HANDWRITING_KURRENT script type mocking out the OCR engine would give confidence that the routing logic is correct. Suggest creating a follow-up issue rather than blocking this PR.

Positive notes:

16 tests covering all specified behaviors — the plan's full acceptance criterion list is covered ✅
autouse=True fixture prevents the unloaded-checker edge case from slipping through ✅
monkeypatch.setattr for the not-loaded test is the right approach — no state leaks between tests ✅
Adjacent marker collapse is tested both for pre-existing markers and for adjacent gibberish words ✅

## 🧪 Sara Holt — QA Engineer **Verdict: ⚠️ Approved with concerns** ### Blockers None. ### Suggestions **`test_correctable_ocr_error_gets_corrected` asserts an exact string** (`test_spell_check.py:58-59`) ```python def test_correctable_ocr_error_gets_corrected(): result = correct_text("Hauus") assert result == "Haus[?]" ``` This test pins the exact output of `pyspellchecker`'s correction algorithm for `"Hauus"`. If a future library update changes the internal frequency weights or correction ranking, this test fails — not because the feature broke, but because the dictionary changed. Consider a softer assertion: ```python assert result != "Hauus" # correction was attempted assert result != "[unleserlich]" # not marked illegible assert "[?]" in result # correction marker present assert result.startswith("Haus") # expected stem ``` This makes the intent explicit without coupling to a specific library version's dictionary state. **Missing main.py integration test** — the spell-check is wired into three code paths in `main.py` (block mode, stream mode, guided mode), but there are no tests verifying this wiring. I understand the ML stack makes this expensive; a minimal smoke test with the `HANDWRITING_KURRENT` script type mocking out the OCR engine would give confidence that the routing logic is correct. Suggest creating a follow-up issue rather than blocking this PR. **Positive notes:** - 16 tests covering all specified behaviors — the plan's full acceptance criterion list is covered ✅ - `autouse=True` fixture prevents the unloaded-checker edge case from slipping through ✅ - `monkeypatch.setattr` for the not-loaded test is the right approach — no state leaks between tests ✅ - Adjacent marker collapse is tested both for pre-existing markers and for adjacent gibberish words ✅

marcel commented

2026-04-17 17:18:32 +02:00

🛠 Tobias Wendt — DevOps & Platform Engineer

Verdict: ✅ Approved

What I checked

New ocr-tests job (.gitea/workflows/ci.yml) — exactly the spec from the issue. Two-step pip install + pytest invocation, correct working-directory: ocr-service, runs both test_spell_check.py and test_confidence.py in one command. Clean.

No ML stack dependency — the job installs only pyspellchecker==0.9.0 and pytest. This keeps CI runtime to under 30 seconds for this job, no GPU runners needed. ✅

Version pin consistency — pip install "pyspellchecker==0.9.0" in CI matches requirements.txt exactly. No drift risk. ✅

ubuntu-latest — this is the same runner used by all other jobs in this workflow. Consistent. If we ever want reproducibility across the workflow, pinning to ubuntu-24.04 would make sense, but that's a whole-workflow change, not something to block this PR on.

Minor suggestion

The pip install command installs into the system Python on the runner, not a venv. That's fine for CI — runners are ephemeral. But if we ever add more jobs that need conflicting packages, a python -m venv .venv && .venv/bin/pip install ... pattern would isolate them. Not needed now.

de_historical.txt committed as a static artifact — 1.5MB static file in the repo is a pragmatic call. The alternative (downloading 243MB of DTA ZIPs on every CI run) would add 2+ minutes and an external dependency on the DTA server. Static artifact is the right choice for a family project. ✅

## 🛠 Tobias Wendt — DevOps & Platform Engineer **Verdict: ✅ Approved** ### What I checked **New `ocr-tests` job** (`.gitea/workflows/ci.yml`) — exactly the spec from the issue. Two-step pip install + pytest invocation, correct `working-directory: ocr-service`, runs both `test_spell_check.py` and `test_confidence.py` in one command. Clean. **No ML stack dependency** — the job installs only `pyspellchecker==0.9.0` and `pytest`. This keeps CI runtime to under 30 seconds for this job, no GPU runners needed. ✅ **Version pin consistency** — `pip install "pyspellchecker==0.9.0"` in CI matches `requirements.txt` exactly. No drift risk. ✅ **`ubuntu-latest`** — this is the same runner used by all other jobs in this workflow. Consistent. If we ever want reproducibility across the workflow, pinning to `ubuntu-24.04` would make sense, but that's a whole-workflow change, not something to block this PR on. ### Minor suggestion The pip install command installs into the system Python on the runner, not a venv. That's fine for CI — runners are ephemeral. But if we ever add more jobs that need conflicting packages, a `python -m venv .venv && .venv/bin/pip install ...` pattern would isolate them. Not needed now. **`de_historical.txt` committed as a static artifact** — 1.5MB static file in the repo is a pragmatic call. The alternative (downloading 243MB of DTA ZIPs on every CI run) would add 2+ minutes and an external dependency on the DTA server. Static artifact is the right choice for a family project. ✅

marcel commented

2026-04-17 17:18:37 +02:00

🎨 Leonie Voss — UX Designer & Accessibility Strategist

Verdict: ✅ Approved

This PR is entirely within the OCR service Python backend — no frontend components, no Svelte files, no UI changes, no user-facing strings modified.

The markers [unleserlich] and [?] that this feature produces are displayed in the transcription view, but that rendering was already in place before this PR. If a future PR exposes these markers more prominently in the UI (e.g. highlight [?] corrections in a different colour, offer inline correction acceptance), that would be the right moment to review the UX treatment — colour-alone cues, icon+text pairs, accessible labeling. Not in scope here.

Nothing to flag.

## 🎨 Leonie Voss — UX Designer & Accessibility Strategist **Verdict: ✅ Approved** This PR is entirely within the OCR service Python backend — no frontend components, no Svelte files, no UI changes, no user-facing strings modified. The markers `[unleserlich]` and `[?]` that this feature produces are displayed in the transcription view, but that rendering was already in place before this PR. If a future PR exposes these markers more prominently in the UI (e.g. highlight `[?]` corrections in a different colour, offer inline correction acceptance), that would be the right moment to review the UX treatment — colour-alone cues, icon+text pairs, accessible labeling. Not in scope here. Nothing to flag.

marcel referenced this pull request

2026-04-17 17:24:01 +02:00

test(ocr): add integration tests for spell-check routing in main.py #262

marcel added 3 commits 2026-04-17 17:24:17 +02:00

refactor(ocr): make collapse_adjacent_markers a public function fea24aee25

Drop underscore prefix — the helper is part of confidence.py's effective
public API since spell_check.py imports and calls it directly.

Fixes reviewer concern: importing a _-prefixed name across module boundaries
contradicts Python's private-by-convention signal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

refactor(ocr): document > 50 frequency threshold rationale ec85f228c1

Strict greater-than avoids non-determinism: if multiple candidates share
the minimum frequency value, pyspellchecker's ranking is undefined.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(ocr): decouple correction tests from exact library dictionary state

CI / Unit & Component Tests (pull_request) Successful in 3m35s

Details

CI / OCR Service Tests (pull_request) Successful in 36s

Details

CI / Backend Unit Tests (pull_request) Failing after 2m47s

Details

CI / Unit & Component Tests (push) Failing after 2m33s

Details

CI / OCR Service Tests (push) Successful in 34s

Details

CI / Backend Unit Tests (push) Failing after 2m41s

Details

c5e6ed922b

Replace exact-string assertions in test_correctable_ocr_error_gets_corrected
and test_sentence_with_multiple_corrections with structural assertions that
verify behavior (correction attempted, marker present, expected stem) without
coupling to a specific pyspellchecker version's frequency weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-04-17 17:24:27 +02:00

Review feedback addressed

Three concerns from review cycle 1 have been resolved:

fea24ae — refactor(ocr): make collapse_adjacent_markers a public function
Drops the underscore prefix from _collapse_adjacent_markers in confidence.py and updates the import in spell_check.py. Resolves the module-boundary concern raised by @Markus and @Felix.

ec85f22 — refactor(ocr): document > 50 frequency threshold rationale
Adds an inline comment explaining the strict > 50 threshold: it prevents non-deterministic corrections when multiple candidates tie at the frequency floor. Resolves @Felix's suggestion.

c5e6ed9 — test(ocr): decouple correction tests from exact library dictionary state
Replaces exact-string assertions in test_correctable_ocr_error_gets_corrected and test_sentence_with_multiple_corrections with structural assertions (not illegible, marker present, expected stem). Resolves @Sara's suggestion.

Deferred: main.py integration test coverage for spell-check routing tracked as issue #262.

All 37 tests green.

## Review feedback addressed Three concerns from review cycle 1 have been resolved: **fea24ae** — `refactor(ocr): make collapse_adjacent_markers a public function` Drops the underscore prefix from `_collapse_adjacent_markers` in `confidence.py` and updates the import in `spell_check.py`. Resolves the module-boundary concern raised by @Markus and @Felix. **ec85f22** — `refactor(ocr): document > 50 frequency threshold rationale` Adds an inline comment explaining the strict `> 50` threshold: it prevents non-deterministic corrections when multiple candidates tie at the frequency floor. Resolves @Felix's suggestion. **c5e6ed9** — `test(ocr): decouple correction tests from exact library dictionary state` Replaces exact-string assertions in `test_correctable_ocr_error_gets_corrected` and `test_sentence_with_multiple_corrections` with structural assertions (not illegible, marker present, expected stem). Resolves @Sara's suggestion. **Deferred**: `main.py` integration test coverage for spell-check routing tracked as issue #262. All 37 tests green.

marcel commented

2026-04-17 17:25:21 +02:00

🏗️ Markus Keller — Senior Application Architect (Review Cycle 2)

Verdict: ✅ Approved

The cycle 1 blocker is resolved. collapse_adjacent_markers is now a public function in confidence.py (no underscore) and the import in spell_check.py correctly uses the public name. The module contract is now consistent: readers of confidence.py see a public helper that is part of the effective API, and readers of spell_check.py see a normal cross-module import.

Everything else from cycle 1 stands — module structure is clean, main.py integration points are well-placed, and the de_historical.txt artifact decision is pragmatic and documented.

Ready to merge.

## 🏗️ Markus Keller — Senior Application Architect (Review Cycle 2) **Verdict: ✅ Approved** The cycle 1 blocker is resolved. `collapse_adjacent_markers` is now a public function in `confidence.py` (no underscore) and the import in `spell_check.py` correctly uses the public name. The module contract is now consistent: readers of `confidence.py` see a public helper that is part of the effective API, and readers of `spell_check.py` see a normal cross-module import. Everything else from cycle 1 stands — module structure is clean, `main.py` integration points are well-placed, and the `de_historical.txt` artifact decision is pragmatic and documented. Ready to merge.

marcel commented

2026-04-17 17:25:27 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer (Review Cycle 2)

Verdict: ✅ Approved

All cycle 1 concerns addressed:

Naming: collapse_adjacent_markers is public — import in spell_check.py:8 uses the bare name. ✅
Threshold comment: > 50 # strict > avoids non-determinism when candidates tie at the frequency floor — exactly the right one-liner. ✅
Test assertions: test_correctable_ocr_error_gets_corrected and test_sentence_with_multiple_corrections now assert behavior (marker present, stem matches, not illegible) without coupling to the library's exact dictionary state. ✅

The correct_text function at ~50 lines is still a refactor candidate (_check_token helper would clean up the loop body), but that's a future-sprint note, not a merge blocker. The function is well-named, well-documented, and does nothing surprising.

TDD discipline held through both cycles. Ship it.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer (Review Cycle 2) **Verdict: ✅ Approved** All cycle 1 concerns addressed: - **Naming**: `collapse_adjacent_markers` is public — import in `spell_check.py:8` uses the bare name. ✅ - **Threshold comment**: `> 50 # strict > avoids non-determinism when candidates tie at the frequency floor` — exactly the right one-liner. ✅ - **Test assertions**: `test_correctable_ocr_error_gets_corrected` and `test_sentence_with_multiple_corrections` now assert behavior (marker present, stem matches, not illegible) without coupling to the library's exact dictionary state. ✅ The `correct_text` function at ~50 lines is still a refactor candidate (`_check_token` helper would clean up the loop body), but that's a future-sprint note, not a merge blocker. The function is well-named, well-documented, and does nothing surprising. TDD discipline held through both cycles. Ship it.

marcel commented

2026-04-17 17:25:29 +02:00

🔐 Nora Steiner — Application Security Engineer (Review Cycle 2)

Verdict: ✅ Approved

No new security-relevant changes in this cycle. The three commits (public rename, threshold comment, test softening) are all cosmetic or test changes — no new code paths, no new network calls, no new file I/O.

The security posture from cycle 1 is unchanged and clean: pure in-process transformation, static wordlist path, exact pin on pyspellchecker==0.9.0, fail-hard startup. Nothing to flag.

## 🔐 Nora Steiner — Application Security Engineer (Review Cycle 2) **Verdict: ✅ Approved** No new security-relevant changes in this cycle. The three commits (public rename, threshold comment, test softening) are all cosmetic or test changes — no new code paths, no new network calls, no new file I/O. The security posture from cycle 1 is unchanged and clean: pure in-process transformation, static wordlist path, exact pin on `pyspellchecker==0.9.0`, fail-hard startup. Nothing to flag.

marcel commented

2026-04-17 17:25:34 +02:00

🧪 Sara Holt — QA Engineer (Review Cycle 2)

Verdict: ✅ Approved

Both cycle 1 suggestions were addressed:

Test assertions decoupled — test_correctable_ocr_error_gets_corrected now uses four structural assertions, and test_sentence_with_multiple_corrections uses token-indexed assertions. These tests verify behavior without binding to a specific version's dictionary weights. The intent of each assertion is self-documenting ("[?]" in result, result.startswith("Haus")). ✅

Integration test deferred to issue #262 — the ticket is well-scoped: three parametrized tests (block / stream / guided), mock strategy spelled out, the three patch targets identified (_download_and_convert_pdf, kraken_engine.extract_blocks, correct_text). That's actionable and properly separated from this PR. ✅

37/37 passing. Ready to merge.

## 🧪 Sara Holt — QA Engineer (Review Cycle 2) **Verdict: ✅ Approved** Both cycle 1 suggestions were addressed: **Test assertions decoupled** — `test_correctable_ocr_error_gets_corrected` now uses four structural assertions, and `test_sentence_with_multiple_corrections` uses token-indexed assertions. These tests verify behavior without binding to a specific version's dictionary weights. The intent of each assertion is self-documenting (`"[?]" in result`, `result.startswith("Haus")`). ✅ **Integration test deferred to issue #262** — the ticket is well-scoped: three parametrized tests (block / stream / guided), mock strategy spelled out, the three patch targets identified (`_download_and_convert_pdf`, `kraken_engine.extract_blocks`, `correct_text`). That's actionable and properly separated from this PR. ✅ 37/37 passing. Ready to merge.

marcel commented

2026-04-17 17:25:37 +02:00

🛠 Tobias Wendt — DevOps & Platform Engineer (Review Cycle 2)

Verdict: ✅ Approved

No changes to CI or infrastructure files in this cycle. The ocr-tests job, requirements.txt pin, and de_historical.txt static artifact are all unchanged and were approved in cycle 1.

Nothing new to flag. Good to go.

## 🛠 Tobias Wendt — DevOps & Platform Engineer (Review Cycle 2) **Verdict: ✅ Approved** No changes to CI or infrastructure files in this cycle. The `ocr-tests` job, `requirements.txt` pin, and `de_historical.txt` static artifact are all unchanged and were approved in cycle 1. Nothing new to flag. Good to go.

marcel commented

2026-04-17 17:25:40 +02:00

🎨 Leonie Voss — UX Designer & Accessibility Strategist (Review Cycle 2)

Verdict: ✅ Approved

Still entirely backend Python — no frontend, no Svelte, no UI strings. Nothing to flag. Same as cycle 1.

## 🎨 Leonie Voss — UX Designer & Accessibility Strategist (Review Cycle 2) **Verdict: ✅ Approved** Still entirely backend Python — no frontend, no Svelte, no UI strings. Nothing to flag. Same as cycle 1.

marcel merged commit c5e6ed922b into main

2026-04-17 17:28:41 +02:00

marcel deleted branch feat/issue-254-german-spell-check

2026-04-17 17:28:42 +02:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marcel/familienarchiv#260