feat(ocr): feedback loop to improve spell-check dictionary from user corrections #259
Background
Issue #254 adds German spell-check post-processing for handwriting OCR. Corrected tokens are marked [?] (e.g. Haus[?]) and unresolvable tokens become [unleserlich]. Both markers are visible in the transcription editor.
This means every time a user edits a transcription, we implicitly know what the spell checker said and what the human said instead: a feedback signal we are currently discarding.
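As a rough illustration of that signal, a single token edit could be classified along these lines. This is only a sketch: the function name classify_edit, the label strings, and the assumption that before/after tokens have already been aligned one-to-one are all hypothetical. Only the [?] and [unleserlich] markers come from #254.

```python
from typing import Optional

UNREADABLE = "[unleserlich]"  # marker for unresolvable tokens (from #254)
UNCERTAIN_SUFFIX = "[?]"      # marker appended to spell-checker corrections (from #254)


def classify_edit(before: str, after: str) -> Optional[str]:
    """Classify one aligned token edit.

    Returns a hypothetical label, or None when the edit carries no signal
    (the original token had no marker, or an unreadable token stayed unresolved).
    """
    if before == UNREADABLE:
        if not after or after == UNREADABLE:
            return None  # deleted or still unreadable: nothing confirmed
        return "unreadable_resolved"  # a human supplied a real reading

    if before.endswith(UNCERTAIN_SUFFIX):
        suggestion = before[: -len(UNCERTAIN_SUFFIX)]
        if after in (before, suggestion):
            # Saved unchanged (or with only the marker removed). Whether this
            # passive acceptance should count is an open question in this issue.
            return "correction_accepted"
        return "correction_rejected"  # the spell checker's suggestion was wrong

    return None  # token was never touched by the spell checker
```

For example, classify_edit("Haus[?]", "Hans") yields "correction_rejected", while an edit to an unmarked token yields None.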
The Opportunity
Two high-value signals from user edits:
- Replacing [unleserlich] with a real word: the word appeared in a real document and a human confirmed the reading. Strong candidate for dictionaries/de_historical.txt.
- Changing Haus[?] to something else: the spell checker's correction was wrong. Useful for tuning the frequency threshold or identifying bad corrections.
Proposed Approach (semi-automatic)
A fully automatic loop (one acceptance → added to dictionary) would be too noisy at this project's user scale. A semi-automatic loop is safer:
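- When a user edits a [?] or [unleserlich] token, record the before/after in a spell_corrections table (or similar).
- Surface candidates for review, e.g. "word replaced [unleserlich] across N distinct documents by different users" (N=3–5 to filter one-off typos).
- A human reviews the candidates, adds accepted words to de_historical.txt, and commits. No automated dictionary writes.
Open Questions for Discussion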
- Should [?] acceptances (user saves without changing the token) count as a positive signal, or are they too passive?
Dependency
Depends on #254 being merged first.
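Sketch of the logging and candidate query from the proposed approach, to make the discussion more concrete. Everything here is an assumption rather than a settled design: SQLite as the store, the column names, the helper names record_correction and dictionary_candidates, and the min_users=2 reading of "different users". The table is called spell_corrections to match the proposal, and the default threshold uses the low end of N=3–5.

```python
import sqlite3

MIN_DOCUMENTS = 3  # N from the proposal (3-5); lower bound chosen for illustration

# Hypothetical schema: one row per edited [?] / [unleserlich] token.
SCHEMA = """
CREATE TABLE IF NOT EXISTS spell_corrections (
    id           INTEGER PRIMARY KEY,
    document_id  INTEGER NOT NULL,
    user_id      INTEGER NOT NULL,
    token_before TEXT NOT NULL,
    token_after  TEXT NOT NULL
)
"""


def record_correction(conn, document_id, user_id, before, after):
    """Store the before/after pair for one edited token."""
    conn.execute(
        "INSERT INTO spell_corrections "
        "(document_id, user_id, token_before, token_after) VALUES (?, ?, ?, ?)",
        (document_id, user_id, before, after),
    )


def dictionary_candidates(conn, min_documents=MIN_DOCUMENTS, min_users=2):
    """Words that replaced [unleserlich] in enough distinct documents and users.

    Returns (word, distinct_documents, distinct_users) tuples for human review;
    nothing is written to the dictionary here.
    """
    return conn.execute(
        """
        SELECT token_after,
               COUNT(DISTINCT document_id),
               COUNT(DISTINCT user_id)
        FROM spell_corrections
        WHERE token_before = '[unleserlich]'
        GROUP BY token_after
        HAVING COUNT(DISTINCT document_id) >= ?
           AND COUNT(DISTINCT user_id) >= ?
        ORDER BY COUNT(DISTINCT document_id) DESC, token_after
        """,
        (min_documents, min_users),
    ).fetchall()
```

The query only produces a review list. A person would still check each word before adding it to de_historical.txt, in keeping with the no-automated-writes rule.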