feat(ocr): feedback loop to improve spell-check dictionary from user corrections #259

Open
opened 2026-04-17 16:28:04 +02:00 by marcel · 0 comments
Owner

Background

Issue #254 adds German spell-check post-processing for handwriting OCR. Corrected tokens are marked [?] (e.g. Haus[?]) and unresolvable tokens become [unleserlich]. Both markers are visible in the transcription editor.

So whenever a user edits a marked token, we implicitly learn what the spell checker proposed and what the human wrote instead. We are currently discarding this feedback signal.

The Opportunity

Two high-value signals from user edits:

  1. User replaces [unleserlich] with a real word — the word appeared in a real document and a human confirmed the reading. Strong candidate for dictionaries/de_historical.txt.
  2. User changes Haus[?] to something else — the spell checker's correction was wrong. Useful for tuning the frequency threshold or identifying bad corrections.
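
A minimal sketch of how the two signals above could be told apart for a single token edit. The names `Signal` and `classify_edit` are hypothetical, and the third category (user removes the `[?]` but keeps the suggested word) is an assumption that is not part of the proposal:

```python
from enum import Enum
from typing import Optional

UNREADABLE = "[unleserlich]"  # marker for unresolvable tokens (from #254)
UNCERTAIN_SUFFIX = "[?]"      # suffix on spell-corrected tokens (from #254)


class Signal(Enum):
    DICTIONARY_CANDIDATE = "dictionary_candidate"  # [unleserlich] -> real word
    REJECTED_CORRECTION = "rejected_correction"    # Haus[?] -> a different word
    CONFIRMED_CORRECTION = "confirmed_correction"  # Haus[?] -> Haus (assumed extra case)


def classify_edit(before: str, after: str) -> Optional[Signal]:
    """Classify one token edit; return None when it carries no usable signal."""
    still_marked = UNREADABLE in after or after.endswith(UNCERTAIN_SUFFIX)
    if not after or after == before or still_marked:
        return None  # deleted, unchanged, or the user left a marker in place
    if before == UNREADABLE:
        return Signal.DICTIONARY_CANDIDATE
    if before.endswith(UNCERTAIN_SUFFIX):
        suggested = before[: -len(UNCERTAIN_SUFFIX)]
        if after == suggested:
            return Signal.CONFIRMED_CORRECTION
        return Signal.REJECTED_CORRECTION
    return None  # the edit did not touch a marked token
```

Only `DICTIONARY_CANDIDATE` events would feed the dictionary review; the other two would inform threshold tuning.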

Proposed Approach (semi-automatic)

A fully automatic loop (one acceptance → added to dictionary) would be too noisy at this project's user scale. A semi-automatic loop is safer:

  1. Log correction events — when a transcription save touches a [?] or [unleserlich] token, record the before/after in a spell_corrections table (or similar).
  2. Periodic review query — monthly or on-demand: "words that replaced [unleserlich] across N distinct documents by different users" (N=3–5 to filter one-off typos).
  3. Human-approved update — reviewer confirms the list, adds to de_historical.txt, commits. No automated dictionary writes.
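
Step 1 could be sketched as a token-level diff between the stored and the newly saved transcription. This assumes whitespace tokenisation and uses Python's standard `difflib`; the function name `correction_events` is hypothetical and the real save endpoint may expose the text differently:

```python
import difflib
import re
from typing import List, Tuple

# Matches either marker introduced by #254.
MARKER = re.compile(r"\[unleserlich\]|\[\?\]")


def correction_events(old_text: str, new_text: str) -> List[Tuple[str, str]]:
    """Return (before, after) pairs for replaced spans that contained a marker."""
    old_tokens = old_text.split()
    new_tokens = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    events = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # only replacements give a before/after pair
        before = " ".join(old_tokens[i1:i2])
        after = " ".join(new_tokens[j1:j2])
        if MARKER.search(before):
            events.append((before, after))
    return events
```

Adjacent edited tokens come back as one multi-word span, so a logger built on this would need to decide whether to split such spans or skip them.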

Open Questions for Discussion

  • Where do we capture the diff — in the transcription save endpoint, or as a separate event?
  • What schema is sufficient for the correction log? (word, document_id, user_id, timestamp, before, after)
  • Should [?] acceptances (user saves without changing the token) count as a positive signal, or are they too passive?
  • Is N=3 a reasonable threshold for a family archive with a small user base, or should it be purely manual review?
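
As a starting point for the schema and threshold questions, here is a sketch using SQLite through Python's standard `sqlite3` module. The column names, the `review_candidates` helper and the `min_users=2` default are assumptions for illustration; the project's actual database and naming may differ:

```python
import sqlite3

# Hypothetical correction log: one row per edited marker token.
SCHEMA = """
CREATE TABLE spell_corrections (
    id           INTEGER PRIMARY KEY,
    document_id  INTEGER NOT NULL,
    user_id      INTEGER NOT NULL,
    before_token TEXT NOT NULL,
    after_token  TEXT NOT NULL,
    created_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""

# Words that replaced [unleserlich] in enough distinct documents and by
# enough distinct users to be worth showing to a reviewer.
REVIEW_QUERY = """
SELECT after_token,
       COUNT(DISTINCT document_id) AS documents,
       COUNT(DISTINCT user_id)     AS users
FROM spell_corrections
WHERE before_token = '[unleserlich]'
GROUP BY after_token
HAVING COUNT(DISTINCT document_id) >= ?
   AND COUNT(DISTINCT user_id) >= ?
ORDER BY documents DESC, after_token
"""


def review_candidates(conn: sqlite3.Connection, min_documents: int = 3,
                      min_users: int = 2) -> list:
    """Return (word, documents, users) rows for the periodic manual review."""
    return conn.execute(REVIEW_QUERY, (min_documents, min_users)).fetchall()
```

Lowering `min_documents` to 1 turns this into a plain list for purely manual review, which may suit a small user base better than a fixed N.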

Dependency

Depends on #254 being merged first.

marcel added the feature label 2026-04-17 16:28:11 +02:00
Reference: marcel/familienarchiv#259