feat: page-by-page streaming OCR with real-time progress #88

marcel · 2026-03-27T10:04:18+01:00

Summary

Replace the all-at-once OCR HTTP response with NDJSON streaming (POST /ocr/stream): the Python service sends one JSON line per completed page and the Java backend consumes the lines incrementally
Persist transcription blocks as each page arrives instead of waiting for all pages to finish, eliminating the 10-minute timeout on long documents
Show per-page progress in the frontend ("Seite 3 von 7 wird analysiert…", i.e. "Analyzing page 3 of 7…") instead of a generic "OCR-Analyse läuft" ("OCR analysis running")
Switch page numbers from 0-based to 1-based to match the PDF viewer
Allow re-running OCR when transcription blocks already exist (collapsible trigger below block list)

Closes #231

Changes by layer

Python OCR Service

Extract extract_page_blocks() from both Surya and Kraken engines for per-page processing
Add POST /ocr/stream NDJSON endpoint with start/page/error/done event protocol
Per-page error handling: log traceback, yield generic error, continue with next page
Set X-Accel-Buffering: no and Cache-Control: no-cache headers so reverse proxies do not buffer the stream
38 tests, 88% coverage on production code
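
A minimal sketch of the start/page/error/done protocol with per-page error handling, assuming hypothetical JSON field names (`event`, `total_pages`, `page`, `blocks`, `skipped`, `message`) and a simplified `extract_page_blocks` callable; the PR does not spell out the actual payload shape or function signature, so treat all of these as illustrative:

```python
import json
import logging
from typing import Callable, Iterator

logger = logging.getLogger("ocr.stream")


def stream_ocr_events(
    pages: list,
    extract_page_blocks: Callable[[object], list],  # hypothetical signature
) -> Iterator[str]:
    """Yield one NDJSON line per event: start, then page or error per page, then done."""
    yield json.dumps({"event": "start", "total_pages": len(pages)}) + "\n"
    skipped = 0
    block_count = 0
    # Page numbers are 1-based so they match the PDF viewer.
    for page_number, page in enumerate(pages, start=1):
        try:
            blocks = extract_page_blocks(page)
        except Exception:
            # The full traceback goes to the log; the client only sees a generic message.
            logger.exception("OCR failed on page %d", page_number)
            skipped += 1
            yield json.dumps(
                {"event": "error", "page": page_number, "message": "page failed"}
            ) + "\n"
            continue  # keep going with the next page
        block_count += len(blocks)
        yield json.dumps({"event": "page", "page": page_number, "blocks": blocks}) + "\n"
    yield json.dumps({"event": "done", "blocks": block_count, "skipped": skipped}) + "\n"
```

A generator of this shape can be handed to a web framework's streaming response type, with the two headers listed above set on that response.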

Java Backend

OcrStreamEvent sealed interface with Start/Page/Error/Done record subtypes
OcrClient.streamBlocks() default method synthesizes events from extractBlocks() (backward compat)
RestClientOcrClient.streamBlocks() parses NDJSON with dedicated ObjectMapper, falls back to /ocr on 404
OcrAsyncRunner.runSingleDocument() rewritten to use streaming with per-page block persistence
Batch path (runBatch) unchanged — stays on old extractBlocks()
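
The consumer side is implemented in Java; the following is only a language-neutral illustration in Python of why persisting per page preserves partial results when the stream dies. The `persist_page` callback and the JSON field names (`event`, `page`, `blocks`) are assumptions, not the PR's actual types:

```python
import json
from typing import Callable, Iterable


def consume_ocr_stream(
    lines: Iterable[str],
    persist_page: Callable[[int, list], None],  # hypothetical persistence hook
) -> dict:
    """Persist blocks page by page and return a summary of what was processed."""
    summary = {"pages": 0, "skipped": 0, "finished": False}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        kind = event.get("event")
        if kind == "page":
            # Persist immediately so earlier pages survive a later failure.
            persist_page(event["page"], event["blocks"])
            summary["pages"] += 1
        elif kind == "error":
            summary["skipped"] += 1
        elif kind == "done":
            summary["finished"] = True
    return summary
```

If the connection drops after the first page event, the exception propagates to the caller, but the blocks already handed to `persist_page` stay stored, which is the behaviour the "kill OCR mid-stream" item in the test plan checks.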

Frontend

Extract translateOcrProgress to testable $lib/ocr/ module with structured return type
Parse ANALYZING_PAGE:current:total:blocks and DONE:count:skipped progress codes
Inline skipped-pages warning in amber when pages fail during OCR
Collapsible "OCR erneut ausführen…" ("Run OCR again…") trigger in edit mode when blocks exist
i18n keys added for de/en/es
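
The two progress codes above are colon-separated, so parsing them into a structured value is straightforward. The real module lives in the frontend under $lib/ocr/; this Python sketch only illustrates the idea, and the class and function names (`AnalyzingPage`, `Done`, `parse_progress_code`) are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class AnalyzingPage:
    current: int
    total: int
    blocks: int


@dataclass(frozen=True)
class Done:
    count: int
    skipped: int


def parse_progress_code(code: str) -> Optional[Union[AnalyzingPage, Done]]:
    """Parse a progress code; return None for unknown or malformed input."""
    kind, *fields = code.split(":")
    try:
        numbers = [int(f) for f in fields]
    except ValueError:
        return None  # a non-numeric field means the code is malformed
    if kind == "ANALYZING_PAGE" and len(numbers) == 3:
        return AnalyzingPage(*numbers)
    if kind == "DONE" and len(numbers) == 2:
        return Done(*numbers)
    return None
```

Returning a structured value rather than a finished string keeps the translation step separate, so a UI can decide on its own to show the skipped-pages warning whenever `skipped` is greater than zero.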

Test plan

Trigger OCR on a multi-page PDF — verify per-page progress messages appear
Verify transcription blocks are created incrementally (visible after each page)
Kill OCR mid-stream — verify partial blocks are preserved, re-trigger works
Verify annotations appear on correct pages (1-based, not 0-based)
Verify re-run OCR option appears in edit mode when blocks exist
Run cd backend && ./mvnw test — 831 tests pass
Run cd frontend && npm run test — 698 tests pass
Run cd ocr-service && python -m pytest — 38 tests pass


marcel added 4 commits 2026-03-27 10:04:19 +01:00

fix(e2e): fix two flaky annotation tests 065dd8fabd

Test 6 (delete annotation): the mouse-draw test can create multiple
annotations in CI. Changed the assertion to `countBefore - 1` instead
of a hard-coded 0, so the test is resilient to any pre-existing count.

Test 7 (hash versioning): `[data-testid^="annotation-"]` matched both
real annotation elements AND `annotation-outdated-notice` (which also
starts with "annotation-"), inflating the count to 2 instead of 0.
Added `:not([data-testid="annotation-outdated-notice"])` to exclude the
notice from the count assertion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(#68): fall back to filename as title when createDocument gets no title d6f4ea05d9

When a document is created without an explicit title (null or blank),
the service now derives the title from the uploaded filename using the
same titleFromFilename() logic already used by storeDocument — stripping
the extension for plain names and formatting structured names as
"Firstname Lastname (DD.MM.YYYY)".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(#68): lead new document form with file upload, all metadata optional c5e28ac18e

CI / Unit & Component Tests (push) Failing after 1m17s
CI / Backend Unit Tests (push) Failing after 9h3m48s
CI / E2E Tests (push) Failing after 28m15s

Restructure the "New Document" page so users can save quickly:

- FileSectionNew becomes the first element, redesigned as a prominent
  upload zone with an icon and large click target
- Title field is rendered standalone below the upload zone; it
  auto-populates from the filename (via parseFilename + stripExtension
  fallback) unless the user has already typed something
- All remaining metadata (who/when, description, transcription) moves
  into a collapsible "Weitere Details" section that auto-expands when
  URL prefill data or a form error is present, or when filename parsing
  detects a date/person
- title is no longer required — the form can be saved with only a file
- DescriptionSection gains a `hideTitle` prop for use in this layout
- `form_label_title` translation key no longer carries a hardcoded `*`;
  the asterisk is rendered by the template only when `titleRequired` is
  set (currently only the edit form)
- E2E tests added for all three scenarios from the issue

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(#68): hide native file input, show selected filename in upload zone a7eaa40852

CI / Unit & Component Tests (pull_request) Successful in 2m47s
CI / Backend Unit Tests (push) Has been cancelled
CI / E2E Tests (push) Has been cancelled
CI / Unit & Component Tests (push) Has been cancelled
CI / E2E Tests (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled

The native browser file input showed an untranslatable "Browse…" button
and "No file selected" text. The input is now sr-only; the large upload
zone label acts as the sole click target. When a file is selected its
name replaces the prompt text inside the zone.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel merged commit a7eaa40852 into main

2026-03-27 10:04:33 +01:00

marcel deleted branch feature/68-new-document-file-first

2026-03-27 10:04:34 +01:00

marcel changed title from ~~feature/68-new-document-file-first~~ to feat: page-by-page streaming OCR with real-time progress

2026-04-13 10:55:29 +02:00

Reference: marcel/familienarchiv#88