feat: page-by-page streaming OCR with real-time progress #88

Merged
marcel merged 4 commits from feature/68-new-document-file-first into main 2026-03-27 10:04:33 +01:00

Summary

  • Replace all-at-once OCR HTTP response with NDJSON streaming (POST /ocr/stream) — Python sends one JSON line per completed page, Java consumes incrementally
  • Persist transcription blocks as each page arrives instead of waiting for all pages to finish, eliminating the 10-minute timeout on long documents
  • Show per-page progress in the frontend ("Seite 3 von 7 wird analysiert…", i.e. "Analyzing page 3 of 7…") instead of a generic "OCR-Analyse läuft" ("OCR analysis running")
  • Convert page numbers from 0-based to 1-based so they match the PDF viewer
  • Allow re-running OCR when transcription blocks already exist (collapsible trigger below block list)

Closes #231

Changes by layer

Python OCR Service

  • Extract extract_page_blocks() from both Surya and Kraken engines for per-page processing
  • Add POST /ocr/stream NDJSON endpoint with start/page/error/done event protocol
  • Per-page error handling: log traceback, yield generic error, continue with next page
  • X-Accel-Buffering: no + Cache-Control: no-cache headers for reverse-proxy compatibility
  • 38 tests, 88% coverage on production code
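
The start/page/error/done protocol and the per-page error handling listed above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the JSON field names (`type`, `totalPages`, `page`, `blocks`, `message`, `totalBlocks`, `skippedPages`), the `stream_ocr_events` helper, the callable signature of `extract_page_blocks`, and the choice to emit 1-based page numbers on the wire are all assumptions.

```python
import json
import logging
import traceback
from typing import Callable, Iterator

logger = logging.getLogger("ocr.stream")

# Response headers named in the PR; they ask reverse proxies not to buffer the stream.
STREAM_HEADERS = {"X-Accel-Buffering": "no", "Cache-Control": "no-cache"}


def stream_ocr_events(
    page_count: int,
    extract_page_blocks: Callable[[int], list[dict]],  # hypothetical signature
) -> Iterator[str]:
    """Yield one NDJSON line per event: start, then page or error per page, then done."""
    yield json.dumps({"type": "start", "totalPages": page_count}) + "\n"
    total_blocks = 0
    skipped = 0
    for index in range(page_count):
        page_number = index + 1  # assumed: 1-based numbers to match the PDF viewer
        try:
            blocks = extract_page_blocks(index)
        except Exception:
            # Log the full traceback server-side, send only a generic message,
            # and carry on with the next page.
            logger.error("OCR failed on page %d\n%s", page_number, traceback.format_exc())
            skipped += 1
            yield json.dumps(
                {"type": "error", "page": page_number, "message": "page failed"}
            ) + "\n"
            continue
        total_blocks += len(blocks)
        yield json.dumps({"type": "page", "page": page_number, "blocks": blocks}) + "\n"
    yield json.dumps(
        {"type": "done", "totalBlocks": total_blocks, "skippedPages": skipped}
    ) + "\n"
```

A generator like this could be handed to whatever streaming-response type the web framework provides, together with `STREAM_HEADERS`; because each line is a complete JSON document, the consumer can act on a page as soon as its line arrives.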

Java Backend

  • OcrStreamEvent sealed interface with Start/Page/Error/Done record subtypes
  • OcrClient.streamBlocks() default method synthesizes events from extractBlocks() (backward compat)
  • RestClientOcrClient.streamBlocks() parses NDJSON with dedicated ObjectMapper, falls back to /ocr on 404
  • OcrAsyncRunner.runSingleDocument() rewritten to use streaming with per-page block persistence
  • Batch path (runBatch) unchanged — stays on old extractBlocks()
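
The consuming side, with per-page persistence, might look roughly like the sketch below. It is written in Python purely for illustration (the real consumer is the Java `OcrAsyncRunner`), and the event field names, the `consume_stream` helper, the `persist_page` and `report` callbacks, and the reading of the `blocks` progress field as a running total are assumptions rather than details taken from the PR.

```python
import json
from typing import Callable, Iterable


def consume_stream(
    lines: Iterable[str],
    persist_page: Callable[[int, list], None],  # hypothetical: stores one page's blocks
    report: Callable[[str], None],              # hypothetical: publishes a progress code
) -> None:
    """Handle NDJSON events one line at a time, persisting blocks as pages arrive."""
    total_pages = 0
    block_count = 0
    skipped = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        event = json.loads(line)
        kind = event.get("type")
        if kind == "start":
            total_pages = event["totalPages"]
        elif kind == "page":
            # Persist immediately so earlier pages survive a mid-stream failure.
            persist_page(event["page"], event["blocks"])
            block_count += len(event["blocks"])
            report(f"ANALYZING_PAGE:{event['page']}:{total_pages}:{block_count}")
        elif kind == "error":
            skipped += 1
        elif kind == "done":
            report(f"DONE:{block_count}:{skipped}")
```

Because nothing is held back until the end of the stream, a connection that drops after page 3 still leaves the blocks for pages 1 to 3 stored, which is the behaviour the test plan checks with "Kill OCR mid-stream".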

Frontend

  • Extract translateOcrProgress to testable $lib/ocr/ module with structured return type
  • Parse ANALYZING_PAGE:current:total:blocks and DONE:count:skipped progress codes
  • Inline skipped-pages warning in amber when pages fail during OCR
  • Collapsible "OCR erneut ausführen…" ("Re-run OCR…") trigger in edit mode when blocks exist
  • i18n keys added for de/en/es
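
Parsing the two progress codes into a structured value could be sketched as below. The real `translateOcrProgress` lives in the frontend's `$lib/ocr/` module; this Python version only illustrates the parsing idea, and the `OcrProgress` type, its field names, the `parse_ocr_progress` helper, and the fallback to an "unknown" kind are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OcrProgress:
    kind: str  # "analyzing", "done", or "unknown"
    current: Optional[int] = None
    total: Optional[int] = None
    blocks: Optional[int] = None
    skipped: Optional[int] = None


def parse_ocr_progress(code: str) -> OcrProgress:
    """Parse ANALYZING_PAGE:current:total:blocks or DONE:count:skipped."""
    parts = code.split(":")
    try:
        if parts[0] == "ANALYZING_PAGE" and len(parts) == 4:
            current, total, blocks = (int(p) for p in parts[1:])
            return OcrProgress("analyzing", current=current, total=total, blocks=blocks)
        if parts[0] == "DONE" and len(parts) == 3:
            count, skipped = (int(p) for p in parts[1:])
            return OcrProgress("done", blocks=count, skipped=skipped)
    except ValueError:
        pass  # non-numeric fields fall through to "unknown"
    return OcrProgress("unknown")
```

Returning a structured value instead of a finished string lets the UI choose the translated message per locale and decide separately whether to show the skipped-pages warning, for example when `skipped` is greater than zero.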

Test plan

  • Trigger OCR on a multi-page PDF — verify per-page progress messages appear
  • Verify transcription blocks are created incrementally (visible after each page)
  • Kill OCR mid-stream — verify partial blocks are preserved, re-trigger works
  • Verify annotations appear on correct pages (1-based, not 0-based)
  • Verify re-run OCR option appears in edit mode when blocks exist
  • Run cd backend && ./mvnw test — 831 tests pass
  • Run cd frontend && npm run test — 698 tests pass
  • Run cd ocr-service && python -m pytest — 38 tests pass
marcel added 4 commits 2026-03-27 10:04:19 +01:00
Test 6 (delete annotation): the mouse-draw test can create multiple
annotations in CI. Changed the assertion to `countBefore - 1` instead
of a hard-coded 0, so the test is resilient to any pre-existing count.

Test 7 (hash versioning): `[data-testid^="annotation-"]` matched both
real annotation elements AND `annotation-outdated-notice` (which also
starts with "annotation-"), inflating the count to 2 instead of 0.
Added `:not([data-testid="annotation-outdated-notice"])` to exclude the
notice from the count assertion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a document is created without an explicit title (null or blank),
the service now derives the title from the uploaded filename using the
same titleFromFilename() logic already used by storeDocument — stripping
the extension for plain names and formatting structured names as
"Firstname Lastname (DD.MM.YYYY)".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(#68): lead new document form with file upload, all metadata optional
Some checks failed
CI / Unit & Component Tests (push) Failing after 1m17s
CI / Backend Unit Tests (push) Failing after 9h3m48s
CI / E2E Tests (push) Failing after 28m15s
c5e28ac18e
Restructure the "New Document" page so users can save quickly:

- FileSectionNew becomes the first element, redesigned as a prominent
  upload zone with an icon and large click target
- Title field is rendered standalone below the upload zone; it
  auto-populates from the filename (via parseFilename + stripExtension
  fallback) unless the user has already typed something
- All remaining metadata (who/when, description, transcription) moves
  into a collapsible "Weitere Details" ("More details") section that auto-expands when
  URL prefill data or a form error is present, or when filename parsing
  detects a date/person
- title is no longer required — the form can be saved with only a file
- DescriptionSection gains a `hideTitle` prop for use in this layout
- `form_label_title` translation key no longer carries a hardcoded `*`;
  the asterisk is rendered by the template only when `titleRequired` is
  set (currently only the edit form)
- E2E tests added for all three scenarios from the issue

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(#68): hide native file input, show selected filename in upload zone
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 2m47s
CI / Backend Unit Tests (push) Has been cancelled
CI / E2E Tests (push) Has been cancelled
CI / Unit & Component Tests (push) Has been cancelled
CI / E2E Tests (pull_request) Has been cancelled
CI / Backend Unit Tests (pull_request) Has been cancelled
a7eaa40852
The native browser file input showed an untranslatable "Browse…" button
and "No file selected" text. The input is now sr-only; the large upload
zone label acts as the sole click target. When a file is selected its
name replaces the prompt text inside the zone.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
marcel merged commit a7eaa40852 into main 2026-03-27 10:04:33 +01:00
marcel deleted branch feature/68-new-document-file-first 2026-03-27 10:04:34 +01:00
marcel changed title from feature/68-new-document-file-first to feat: page-by-page streaming OCR with real-time progress 2026-04-13 10:55:29 +02:00
Reference: marcel/familienarchiv#88