feat: page-by-page streaming OCR with real-time progress #88
Summary
- `POST /ocr/stream`: Python sends one JSON line per completed page; Java consumes the lines incrementally

Closes #231
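The line-per-page flow described in the summary can be sketched roughly as follows. This is a minimal, framework-free illustration and not the code in this PR: the event types `start`/`page`/`error`/`done` come from the description, but the field names (`total_pages`, `blocks`, `message`, `pages_processed`), the helper names `stream_events` and `consume`, and the choice to continue after a failed page are all assumptions.

```python
import json
from typing import Callable, Iterable, Iterator

# Headers named in this PR; they ask reverse proxies not to buffer the stream.
STREAM_HEADERS = {"X-Accel-Buffering": "no", "Cache-Control": "no-cache"}


def stream_events(page_count: int, ocr_page: Callable[[int], list]) -> Iterator[str]:
    """Yield one NDJSON line per event (hypothetical field names)."""
    yield json.dumps({"type": "start", "total_pages": page_count}) + "\n"
    processed = 0
    for page in range(1, page_count + 1):
        try:
            blocks = ocr_page(page)  # OCR a single page
        except Exception as exc:
            # Assumption: a failed page is reported and the stream continues.
            yield json.dumps({"type": "error", "page": page, "message": str(exc)}) + "\n"
            continue
        processed += 1
        yield json.dumps({"type": "page", "page": page, "blocks": blocks}) + "\n"
    yield json.dumps({"type": "done", "pages_processed": processed}) + "\n"


def consume(lines: Iterable[str]) -> Iterator[dict]:
    """Parse NDJSON incrementally: one event per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```

Because each event is a complete JSON document on its own line, a consumer can persist the blocks of page N before page N+1 has been processed, which is what makes per-page progress reporting possible.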
Changes by layer
Python OCR Service
- `extract_page_blocks()` from both Surya and Kraken engines for per-page processing
- `POST /ocr/stream` NDJSON endpoint with `start`/`page`/`error`/`done` event protocol
- `X-Accel-Buffering: no` + `Cache-Control: no-cache` headers for reverse-proxy compatibility

Java Backend
- `OcrStreamEvent` sealed interface with `Start`/`Page`/`Error`/`Done` record subtypes
- `OcrClient.streamBlocks()` default method synthesizes events from `extractBlocks()` (backward compat)
- `RestClientOcrClient.streamBlocks()` parses NDJSON with a dedicated `ObjectMapper` and falls back to `/ocr` on 404
- `OcrAsyncRunner.runSingleDocument()` rewritten to use streaming with per-page block persistence
- `runBatch` unchanged: stays on the old `extractBlocks()`

Frontend
- `translateOcrProgress` moved to a testable `$lib/ocr/` module with a structured return type
- `ANALYZING_PAGE:current:total:blocks` and `DONE:count:skipped` progress codes

Test plan
- `cd backend && ./mvnw test`: 831 tests pass
- `cd frontend && npm run test`: 698 tests pass
- `cd ocr-service && python -m pytest`: 38 tests pass
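
For reviewers, the colon-delimited progress codes listed under Frontend (`ANALYZING_PAGE:current:total:blocks` and `DONE:count:skipped`) could be decoded into a structured value along these lines. This is a language-neutral sketch written in Python, not the frontend's `translateOcrProgress`: the names `AnalyzingPage`, `Done`, and `parse_progress` are hypothetical, and it assumes every field after the code name is an integer.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class AnalyzingPage:
    current: int
    total: int
    blocks: int


@dataclass(frozen=True)
class Done:
    count: int
    skipped: int


def parse_progress(code: str) -> Optional[Union[AnalyzingPage, Done]]:
    """Decode a progress code; return None for unknown or malformed codes."""
    kind, *fields = code.split(":")
    try:
        values = [int(f) for f in fields]  # assumption: all fields are integers
    except ValueError:
        return None
    if kind == "ANALYZING_PAGE" and len(values) == 3:
        return AnalyzingPage(*values)
    if kind == "DONE" and len(values) == 2:
        return Done(*values)
    return None
```

Returning a typed value rather than a formatted string keeps the decoding step testable on its own, separate from any translation or display logic.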