Commit Graph

4 Commits

Author SHA1 Message Date
Marcel
70689b8f7b feat(ocr): add SSRF protection for PDF URL downloads
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 1s
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 0s
Validates PDF download URLs against an ALLOWED_PDF_HOSTS allowlist
(default: minio,localhost,127.0.0.1) and disables redirect following
to prevent redirect-based SSRF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 12:29:42 +02:00
Marcel
69768a104d test(ocr): add business-logic tests for polygon extraction, Kraken routing, and confidence markers
Cover Surya polygon/word-level extraction, health endpoint states,
Kraken script-type routing, 503 when models not ready, 400 when
Kraken unavailable for Kurrent, and confidence marker application
during streaming. Production code coverage: 88%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:34:23 +02:00
Marcel
97e5138934 fix(ocr): use 1-based page numbers to match frontend PDF viewer
The PDF viewer uses 1-based currentPage (starting at 1) but the OCR
engines produced 0-based pageNumber from enumerate(). Annotations
created by OCR were assigned to page 0, which doesn't exist in the
viewer. Change enumerate() to start=1 in both engines and the
streaming endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 10:32:08 +02:00
Marcel
97c6cf6a65 feat(ocr): add NDJSON streaming endpoint POST /ocr/stream
Streams one JSON line per completed page instead of buffering the
entire result. Emits start/page/error/done events. On per-page
failure, logs the traceback but yields a generic error message and
continues with the next page. Adds X-Accel-Buffering: no and
Cache-Control: no-cache headers for reverse-proxy compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 09:57:57 +02:00