Commit Graph

8 Commits

Author SHA1 Message Date
Marcel
131ed336bc feat(ocr): count words and illegible words at the OCR call sites
Walks block["words"] before apply_confidence_markers strips the list, then
increments ocr_words_total by len(words) and ocr_illegible_words_total by
the count below threshold. Same pattern in both /ocr and /ocr/stream so the
ratio illegible/words is a faithful quality signal across endpoints.

Refs #652 (AC4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 16:07:59 +02:00
Marcel
3fa3460dbf feat(ocr): increment ocr_skipped_pages_total on per-page engine failure
Bumps the counter in both /ocr/stream except blocks (standard and guided
generators) so the existing skipped_pages local variable now also flows
into Prometheus.

Refs #652 (AC3b)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 16:06:50 +02:00
Marcel
79edb94558 feat(ocr): increment ocr_pages_total per successful page in stream
Bumps the counter inside both the standard and guided /ocr/stream
generators after a page yields its blocks, before the per-page json line is
emitted. Also moves the ocr_jobs_total increment for /ocr/stream right after
engine selection so the counter still fires when a page later errors out.

Refs #652 (AC3a)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 16:05:36 +02:00
Marcel
52d8dc2b20 test(ocr): assert ocr_jobs_total label is engine=surya for typewriter
Locks down AC2 for the non-Kurrent path. The same code branch in /ocr that
sets engine_name from script_type now has explicit coverage for both
HANDWRITING_KURRENT → kraken and TYPEWRITER → surya.

Refs #652 (AC2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 16:04:20 +02:00
Marcel
696b71da5a feat(ocr): increment ocr_jobs_total with engine and script_type labels
Pick engine="kraken" for HANDWRITING_KURRENT, engine="surya" otherwise,
then increment after the blocks have been extracted.

Refs #652 (AC2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 16:03:37 +02:00
Marcel
f3e3545d06 feat(ocr): add metrics.py factory with test-scoped CollectorRegistry support
Encapsulates every custom OCR metric in an OcrMetrics frozen dataclass and
exposes a `build_metrics(registry)` factory. Production main.py binds against
the default REGISTRY; tests construct a fresh CollectorRegistry per case and
monkeypatch main.metrics, so counter values stay isolated between tests
(decision #3 on issue #652, Option A).

Refs #652

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 16:02:20 +02:00
Marcel
4bb6685edb test(ocr): assert http_* metrics appear after an /ocr request
Locks down AC1: prometheus-fastapi-instrumentator must keep auto-exposing
http_requests_total and http_request_duration_seconds for application
traffic, not just register the /metrics endpoint.

Refs #652

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 16:00:33 +02:00
Marcel
18c93d4eaa feat(ocr): expose /metrics endpoint via prometheus-fastapi-instrumentator
Mount the instrumentator immediately after FastAPI app creation, excluding
/health and /metrics from request metrics to keep http_requests_total focused
on real application traffic.

Refs #652

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 15:59:37 +02:00