As a developer I want the OCR service to expose a /metrics endpoint so Prometheus can scrape OCR throughput, error rate, and model quality #652
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Context
prometheus.ymlalready declares ajob_name: ocr-servicescrape target atocr:8000/metrics, but the endpoint does not exist — the target shows as DOWN in Prometheus. No code changes to the observability stack are needed; only the OCR service itself needs to be updated.The OCR service is a FastAPI (Python) app. Adding metrics requires three changes:
requirements.txt/metricsendpoint inmain.pyUser Story
Acceptance Criteria
AC1 —
/metricsendpoint exists and is scrapedGiven the observability stack is running,
When Prometheus scrapes
ocr:8000/metrics,Then the
ocr-servicetarget shows as UP in Prometheus (currently DOWN).Given a
GET /ocrorPOST /ocr/streamrequest is handled,When
/metricsis scraped,Then
http_requests_totalandhttp_request_duration_secondsare present for those endpoints (auto-instrumented byprometheus-fastapi-instrumentator).AC2 — OCR job counters
Given
POST /ocrorPOST /ocr/streamcompletes (success or error),When
/metricsis scraped,Then
ocr_jobs_totalis incremented, labelled by:engine:krakenorsuryascript_type:HANDWRITING_KURRENT,HANDWRITING_LATIN,TYPEWRITER,UNKNOWNAC3 — Page-level counters
Given each page is processed inside the streaming generator,
When
/metricsis scraped,Then:
ocr_pages_total(labelledengine) is incremented per successfully processed pageocr_skipped_pages_totalis incremented when a page raises an exception (skipped_pages += 1in bothgenerate()andgenerate_guided())AC4 — Confidence / quality counters
Given
apply_confidence_markers()processes a word list,When
/metricsis scraped,Then:
ocr_words_totalis incremented by the number of words in the blockocr_illegible_words_totalis incremented by the number of words below the confidence threshold (those replaced with[unleserlich])This ratio (
ocr_illegible_words_total / ocr_words_total) is the primary OCR quality signal.AC5 — Processing duration
Given the OCR engine runs in
asyncio.to_thread(engine.extract_blocks, ...),When
/metricsis scraped,Then
ocr_processing_seconds(Histogram, labelledengine) captures per-document processing time — measured around the thread call inrun_ocrand per-page inrun_ocr_stream.AC6 — Training metrics
Given
POST /trainorPOST /train-sendercompletes,When
/metricsis scraped,Then:
ocr_training_runs_total(labelledoutcome:success/error) is incrementedocr_model_accuracy(Gauge) is set to theaccuracyvalue returned by_parse_best_checkpoint()after a successful runAC7 — Model ready gauge
Given the lifespan startup completes and
_models_ready = True,When
/metricsis scraped,Then
ocr_models_ready(Gauge) is1.0. Before startup completes it is0.0.Implementation Notes
Dependency
Add to
requirements.txt:(
prometheus-clientis a transitive dependency — no separate entry needed.)Mounting the endpoint
In
main.py, directly afterapp = FastAPI(...):Custom metric declarations
Declare module-level in
main.py(importCounter,Histogram,Gaugefromprometheus_client):Increment these at the existing call sites — no structural changes to the request handling logic.
apply_confidence_markers()— word countingconfidence.pyalready iterateswords; the counter increments can be added there or at the call site inmain.py(caller has access toblock["words"]beforeapply_confidence_markersstrips them).Non-Functional Requirements
/metricsmust not add measurable latency to OCR requests —prometheus_fastapi_instrumentatoruses an async middleware, this is safe.prometheus.ymlordocker-compose.observability.ymlare needed — the scrape target is already configured./metricsendpoint must return HTTP 200 and includeocr_jobs_totalafter at least one OCR request in the existingtest_main.pyintegration tests (or a new dedicated test).Out of Scope
Once this is merged, #651 (PO overview dashboard) can be extended with Row 4 — OCR Health using the metrics exposed here. The four panels planned for that row are:
sum(increase(ocr_jobs_total[$__range]))sum(increase(ocr_skipped_pages_total[$__range])) / sum(increase(ocr_pages_total[$__range]))sum(increase(ocr_illegible_words_total[$__range])) / sum(increase(ocr_words_total[$__range]))ocr_models_ready👨💻 Felix Brandt — Senior Fullstack Developer
Observations
Instrumentator().instrument(app).expose(app)) must go afterapp = FastAPI(...)but before the lifespan runs. Looking atmain.py, theappobject is created at module level and the lifespan is passed to the constructor — this order is correct, but the two-liner must be inserted between line 73 (app = FastAPI(...)) and@app.get("/health"). Inserting it anywhere else may cause the/metricsendpoint to be registered after startup, which is fine for Prometheus scrapes but worth being explicit about.apply_confidence_markers()inconfidence.pyiterateswordsinternally. The issue proposes countingocr_words_totalandocr_illegible_words_totaleither insideconfidence.pyor at the call site inmain.pywhereblock["words"]is still present. The call-site approach inmain.pyis cleaner: the caller has access toblock["words"]beforeapply_confidence_markersstrips them, andconfidence.pyremains a pure function with no side effects. Recommendation: keep counters inmain.pyat the call site.generate()andgenerate_guided()) are closures insiderun_ocr_stream. Module-level counters are accessible in closures — no scope issue. Theocr_skipped_pages_totalincrement should go in bothexceptblocks (lines 259 and 205 ofmain.py). The issue already calls this out correctly.Counter,Histogram,Gaugeobjects at module level should be typed (e.g.,ocr_jobs_total: Counter) for IDE support and static analysis clarity.Recommendations
ocr_jobs_total: Counter = Counter(...). This follows the existing_models_ready: boolandALLOWED_PDF_HOSTS: setpattern.ocr_processing_secondsHistogram should wrap theasyncio.to_thread(...)call inrun_ocr(non-streaming) and each page'sasyncio.to_thread(...)call inrun_ocr_stream. Use a context manager:with ocr_processing_seconds.labels(engine=engine_name).time(): ...— but sinceasyncio.to_threadisawait-ed, usetime()around the awaited call or capturestart = time.monotonic()before and.observe()after.AsyncClient(ASGITransport(app=app))— consistent with existingtest_main.pypattern — and call/ocrwith a mocked PDF download and engine, then assert/metricsreturns 200 and containsocr_jobs_total.test_metrics_endpoint_returns_200_and_includes_ocr_jobs_total_after_ocr_request.Open Decisions
ocr_words_total: call site inmain.pyvs. insideconfidence.py. Call site is cleaner (pure function stays pure), butconfidence.pyis the only place that sees every word. The issue correctly identifies both options. Call site wins on principle — recommend it, but either works.🏗️ Markus Keller — Application Architect
Observations
prometheus.ymlordocker-compose.observability.ymlare needed — the scrape targetocr:8000/metricsis already configured. This is the right starting point: fix the thing that's broken (the service not exposing/metrics), not the observer.prometheus-fastapi-instrumentator==7.0.0pins the version — good.prometheus-clientis correctly identified as a transitive dependency. No separaterequirements.txtentry needed.docs/architecture/c4/l2-containers.puml) should already show theocr-servicecontainer. No new infrastructure component is being added — the/metricsendpoint is a new capability on an existing container, not a new service. This does not require a diagram update, which is correct.Instrumentator().instrument(app).expose(app)auto-instruments all FastAPI routes. This will also instrument/train,/train-sender, and/segtrainwith HTTP duration metrics. That's fine and useful, but worth knowing — training requests can take minutes and will appear as outliers inhttp_request_duration_seconds. This is expected behavior, not a bug.Recommendations
/metricsendpoint exposed byprometheus-fastapi-instrumentatordefaults to unauthenticated access on the same port (8000). Sinceocr:8000is on the internal Docker network and not exposed via Caddy, this is acceptable. No auth needed on/metricsfor internal scraping.ocr_models_readyGauge (AC7) is a clean health signal for Prometheus alerting rules — more useful than/healthfor time-series. Good addition.ocr_processing_secondsHistogram (AC5) is the most valuable metric here — it answers "is OCR getting slower over time?" which grepping logs cannot. Make sure the label isengineonly (notscript_type) to keep cardinality low; the issue spec has this right.Open Decisions
None. This is a well-bounded instrumentation task with a clear implementation path.
🔒 Nora "NullX" Steiner — Application Security Engineer
Observations
/metricsendpoint:prometheus-fastapi-instrumentatorexposes/metricswithout authentication by default. On this stack,ocr:8000is on the internal Docker network and is not routed through Caddy to the public internet. That makes unauthenticated access acceptable — only containers on thefamilienarchiv_defaultnetwork can reach it. No action needed, but this should be confirmed before any future port exposure.Counter,Histogram, andGaugemetrics are write-only from application code. They expose aggregate numeric data (counts, durations, accuracy scores) — no user input, no PII, no document content. The/metricsoutput is safe to expose internally.ocr_model_accuracyGauge (AC6): this value comes from_parse_best_checkpoint(), which parses filenames from a training temp directory. The parsed float is bounded by the regex[0-9.]+and cast tofloat(). No injection vector here.TRAINING_TOKENcheck on/trainand/train-senderis unchanged. The new metrics will showocr_training_runs_totalbut reveal only outcome counts, not training data or model weights. No security regression.enginelabel takes valueskrakenorsurya;script_typetakes one of four known values;outcometakessuccessorerror. All are application-controlled constants, not user-supplied strings. No label injection risk.Recommendations
networks:audit thatocr:8000has noports:mapping in the production Compose file. If the port is ever mapped to the host, add basic auth to the/metricsendpoint usingInstrumentator(..., should_respect_env_var=True)with an env-var guard./metricsendpoint should be excluded from the access log to avoid noise. FastAPI/Uvicorn logs every request by default —prometheus_fastapi_instrumentatordoes not suppress this. Consider a log filter inmain.py:logging.getLogger("uvicorn.access").addFilter(MetricsPathFilter()).Open Decisions
None. The threat model for this change is low-risk given the internal-network-only exposure.
🧪 Sara Holt — QA Engineer & Test Strategist
Observations
ocr_jobs_totalafter at least one OCR request" is a start, but the test strategy should cover more than the happy path. See recommendations below.test_main.pyusesAsyncClient(ASGITransport(app=app))withpatch()for heavy dependencies — this is the right pattern and should be followed for the new metrics tests.run_ocr,run_ocr_stream,generate_guided,apply_confidence_markerscall sites,/train, lifespan). Each one needs at least one test that proves the increment actually fires.prometheus_client. In Python tests, counters accumulate across test runs in the same process. This is a flakiness risk: iftest_Aincrementsocr_jobs_totalandtest_Basserts on its value, the assertion depends on run order. Mitigation: useREGISTRY.unregister()+ re-register in apytest.fixturewithautouse=True, or assert on_value.get()relative increments rather than absolute values.ocr_models_ready): the lifespan test pattern intest_main.pyalready patcheskraken_engine.load_modelsandload_spell_checker. The new test should assertocr_models_ready._value.get() == 1.0after the lifespan context completes.Recommendations
test_metrics_endpoint_returns_200test_ocr_jobs_total_incremented_after_successful_ocrtest_ocr_jobs_total_incremented_with_correct_engine_and_script_type_labelstest_ocr_pages_total_incremented_per_page_in_streamtest_ocr_skipped_pages_total_incremented_on_page_exceptiontest_ocr_words_and_illegible_words_total_reflect_confidence_markerstest_ocr_processing_seconds_histogram_observed_after_ocrtest_ocr_training_runs_total_incremented_on_success_and_errortest_ocr_model_accuracy_gauge_set_after_successful_trainingtest_ocr_models_ready_gauge_is_1_after_startuppytest.fixturethat callsprometheus_client.REGISTRY.unregister(counter)before each test that asserts counter values, then re-registers — or use theprometheus_client.CollectorRegistry()pattern to create test-scoped registries and pass them toCounter(...).test_metrics.py(separate fromtest_main.py) to keep test files focused.Open Decisions
registry=to everyCounter(...)declaration — which means the module-level declarations inmain.pywould need to accept an injectable registry. Relative-increment assertions are simpler but more fragile. Given this is a solo project, relative-increment assertions are pragmatic enough for now.🚀 Tobias Wendt — DevOps & Platform Engineer
Observations
prometheus.ymlscrape targetjob_name: ocr-serviceatocr:8000/metricsis already configured and showing as DOWN. This issue closes that gap without any infra changes — ideal.prometheus-fastapi-instrumentator==7.0.0is pinned with an exact version — correct.prometheus-clientas a transitive dep is fine; Renovate will pick up both.prometheus-fastapi-instrumentatortorequirements.txtmeans the Docker image must be rebuilt. Theprometheus-clientwheel is small (~100 KB), butprometheus-fastapi-instrumentatoritself may pull additional deps. Checkpip show prometheus-fastapi-instrumentatorfor transitive deps before the PR — the OCR image is already large (~5 GB for Surya/Torch) so this addition is negligible.prometheus-fastapi-instrumentatorauto-instruments/healthtoo. Prometheus scrapes of/healthby the existing healthcheck (Docker Composetest: ["CMD", "curl", "-f", "http://localhost:8000/health"]) will show up inhttp_requests_total{path="/health"}. This is noise in the metrics but harmless. Consider filtering/healthfrom instrumentation:Instrumentator(excluded_handlers=["/health", "/metrics"]).ocr_models_readyGauge: once the service is healthy (models loaded), this Gauge = 1.0. If the service restarts and hasn't finished loading yet, it will briefly be 0.0. This is a useful alerting signal:ocr_models_ready < 1for > 90s → alert. Pairs well with the Prometheus scrape target status already shown in issue comment.docker-compose.observability.yml: confirmed by the issue. Theocr-servicejob inprometheus.ymlis already there — just waiting for the endpoint to exist.Recommendations
http://localhost:9090/targets) before closing the issue.excluded_handlers=["/health", "/metrics"]toInstrumentator()to keephttp_requests_totalclean —/healthand/metricsare infrastructure noise, not application traffic.ocr_models_readyGauge should be the basis for a Prometheus alerting rule in a future issue:alert: OcrServiceNotReady, expr: ocr_models_ready < 1, for: 2m. Not in scope here, but flag it.prometheus-fastapi-instrumentatorversion7.0.0should be added to the Renovate config (or verified it's already covered by therequirements.txtpattern) so version bumps create PRs automatically.Open Decisions
None. The infra side of this issue is already done — the only remaining work is in the Python service.
📋 Elicit — Requirements Engineer
Observations
ocr_jobs_totalis labelled byengineandscript_type. Inmain.py, the engine is determined byuse_kraken(bool) and mapped tokraken_engineorsurya_engine. Theenginelabel value ("kraken"or"surya") needs to be derived from this — the issue doesn't specify exactly which string constant to use. Recommend agreeing on"kraken"and"surya"as the canonical label values (lowercase, matching engine module names).run_ocr" and "per-page inrun_ocr_stream" are different granularities. The issue says to measure "around the thread call inrun_ocrand per-page inrun_ocr_stream". This is correct, but the two measurement points produce incomparable data: one measures full-document duration, the other per-page. The Grafana query for "average OCR duration" needs to know which one to use. Clarification: the per-page histogram inrun_ocr_streamis more useful for operational insight. Therun_ocrmeasurement is a bonus./segtrain: the issue lists AC6 only forPOST /trainandPOST /train-sender. Looking atmain.py, there is a third training endpoint:/segtrain(segmentation model training). Shouldocr_training_runs_totalalso be incremented for/segtrain? The issue is silent on this.ocr_jobs_totalafter at least one OCR request" covers only AC1. ACs 2–7 have no corresponding NFR-TEST entries. This is acceptable for an MVP but leaves significant metric behavior untested.ocr_models_readyGauge (AC7) default value: before startup completes it should be0.0. Prometheus counters/gauges initialize to 0 by default inprometheus_client, soocr_models_readywill naturally be 0.0 until explicitly set to 1.0 in the lifespan. This should be called out in the implementation note.Recommendations
ocr_models_readyGauge initializes to0.0by default inprometheus_client; no explicit initialization to 0 is needed — only theset(1.0)call after startup completes."/segtraingap: either add it to AC6 explicitly, or add a note "out of scope — tracked in a follow-up." Silence here will cause the implementer to make an undocumented decision.enginelabel strings:"kraken"and"surya"(recommended), consistent with the Python module names inengines/.Open Decisions
/segtraincoverage inocr_training_runs_total: include or exclude? Segmentation training is a separate model (blla), a different endpoint, and less frequently used. Including it adds completeness; excluding it keeps AC6 simpler. Either is valid — but the implementer needs an explicit answer to avoid a silent decision.🎨 Leonie Voss — UX Designer & Accessibility Strategist
Observations
This is a pure developer-facing observability feature — no user-facing UI is introduced or modified. The
/metricsendpoint, Prometheus counters, and Grafana panels are consumed by developers and operators, not by Familienarchiv end users (the 60+ Kurrent transcribers or the millennial readers). This issue has no UX or accessibility implications.What I checked:
/metricsendpoint response is plain text (Prometheus exposition format) — not rendered in any user-facing context.No concerns from my angle. When #651 (Grafana dashboard) is implemented, I'll review the panel layout and information hierarchy at that point.
🗳️ Decision Queue — Action Required
3 decisions need your input before implementation starts.
Implementation
ocr_words_total/ocr_illegible_words_total— call site inmain.py(beforeapply_confidence_markersstripsblock["words"]) vs. insideconfidence.py(where every word is always visible). Call site keepsconfidence.pya pure function with no side effects; insideconfidence.pyensures the counts are never missed if the function is called from a future code path. Recommended: call site inmain.py. (Raised by: Felix)Requirements
/segtraincoverage inocr_training_runs_total(AC6) — AC6 specifiesPOST /trainandPOST /train-sender, butmain.pyalso hasPOST /segtrain(segmentation model fine-tuning). Shouldocr_training_runs_totalbe incremented there too? Including it adds completeness; excluding it keeps the first implementation simpler. Either choice is valid — but silence leaves the implementer guessing. (Raised by: Elicit)Testing
test_metrics.py— PrometheusCounter/Gaugeobjects are module-level singletons; values accumulate across tests in the same process. Two approaches: (A) inject a test-scopedCollectorRegistry()into every metric declaration (cleaner, requiresregistry=param on eachCounter(...)inmain.py), or (B) assert on relative increments (value_after - value_before) rather than absolute values (simpler, slightly more fragile). Given solo project constraints, option B is pragmatic. (Raised by: Sara)Call site is main.py. We should do seperate training runs, segtrain and real OCR training. We will do A injection.
✅ Implementation complete — PR #653
Implemented via TDD on
feat/issue-652-ocr-metrics. PR: #653Commits (red/green/refactor, atomic)
feat(ocr): expose /metrics endpoint via prometheus-fastapi-instrumentator(18c93d4e)test(ocr): assert http_* metrics appear after an /ocr request(4bb6685e)feat(ocr): add metrics.py factory with test-scoped CollectorRegistry support(f3e3545d)feat(ocr): increment ocr_jobs_total with engine and script_type labels(696b71da)test(ocr): assert ocr_jobs_total label is engine=surya for typewriter(52d8dc2b)feat(ocr): increment ocr_pages_total per successful page in stream(79edb945)feat(ocr): increment ocr_skipped_pages_total on per-page engine failure(3fa3460d)feat(ocr): count words and illegible words at the OCR call sites(131ed336)feat(ocr): observe ocr_processing_seconds around engine.to_thread calls(2e3744d9)feat(ocr): record training runs in ocr_training_runs_total per kind and outcome(6c2b9af1)test(ocr): assert ocr_model_accuracy gauge is set per kind on success(77d59c5d)feat(ocr): flip ocr_models_ready to 1 once the lifespan startup finishes(d6abf990)feat(ocr): suppress uvicorn access logs for /metrics and /health(525f091b)ops(observability): drop TODO from ocr-service scrape job in prometheus.yml(e75ac8ec)Decisions implemented as resolved
main.py— increment at the call sites in both/ocrand/ocr/streambeforeapply_confidence_markersstripsblock["words"];confidence.pystays a pure function./segtraintracked separately via label-based separation:ocr_training_runs_total{kind, outcome}andocr_model_accuracy{kind}withkind=recognition(/train+/train-sender) vskind=segmentation(/segtrain).CollectorRegistry()via newmetrics.pyfactory +fresh_metricspytest fixture usingmonkeypatch.setattr("main.metrics", build_metrics(CollectorRegistry())).Test results
ocr-service/test_metrics.py, all greentest_metrics.py + test_main.py + test_training_auth.py + test_confidence.py. The one failure (test_startup_logs_warning_when_running_as_root) was already failing onmainand is unrelated —ASGITransportdoes not auto-run the lifespan, so itscaplogassertion never sees the warning. (My newocr_models_readylifespan test drives the lifespan viaapp.router.lifespan_context(app)to avoid the same trap.)Next steps
http://localhost:9090/targets.ocr_*metric names for the Row 4 panels.