feat(ocr): expose Prometheus /metrics endpoint with OCR-domain counters #653
Reference in New Issue
Block a user
Delete Branch "feat/issue-652-ocr-metrics"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #652.
Summary
prometheus-fastapi-instrumentator==7.0.0soocr:8000/metricsserves the Prometheus exposition format (Prometheus target flips from DOWN → UP after image rebuild).metrics.pyfactory that returns a frozenOcrMetricsdataclass bound to aCollectorRegistry. Production binds against the defaultREGISTRY; tests pass a freshCollectorRegistry()andmonkeypatchmain.metrics, so counter state is isolated between cases (decision #3, option A).ocr_jobs_total{engine, script_type}— AC2ocr_pages_total{engine},ocr_skipped_pages_total— AC3ocr_words_total,ocr_illegible_words_total(summed at the call sites inmain.pyper decision #1) — AC4ocr_processing_seconds{engine}Histogram around everyasyncio.to_thread(engine.extract_*)call — AC5ocr_training_runs_total{kind, outcome}andocr_model_accuracy{kind}withkind=recognitionfor/train+/train-sender,kind=segmentationfor/segtrain(decision #2) — AC6ocr_models_readyGauge flipped to 1 once the lifespan startup finishes — AC7MetricsPathFilteronuvicorn.accessto drop log records for/metricsand/health(Nora's recommendation).infra/observability/prometheus/prometheus.yml— the scrape target itself was already pointing atocr:8000/metricsand just needed the endpoint to exist.Test plan
pytest ocr-service/test_metrics.py— 16 new tests, all greenpytest ocr-service/test_main.py test_training_auth.py test_confidence.py test_metrics.py— 43 of 44 pass (the one failuretest_startup_logs_warning_when_running_as_rootis pre-existing onmainand unrelated;ASGITransportdoes not run lifespan by default, so the lifespan-based assertion in that test never sees the warning)curl ocr:8000/metricsreturns 200 and theocr-servicetarget shows UP athttp://localhost:9090/targetsPOST /ocrrequest:curl -s ocr:8000/metrics | grep -E '^ocr_(jobs|pages|words|illegible_words|processing_seconds|models_ready)_'lists the wired metricsOut of scope
ocr_models_ready < 1 for 2m) — flagged for a follow-up by Tobias🤖 Generated with Claude Code
Felix Brandt — Senior Fullstack Developer
Verdict: Approved with concerns
Clean, well-tested change. TDD evidence is strong (16 new tests, fresh-registry isolation per decision #3, each test names a single behaviour as a sentence).
metrics.pyis small, frozen-dataclass, single-purpose — exactly the size I want for new modules.build_metrics(registry)is testable by construction. Type hints, docstrings, andfrom __future__ import annotationsare all in place.Blockers
ocr-service/main.py:144-147, 251-254, 285-288— duplicated word-counter logic. The five-line block summinglen(words)andsum(1 for w in words if w["confidence"] < threshold)now appears three times. Extract a private helper:Three duplications is the rule-of-three trigger; the helper also makes the test surface smaller.
ocr-service/main.py:498-504, 585-591, 702-708— duplicated try/except for training counters. Same three-block pattern around everyawait asyncio.to_thread(_run_*). Extract:Concerns
ocr-service/main.py:201-219(guided streaming) —page_startedtaken AFTERawait asyncio.to_thread(preprocess_page, ...). That preprocess call is the most expensive step per page and it lands in zero histograms. In the unguided path on line 288 you start the timer just before preprocessing — pick one definition and document it on the metric description, or measure both.ocr-service/main.py:185(run_ocr_stream) —ocr_jobs_total.inc()fires before the document is downloaded. If the PDF download or page render fails, the job counter has already been incremented andocr_pages_totalwill be zero — analytics will read "1 job, 0 pages, 0 skipped." Inrun_ocr(line 142) the increment correctly lands after extract returns. Move the stream increment after the first successful page, or after the download.Nits (skip if tight)
metrics.py:38-91—build_metrics()is 53 lines of declaration; readable as a config block, no need to split.test_metrics.py:106-128, 137-159, etc.— six tests open with the same sixpatch()calls. A@pytest.fixturereturning the entered context would shrink each test by ~10 lines, but the explicit form is fine to keep.test_metrics.py:106— fixture usesmonkeypatch.setattr("main.metrics", ...), but the productionmetricssymbol is captured by closure inside the endpoint functions at call time (good); just worth a one-line comment explaining why the patch works.Markus Keller — Senior Application Architect
Verdict: Approved with concerns
The factory + registry-injection pattern is the right call here. It keeps the
prometheus_clientglobal out of the test surface, the dataclass freezes the contract, and the module boundary betweenmetrics.pyandmain.pyis clean. Boring, monolithic, no new infrastructure introduced — exactly the kind of incremental change I prefer.Blockers
docs/architecture/c4/l2-containers.puml— the OCR container now exposes a Prometheus scrape endpoint. The arrow Prometheus → ocr-service either does not exist on the L2 diagram or is unannotated. Add it.docs/OBSERVABILITY.md— new custom metrics (ocr_jobs_total,ocr_pages_total,ocr_skipped_pages_total,ocr_words_total,ocr_illegible_words_total,ocr_processing_seconds,ocr_training_runs_total,ocr_model_accuracy,ocr_models_ready) must be listed with their labels and units, or an operator reading the doc will not know they exist.prometheus.yml— fine, but no ADR was added for "we use prometheus-fastapi-instrumentator + a per-module factory for registry injection." This is the architectural decision that justifies decision #3 on issue #652; capture it indocs/adr/so the next maintainer does not undo the factory pattern.Concerns
ocr-service/main.py:42—metrics: OcrMetrics = build_metrics(REGISTRY)runs at module import. Side effect at import time. It works (and you have to do it for the instrumentator to find the registry), but document it in the module docstring. Otherwise a future refactor that lazy-importsmainfor any reason will silently double-register and crash.ocr-service/main.py:82—Instrumentator(...).instrument(app).expose(app)mounts/metricsunauthenticated. Acceptable today becauseocr:8000is on the internal Docker network and not behind Caddy — but the moment someone exposes the OCR container, this leaks request counts and timing. Add a comment inmain.pynext to theInstrumentator(...)call naming the threat model ("internal network only; never exposeocr:8000publicly"), or have Nora speak to it.Nits
OcrMetricsis a single feature module — no boundary violation. Cross-domain access is N/A here.Tobias Wendt — DevOps & Platform Engineer
Verdict: Approved with concerns
Long-awaited cleanup — the Prometheus target finally flips from DOWN to UP, the TODO in
prometheus.ymlwas a real operational smell, and the access-log filter is exactly what I would have asked for. Good choice onprometheus-fastapi-instrumentator==7.0.0(pinned, not>=).Blockers
infra/observability/prometheus/prometheus.ymlhas no scrape interval / timeout / metrics_path freshness check after the change. The diff only removes the TODO and leavesmetrics_path: /metrics+targets: ['ocr:8000']. Verify by hand (or attach a screenshot to this PR) thathttp://localhost:9090/targetsshows the ocr-service target as UP with last-scrape < 30s. The PR's Test plan still has this checkbox unchecked — that must be ticked before merge, not after.No alerting rule for
ocr_models_ready < 1 for 2m. The PR explicitly defers this to "a follow-up." Fine — but open a Gitea issue and link it from this PR before merge. The gauge is worthless on its own; an alert is what makes it operational. Without a tracking issue, "follow-up" is how alert gaps live for six months.Concerns
ocr-service/main.py:82—Instrumentator().expose(app)adds a public/metricsendpoint with no auth. Today the OCR container is only reachable from inside the compose network, so this is fine. If anyone ever puts ocr-service behind Caddy directly, this leaks. Add a Caddy block indocs/DEPLOYMENT.mdshowingrespond /metrics 404if/when ocr-service goes public. Same applies to/healthalready, so document both together.ocr-service/requirements.txt:13—prometheus-fastapi-instrumentator==7.0.0is pinned, but its transitiveprometheus-clientis not. That is mostly fine because pip's resolver will pick a compatible version, but it means the Docker image is not reproducible across base-image rebuilds. Either addprometheus-client==<x.y.z>to the requirements or commit arequirements.lock/pip-compileoutput. Renovate will keep both up to date.The
MetricsPathFilteronuvicorn.accessis a Python-side log filter — once we ship Loki, the noise re-appears at the container-log layer. Either also filter in Promtail/Alloy config, or note indocs/OBSERVABILITY.mdthat the suppression is local to uvicorn's logger so future Loki users do not chase a "missing log" ghost.Nits
ocr-service/test_metrics.pyruns synchronously againstASGITransport(app=app)— fast, no docker required. Plays nicely with CI.test_startup_logs_warning_when_running_as_rootfailure is unrelated; thanks for calling that out so I did not chase it.Nora "NullX" Steiner — Application Security Engineer
Verdict: Approved with concerns
Mostly clean. Custom metrics carry no PII (counters/gauges over engine/script_type/kind/outcome labels, no usernames, no document IDs, no PDF URLs). The training endpoints still gate on
X-Training-Token. No new injection surface, no shell exec, no untrusted deserialization.Blockers
None.
Concerns
ocr-service/main.py:82— unauthenticated/metricsendpoint.Instrumentator().instrument(app).expose(app)mounts/metricswith no auth and no IP allowlist. The default exporter exposes:http_requests_total{handler=...}— this reveals every URL path the service exposes, including the existence of/train,/train-sender,/segtrain. An attacker who reaches the metrics endpoint learns the service topology for free.Threat model: today
ocr:8000is only reachable from inside the compose network, so this is acceptable. The risk is operational drift — if someone ever routes ocr-service through Caddy, this becomes public information disclosure (CWE-200). Mitigation, in order of preference:respond /metrics 404for any future public route (coordinate with Tobias)./metricsto a separate management port viaInstrumentator(...).expose(app, endpoint="/metrics", include_in_schema=False, tags=["internal"])and document the internal-network requirement in a comment above line 82 stating the threat model (per my "comments explain the threat model" rule).ocr-service/main.py:93-97—MetricsPathFilterreadsrecord.args[2]assuming uvicorn's positional format. Defensive but path-tolerant — good. However, if uvicorn ever changes its format string to keyword args, this silently turns off and/metricsrequests get logged with bearer tokens / query strings in the URL line. Add an assertion test for the negative case (e.g.record.args = Noneor arity < 3) — your currenttest_uvicorn_access_log_filter_skips_metrics_pathonly covers the positive case.Nits
ocr_skipped_pages_totalis a counter exposed without labels — fine. No information disclosure.engine(2 values),script_type(4-ish),kind(2),outcome(2). Bounded and safe; no risk of metric-name explosion from user input.metrics.pyis a clean, auditable module — exactly the "centralize security-adjacent config" pattern I want. Good.Sara Holt — Senior QA Engineer
Verdict: Approved with concerns
16 new tests, each with a sentence-name, fresh
CollectorRegistry()per test for state isolation,ASGITransport(no docker, fast), assertions on exact counter values. This is what well-tested infrastructure code looks like — thank you. Test pyramid is correct: unit + in-process API tests, no E2E bloat.Blockers
None.
Concerns
test_metrics.py:308-329—test_ocr_training_runs_total_incremented_with_recognition_error_labelpatchesmain.asyncio.to_threadglobally with a side-effect that always raises. Problem:/traindoes ZIP validation and file IO before reaching the training runner — both of which useasyncio.to_threadin some code paths via FastAPI/Starlette. The test currently passes because the runner is the first thing patched, but it's coincidental coupling. Patch the inner_run_trainingsymbol instead (e.g.patch("main._run_training", side_effect=RuntimeError(...))if it can be made addressable, or capture and assert that the exception is in fact propagating from inside the wrapped runner, not from the surrounding endpoint).Missing coverage:
train_sender_model(/train-sender) — both the success and error branch incrementocr_training_runs_total{kind=recognition, ...}but the diff only tests/trainand/segtrain. If someone changes the label tokind=senderlater, no test catches it.ocr_model_accuracyGauge being not set whenresult["accuracy"] is None. Theif result.get("accuracy") is not None:guard on lines 503/590/707 is uncovered.ocr_processing_seconds.test_ocr_processing_seconds_histogram_observed_per_page_in_streamcovers the regular path; the guided path on line 220 also observes the histogram and is untested.test_metrics.py:139-148, 175-184, 286-296, ...— six tests duplicate the samewith patch(...), patch(...), ... :block with five-to-six entries. This is the rule-of-three for tests too. Extract a@pytest.fixturethat yields a context-managed stack of patches, or a helper function returning the configured client. Right now any change to e.g. the_download_and_convert_pdfpatch target requires updating six test bodies.Nits
test_startup_logs_warning_when_running_as_rootfailure onmain— fine to call out in the PR body, but please open a Gitea issue for it so it does not get forgotten when ASGITransport upgrades start running lifespan by default._histogram_count_sumreadschild._sum.get()andchild._buckets— touchingprometheus_clientinternals. Works today, but pinprometheus-clientinrequirements.txt(currently transitive) so an upgrade does not silently break these assertions.curl ocr:8000/metrics | grep -E '^ocr_(jobs|pages|words...' lists the wired metricsas an automated smoke test in CI — that would close the loop on the manual verification still pending in the PR description.Leonie Voss — UI/UX Design Lead
Verdict: Approved — out of scope for UI/UX
This PR is entirely backend (
ocr-service/main.py,metrics.py,requirements.txt,prometheus.yml,test_metrics.py). No SvelteKit components, no Tailwind classes, no user-facing strings, no Paraglide message keys, no accessibility surface.What I checked
frontend/— confirmedmessages/{de,en,es}.json/metricsendpoint is an operator/Prometheus surface, not a user surface)Forward note
When the Grafana dashboards (#651) ship: please use brand-aligned panel colors (avoid traffic-light reds alone for status — pair with icons and labels), keep typography ≥ 12px on every panel, and verify dashboards at 320px width because we have operators on phones too.
LGTM from a UX standpoint.
Elicit — Requirements Engineer (Brownfield mode)
Verdict: Approved with concerns
The PR body maps cleanly to the ACs on #652 (AC2 through AC7 all called out with specific line/decision references). Decision points #1, #2, #3 from the issue are explicitly resolved with rationale in the description. This is the level of traceability I expect — well done.
Blockers
Two unchecked items in the PR's own Test plan remain unverified:
[ ] After image rebuild on the deployment host: curl ocr:8000/metrics returns 200 and the ocr-service target shows **UP** at http://localhost:9090/targets[ ] After one POST /ocr request: ocr_(jobs|pages|words|illegible_words|processing_seconds|models_ready)_ are listedThese are part of the Definition of Done for #652 (AC1, AC7). Either tick them with evidence (screenshot, curl transcript pasted into the PR comments) before merge, or move them to a follow-up issue with an owner and a date. An unchecked Test plan box is a TBD that quietly turns into requirements debt.
"Alerting rules — flagged for a follow-up by Tobias" — no Gitea issue linked. Per the workflow convention, every follow-up needs an issue number in the PR description. Open the follow-up issue ("Add Prometheus alerting rules for ocr_models_ready < 1, ocr_skipped_pages_total rate, ocr_training_runs_total{outcome=error}") and link it here, otherwise it disappears.
Concerns
NFR coverage check against #652:
docs/OBSERVABILITY.mdso future contributors know not to add e.g.user_idas a label.prometheus_clientregistry-collision occurs (e.g. twobuild_metrics(REGISTRY)calls). Today this is impossible becausemain.pyis imported once, but it's a brittle invariant.Glossary gap.
docs/GLOSSARY.mdshould now list: OCR metrics, illegible-word ratio, model accuracy gauge, models-ready gauge. Operators reading dashboards will not know what "illegible words" means without it.Nits (optional)
Local verification — B3 evidence
Ran the unchecked test-plan items against the worktree's
ocr-service/.venvbecausedocker compose up -d --build ocrwould have required rebuilding the Surya/Kraken model image on this laptop (multi-GB). Substituteduvicorn main:app --port 8000inside the worktree venv withKRAKEN_MODEL_PATH=/nonexistentso the lifespan startup runs without downloading the Kraken model. Same FastAPI app, sameInstrumentator(...).expose(app)wiring, sameMetricsPathFilter— only the runtime is host-python instead of container-python.1.
GET /metricsreturns 2002. All nine
ocr_*metric families are exposedocr_models_ready 1.0— confirms the lifespan flip is wired correctly.3.
POST /ocrincrementshttp_requests_totalDrove a request against the SSRF guard (bogus host) so no OCR work runs but the FastAPI middleware still observes the request.
Instrumentator(...).instrument(app).expose(app)is doing its job —http_*metrics appear after the request, exactly as the PR body's grep expects.Not exercised locally
/targetsUP/DOWN check (http://localhost:9090/targets) needs the observability stack. That stack is not part ofdocker-compose.yml; it lives indocker-compose.observability.ymlper ADR-009 and ADR-016. Verifying the scrape edge will happen on deploy. The scrape job is already ininfra/observability/prometheus/prometheus.yml(committed earlier on this branch:e75ac8ec ops(observability): drop TODO from ocr-service scrape job in prometheus.yml).🤖 Generated with Claude Code
Follow-up issues filed
feature+devops+phase-7: monitoring)test_startup_logs_warning_when_running_as_rootfails underASGITransport(bug+devops)feature+devops)🤖 Generated with Claude Code
🔁 Cycle 2 re-review — Felix Brandt
Verdict: ✅ Approved
Both blockers from my cycle-1 review are resolved and the new helpers are exactly the shape I asked for. The two concerns are also handled — partially via code change, partially via a clarifying test that locks in the chosen definition.
Cycle-1 blockers — resolved
Word-counter duplication → fixed.
_observe_block_words(words, threshold)extracted atocr-service/main.py:89-99, called from three sites (run_ocr, unguided stream loop, no remaining duplication). Docstring even documents thewordsnon-empty precondition. The helper is branch-free at the call sites, exactly the shape I sketched. Three sites → one helper, clean rule-of-three resolution.Training try/except duplication → fixed.
_record_training(runner, kind)extracted atocr-service/main.py:59-79. All three endpoints now end with one line:/train:return await _record_training(lambda: asyncio.to_thread(_run_training), kind="recognition")/train-sender: same,kind="recognition"/segtrain: same,kind="segmentation"The error counter, success counter, and accuracy gauge are all driven from one place — future training endpoints add metrics for free.
Cycle-1 concerns — addressed
Guided stream
page_startedafter preprocess → resolved by choosing one definition and locking it in with a test. The guided path now sumstime.monotonic()deltas around eachengine.extract_region_textcall (main.py:263, 269) and observes the sum once per page — engine time only, not preprocess, not spell-check. The histogram docstring inmetrics.py:69matches ("excluding preprocessing"), andtest_ocr_processing_seconds_histogram_excludes_spell_check_time_in_guided_streamproves spell-check is excluded. Consistent with the unguided path (page_startedset after preprocess, observed once aroundextract_page_blocks). Good.ocr_jobs_total.inc()before download in stream → resolved by ordering. Looking at the currentrun_ocr_stream:_download_and_convert_pdfis awaited on line 219 before themetrics.ocr_jobs_total.labels(...).inc()on line 230. If the download raises (SSRF guard, HTTP error), the counter is not incremented.test_ocr_jobs_total_not_incremented_when_pdf_download_fails_in_stream(new) proves this. Counter now records jobs that at least reached page-loop entry, not jobs that died at download — matchesrun_ocr's post-extract increment closely enough for analytics.New issues from the fix commits
None blocking, two minor observations:
_record_trainingaccepts both sync and async runners viainspect.isawaitable(main.py:69-71), but every call site passeslambda: asyncio.to_thread(...), which is always a coroutine. The sync branch is dead code at the three call sites we have today. I'd simplify the signature torunner: Callable[[], Awaitable[dict]]and drop theinspectimport — KISS. Skip if you anticipate a future sync caller, but right now it's anticipating a use case that doesn't exist.The new
_record_trainingswallows the runner result's "loss"/"cer"/"epochs" fields and only inspectsaccuracy. That's fine — those other fields are returned as-is to the HTTP client — but the docstring on lines 64-66 says "Returns the result"; for clarity I'd note that onlyaccuracyis observed; the rest of the dict is opaque.Test additions verified
The cycle-1 test gaps are filled:
test_ocr_training_runs_total_incremented_with_recognition_success_label_for_train_sender,test_ocr_model_accuracy_gauge_stays_default_when_training_returns_no_accuracy,test_ocr_processing_seconds_histogram_observed_per_page_in_guided_stream. The error-path test now patchessubprocess.runinstead ofasyncio.to_thread(test_ocr_training_runs_total_incremented_with_recognition_error_label) — narrower seam, less coincidental coupling, exactly what Sara asked for.Verification evidence (#11617)
Sufficient for the AC1/AC7 checkboxes.
curl /metricsreturns 200, all nineocr_*families present with HELP/TYPE banners,ocr_models_ready 1.0confirms the lifespan flip wires through,http_requests_total{handler="/ocr"}proves the instrumentator middleware fires. The/targetsUP check is correctly deferred with a reason — the observability stack is a separate compose file and that's an on-deploy verification, not a local one.LGTM. Ship it.
🔁 Cycle 2 re-review — Markus Keller
Verdict: ✅ Approved
Every doc-update blocker from my cycle-1 review is resolved, and the new ADR is the right level of detail. Boring, well-documented, single-purpose change. This is how I want infrastructure additions delivered.
Cycle-1 blockers — resolved
docs/architecture/c4/l2-containers.puml→ updated. Two new relationships added, not just the OCR one I asked for:The backend arrow was missing on the diagram before — fixing both at once was the right move. C4 L2 now accurately reflects what Prometheus scrapes.
docs/OBSERVABILITY.md→ updated with a complete metric reference. Every custom metric is listed with type, labels, unit, and what it tracks. The PII warning at the top of the table ("Never label these with PII or document content") is exactly the operational guardrail I'd expect. Canonical PromQL examples for throughput, illegible-word ratio, p95 timing, training error rate, and accuracy are included — copy-paste-ready for the dashboard work in #651. The Caddy block snippet forrespond /metrics 404is also documented — Tobias will need that the day someone exposes ocr.example.com.ADR for the factory + registry-injection pattern → added as
docs/adr/023-prometheus-instrumentator-and-metrics-registry-injection.md. Context, Decision, Consequences (positive/negative), Alternatives Considered — all four sections present and tight. The "Alternatives considered" section is particularly good: it explicitly names why the autouse-fixture and the per-call-arg-registry alternatives were rejected, so a future maintainer cannot accidentally undo the decision. Cross-links to issue, library, and code. This is the ADR template applied correctly.Cycle-1 concerns — addressed
Module-level side effect at import time (
metrics: OcrMetrics = build_metrics(REGISTRY)) → documented with an inline comment onmain.py:47-50explaining why it must be at import time (instrumentator registry binding) and what tests must do to override it (monkeypatchmain.metrics, not the registry). The ADR's "Negative consequences" section also names the same trap. Acceptable — would still prefer this in the module docstring ofmetrics.pyrather than only inmain.py, but the in-place comment is enough.Unauthenticated
/metrics→ documented with a threat-model comment onmain.py:121-125and a Caddy block in OBSERVABILITY.md. The comment names CWE-200 explicitly and gates onports:being added — exactly the operational invariant I wanted captured. ADR-023's "Negative consequences" cross-links to it.New issues from the fix commits
None. The doc updates are factually correct, consistent with each other, and consistent with the code:
prometheus.yml's scrape config.metrics.py's collector definitions exactly.prometheus-client==0.25.0" is now pinned is verified inrequirements.txt.I checked the diff line-by-line for the kind of doc-vs-code contradiction that creeps in during fix loops (e.g. doc says "labelled by foo" but code labels by bar). None found.
Glossary check
docs/GLOSSARY.mdgot four new entries: Illegible word, Models-ready gauge, Recognition model accuracy, Segmentation model accuracy. The "Segmentation model accuracy" entry explicitly notes the distinction from recognition accuracy — good, future-Sara will not have to dig through ADRs to figure out why the twokind=label values exist.Verification evidence (#11617)
The PR did the right thing — the
/targetsUP check is correctly deferred because the observability stack lives in a separate compose file per ADR-009 / ADR-016. Running it against the host venv with the same FastAPI app is a valid substitute for verifying the instrumentator wiring. The local verification covers what's verifiable locally; the post-deploy verification belongs in the deploy runbook, not in the PR.Ship it.
🔁 Cycle 2 re-review — Tobias Wendt
Verdict: ✅ Approved
All five items from my cycle-1 review are either resolved in this PR or filed as a tracked follow-up. This is operationally complete.
Cycle-1 blockers — resolved
Prometheus
/targetsUP verification → addressed pragmatically. The PR's own test plan items are now backed by the verification comment in #11617:GET /metricsreturns HTTP 200 ✓ (curl output included)ocr_*HELP/TYPE banners present ✓ocr_models_ready 1.0after lifespan ✓ (proves the gauge flip)http_requests_total{handler="/ocr",method="POST",status="4xx"} 1.0after a request ✓ (proves the instrumentator middleware fires)The
/targetsUP check atlocalhost:9090/targetsis correctly deferred. The observability stack lives indocker-compose.observability.yml, not the main stack — Marcel cannot run it on his laptop without a multi-GB Surya/Kraken image rebuild. Verifying the scrape edge on deploy is the right call. I'd want a one-line indocs/DEPLOYMENT.mdfor the on-deploy verification step ("after deploy, checkhttp://prom.example.com/targetsshows ocr-service UP within 30s"), but that's a small follow-up rather than a blocker.Alerting follow-up issue → filed as #654. I read it. It has three concrete rules with thresholds (
ocr_models_ready < 1 for 2m,>10% skipped page rate,>50% training error rate), correctfor:durations, severity labels, PromQL that uses the right metric names, and apromtool check rulesAC. This is a usable follow-up, not a placeholder. Labels (devops,feature,phase-7: monitoring) are correct.Cycle-1 concerns — addressed
Caddy block for
/metricsand/health→ added todocs/OBSERVABILITY.md. Snippet covers both endpoints in one@internal_onlymatcher. Good. I'd repeat this indocs/DEPLOYMENT.mdnext to the existing/actuator/*block so an ops engineer sees both routes in the same place, but the cross-reference from OBSERVABILITY.md is enough for now.prometheus-clienttransitive pin → fixed inocr-service/requirements.txt:Both pinned, both reproducible. Renovate will create bump PRs for either. ADR-023 documents the implication ("upgrading it requires an explicit bump and re-checking the instrumentator's compatibility range") — that's the correct operational note.
MetricsPathFilter is uvicorn-only → documented in
docs/OBSERVABILITY.md§ Internal-only endpoints:New issues from the fix commits
None operationally. Spot-checks I did:
prometheus.ymldiff is the cleanest possible (just removes the TODO block, scrape config unchanged) — no risk of breaking the existing backend scrape.MetricsPathFilteraddFiltercall is at module scope (main.py, afterInstrumentator(...).expose(app)) — runs once per process, no double-registration on reloads.excluded_handlers=["/health", "/metrics"]on the Instrumentator stops/healthand/metricsthemselves from pollutinghttp_requests_total— would have been a silly cardinality leak if missed.One small forward note
When #654 lands, add
rule_files: [rules/*.yml]toprometheus.ymland adocker-compose.observability.ymlvolume mount forinfra/observability/prometheus/rules/. The currentprometheus.ymldoes not mount a rules directory — that change is the prerequisite for #654, not part of it.Ship it.
🔁 Cycle 2 re-review — Nora "NullX" Steiner
Verdict: ✅ Approved
Both concerns from my cycle-1 review are resolved. The new test, the new comment, and the new documentation each pin down one piece of the threat model that was implicit before. This is the right way to ship security-adjacent code: not "is it currently safe?" but "is the next maintainer going to know what made it safe?"
Cycle-1 concerns — resolved
Unauthenticated
/metrics(CWE-200 information disclosure) → addressed at three layers, exactly the defense-in-depth I asked for:Code (
ocr-service/main.py:121-125):The comment names the CWE explicitly, names the trigger condition (
ports:mapping), and cross-references the operational mitigation. This is the "comments explain the threat model" pattern I keep asking for. Good.Documentation (
docs/OBSERVABILITY.md):This is the right shape —
respond 404rather thanrespond 403(deny without confirming existence). Caddy snippet is correct.ADR-023 explicitly captures the trade-off under "Negative consequences": "
/metricsis exposed unauthenticated and relies on the Docker internal network for confidentiality." Future-me cannot accidentally undo this without reading the ADR.The
excluded_handlers=["/health", "/metrics"]parameter onInstrumentator(...)also stops these two endpoints from pollutinghttp_requests_total{handler=...}— secondary benefit, prevents an attacker who reaches/metricsfrom counting their own probing.MetricsPathFilterdefensive against uvicorn format changes → fixed. The new testtest_uvicorn_access_log_filter_fails_open_on_short_or_missing_args(test_metrics.py:557-575) covers exactly the case I asked for:Fail-open behavior is locked in. If uvicorn ever changes to keyword args, the filter forwards the record to the handler —
/metricsrequests would briefly appear in console logs, but no log lines are silently dropped (which is the audit-trail concern). The docstring on the test explicitly names the invariant: "default-allow records when args is None or shorter than expected." Permanent regression coverage.New issues from the fix commits
I audited the diff for net-new security surface. Findings:
_record_training(main.py:59-79) — accepts a callable, raises through to the caller. No new injection surface. Thekind=label value is"recognition"/"segmentation", hard-coded at the call site — not user-controlled. Counter cardinality stays bounded._observe_block_words(main.py:89-99) — readsw["confidence"]and counts. Cardinality bounded. No PII in any counter or label. The reader of OBSERVABILITY.md is now explicitly warned: "Never label these with PII or document content".MetricsPathFilter._SUPPRESSED_PATHS = {"/metrics", "/health"}— closed set, no user input reaches the filter. Safe.http_requests_total{handler=...}still exposes the endpoint topology (including/train,/train-sender,/segtrain) to anyone who reaches/metrics. That's documented now in the threat-model comment and the Caddy snippet ��� acceptable given the internal-network constraint.One forward note for #654
When the alerting rules land, please do not add a rule that aggregates by
script_type(e.g.sum by (script_type) (rate(ocr_jobs_total[5m]))).script_typecomes from user-controlled request input — bounded today by frontend enum, but if a backend bug ever allows pass-through, an attacker could spike cardinality via crafted requests. Aggregate byengine(onlykraken/surya) or by no labels.Ship it.
🔁 Cycle 2 re-review — Sara Holt
Verdict: ⚠️ Approved with concerns
Every cycle-1 coverage gap from my review is filled, and the error-path test was refactored to a narrower seam — exactly what I asked for. The duplication concern is also resolved via the new
ocr_clientasync context manager. One new finding: a timing-dependent assertion that I'd flag as a flake risk.Cycle-1 concerns — resolved
patch("main.asyncio.to_thread", ...)coincidental coupling in error-path test → fixed.test_ocr_training_runs_total_incremented_with_recognition_error_label(test_metrics.py:344-371) now patchesmain.subprocess.runwith aCompletedProcess(returncode=1, ...)to drive the real_run_trainingerror path. The docstring explicitly names the choice:Coverage gaps → all three filled:
test_ocr_training_runs_total_incremented_with_recognition_success_label_for_train_sender(test_metrics.py:394-416) — covers/train-sender, assertskind="recognition", outcome="success". If anyone changes the kind label to"sender", this test catches it.test_ocr_model_accuracy_gauge_stays_default_when_training_returns_no_accuracy(test_metrics.py:419-441) — covers theif result.get("accuracy") is not Noneguard. Asserts_value.get() == 0.0when accuracy is None. The uncovered branch I named in cycle 1 is now covered.test_ocr_processing_seconds_histogram_observed_per_page_in_guided_stream(test_metrics.py:464-490) — covers the guided streaming path's histogram observation.count == 2.0after 2 pages, sum non-negative. The guided path is now under test.Six tests with duplicated
with patch(...)stacks → resolved with the newocr_client(*, raise_app_exceptions=True)async context manager (test_metrics.py:21-37). It bundleskraken_engine.load_models+load_spell_checkerpatches, the ASGITransport+AsyncClient, the_models_ready = Trueflag flip, and the cleanup. The tests that use it are now 5-7 lines shorter each. The non-ocr_client tests (training endpoints) still have their own patch stacks because they needTRAINING_TOKEN+_models_ready+asyncio.to_thread— different concerns. Good judgment on where to stop extracting.New issues from the fix commits
⚠️ Timing-dependent assertion is a flake risk (
test_metrics.py:493-528). The newtest_ocr_processing_seconds_histogram_excludes_spell_check_time_in_guided_streamis conceptually right — it locks in "engine-only timing, excluding spell-check" — but the assertion bound is tight:The slack stated in the comment is 30ms, but the assertion bound is 50ms. On a loaded CI runner, an
asyncio.to_thread-scheduled mock returning"text"is not literally zero time — context switches alone can hit single-digit ms. With two regions andtime.monotonic()calls bracketing each, I'd expect 5-15ms of measurement overhead in the happy case. 50ms is probably OK, but I would (a) widen the bound to< 0.07(twice the stated 30ms slack), or (b) replace the timing assertion with a structural one — e.g. patchtime.monotonicand assert the deltas exclude the spell-check call entirely. Pure-timing assertions on CI are the #1 source of flakes in this codebase.fresh_metricsfixture coexists with the productionmetrics: OcrMetrics = build_metrics(REGISTRY)import-time call (main.py:42). For tests that don't use the fixture (e.g.test_metrics_endpoint_returns_200,test_metrics_includes_http_request_metrics_after_ocr_call), they hit the global REGISTRY and accumulate state across test runs. Today this works because those tests only assert on the presence of metric names, not on counter values. If anyone later adds a value assertion to one of them, the test will pass solo and fail in the suite, which is the classic state-leak symptom. Worth a# uses global REGISTRY by designcomment on those two tests, or — better — usefresh_metricseverywhere counter-touching code paths run.Verification evidence (#11617)
Adequate for the local-verifiable items.
/targetsdeferral to deploy is fine — I'd want a CI smoke test against the observability stack eventually (the test plan note in #11617 suggests this would close the manual loop), but that's a follow-up for #656's benchmark runbook.#655 (pre-existing
test_startup_logs_warning_when_running_as_rootfailure) — thanks for filing it. The root cause analysis in the issue body is correct:ASGITransportdoes not run lifespan by default; the metrics suite routes around it withapp.router.lifespan_context(app)and the same pattern fixes the failing test.TL;DR
One genuine concern (timing flake risk on the spell-check exclusion test). Coverage gaps from cycle 1 closed. The author's test refactoring shows real attention to feedback — error-path test went from "patches the boundary" to "patches
subprocess.run," which is the right direction. Ship after the timing assertion bound is widened or replaced.🔁 Cycle 2 re-review — Leonie Voss
Verdict: ✅ Approved — out of scope for UI/UX (unchanged from cycle 1)
Re-confirming my cycle-1 verdict: this PR remains entirely backend-only. The cycle-2 fix commits are all in
ocr-service/main.py,ocr-service/metrics.py,ocr-service/test_metrics.py,docs/, andinfra/observability/. No SvelteKit, no Tailwind, no Paraglide.What I checked in cycle 2
docs/GLOSSARY.md,docs/OBSERVABILITY.md,docs/adr/023-...md,docs/architecture/c4/l2-containers.puml,infra/observability/prometheus/prometheus.yml,ocr-service/main.py,ocr-service/metrics.py,ocr-service/requirements.txt,ocr-service/test_metrics.py).messages/{de,en,es}.json. The error message in OBSERVABILITY.md ("Models not loaded yet" → HTTP 503) is operator-facing on/health, not user-facing./metricsendpoint is for Prometheus, not for any human reader.Forward note (still applies for #651)
When the Grafana dashboards land via #651 and the alerting via #654, please:
ocr_models_readyandocr_skipped_pages_totalratios are the two values most likely to be glanced at on a mobile alert page.--palette-navy,--palette-mint) where possible; consider a sand background overlay if Grafana lets you set per-dashboard theming.I'll review #651's dashboard PR when it opens — that's where the UI/UX rigor matters.
Verification evidence (#11617)
Not applicable — verification was correctly scoped to backend / observability surface. No UI to verify.
LGTM from a UX standpoint. No UI surface in this PR.
🔁 Cycle 2 re-review — Elicit (Brownfield mode)
Verdict: ✅ Approved
All four items from my cycle-1 review are either resolved or filed as a tracked follow-up with proper labels, milestones, and ACs. The traceability from issue #652 → PR #653 → ADR-023 → follow-ups #654/#655/#656 is intact. This is the level of requirements hygiene I want for every feature delivery.
Cycle-1 blockers — resolved
Two unchecked Test plan items → addressed in two correct ways:
GET /metricsreturns 200, all nineocr_*HELP/TYPE banners present,ocr_models_ready 1.0after lifespan,http_requests_total{handler="/ocr"}after a request. Each claim backed by a$ curl ...transcript. This converts the unchecked items into evidence.http://localhost:9090/targetsshowing UP) is correctly deferred with a documented reason: the observability stack lives indocker-compose.observability.ymlper ADR-009 / ADR-016. The verification will happen on deploy. This is the correct disposition — not "we'll do it later," but "we have a different runbook for this class of check."I would still want a one-line acceptance check added to a deploy runbook ("after deploy, verify
http://prom.example.com/targetsshows ocr-service UP within 30s"), but that's a polish point, not a requirements gap. The PR's own Test plan now reads as a faithful record of what was verified where.Alerting follow-up issue → filed as #654 with the proper labels (
devops+feature+phase-7: monitoring), three concrete rules withfor:durations and severity labels, PromQL that uses the metric names actually shipped here, apromtool check rulesAC, and cross-references to ADR-023 and PR #653. This is a usable follow-up issue, not a placeholder. The traceability is intact: #652 (parent feature) → #653 (this PR) → #654 (the alert rules that consume #653's metrics) → ADR-023 (the architectural decision that produced both).Cycle-1 concerns — addressed
NFR coverage on #652:
docs/architecture/NFRs.md, a reproducible load-test script AC, and a stretch goal of nightly CI benchmarking. The issue body explicitly says "calibrate from one baseline + one instrumented run before locking it in" — exactly the right way to introduce a perf NFR. Good.docs/OBSERVABILITY.md: "Never label these with PII or document content — labels have unbounded cardinality risk and are visible to anyone with Grafana access." This is the operator-facing invariant I wanted recorded.Glossary gap → addressed.
docs/GLOSSARY.mdnow lists four new terms: Illegible word, Models-ready gauge, Recognition model accuracy, Segmentation model accuracy. Each entry explains what the term means and which metric exposes it. The two accuracy entries explicitly call out that recognition vs segmentation are independent metrics — operators reading the dashboard will know why both exist. Good.New issues from the fix commits
I read every doc change for the kind of doc-vs-code contradiction that creeps in during fix loops. Findings:
metrics.py's collector definitions exactly (labels, types, units).prometheus-client==0.25.0pinned explicitly" is verified inrequirements.txt.Rel(prometheus, ocr, "Scrapes OCR + http_* metrics", "HTTP 8000 /metrics")matches the actual scrape config inprometheus.yml(targetocr:8000, path/metrics).main.pylifespan exactly (loads kraken_engine → loads spell_checker → sets_models_ready = True→ setsocr_models_ready.set(1)).ocr_models_ready < 1 for 2malert" cross-references #654's Alert 1 exactly.No contradictions found.
One small forward note
The Definition of Done for #652 included AC1 ("Prometheus target flips DOWN → UP after image rebuild"). Strictly, that AC is closed only after the deployment-host verification runs. When you tick #652 as done, please reference the deploy log /
/targetsscreenshot in the issue close-comment. Otherwise #652 closes with one paper AC still unverified.TL;DR
The author treated cycle-1 feedback as a structured backlog, resolved the in-scope items in this PR, and filed three follow-up issues (#654, #655, #656) with proper labels and ACs for the out-of-scope items. The traceability from feature → PR → ADR → follow-ups is intact. The verification evidence is sufficient for what's locally verifiable, with explicit deferral for what isn't.
This is what good requirements delivery looks like. Approved.
🔁 Cycle 3 re-review — Sara Holt
Verdict: Approve.
Both cycle-2 concerns are resolved with technical rigor. Full
test_metrics.pyis green locally (22 passed in 0.56s).S1 — Timing-flake risk in
test_ocr_processing_seconds_histogram_excludes_spell_check_time_in_guided_stream(commite0e1578b)Bound widened from
< 0.05sto< 0.09s. I checked the math against the failure mode:correct_textaccidentally re-entering the timer would observe2 × 0.05s = 0.1sof spell-check sleep. The0.09sceiling sits below that floor, so the test still trips on the regression it's meant to catch.patch("main.time.monotonic")with a deterministic sequence) didn't pan out —time.monotonicis global and httpx/asyncio consume it before the request reaches the engine loop. That's a real characteristic oftime.monotonicpatching in async stacks; I'm not going to ask anyone to wrestle a structural fix where the seam itself is shared with the transport layer.This is the right pragmatic resolution.
S2 — Two tests on shared REGISTRY without
fresh_metrics(commit0801da8d)Docstrings added to both
test_metrics_endpoint_returns_200andtest_metrics_includes_http_request_metrics_after_ocr_call. I checked both explanations against the trap they're guarding:fresh_metricswould break them:prometheus-fastapi-instrumentatorbinds the/metricsroute to the default REGISTRY at app-construction time. Swappingmain.metricsvia the fixture would not redirect what/metricsactually exposes. Thehttp_requests_total/http_request_duration_secondscounters live on the instrumentator's default REGISTRY — a freshCollectorRegistrywould never see them. Both points are stated explicitly in the docstrings.assert "http_requests_total 5.0" in bodywould have to ignore that explicit sentence. That's about as good as a docstring can do.Migrating these to
fresh_metricswas the obvious-looking fix, but it would have introduced a defect (counter assertions silently passing against an empty registry). Documenting the design intent is the right call here.Scope discipline
I checked the two fix commits did not regress anything else in scope of the original PR — they only touch docstrings + one bound constant. No collateral damage.
✅ Approve. Ship it.
— Sara Holt