feat(ocr): expose Prometheus /metrics endpoint with OCR-domain counters #653

Merged
marcel merged 27 commits from feat/issue-652-ocr-metrics into main 2026-05-21 18:16:48 +02:00
Owner

Closes #652.

Summary

  • Mounts prometheus-fastapi-instrumentator==7.0.0 so ocr:8000/metrics serves the Prometheus exposition format (Prometheus target flips from DOWN → UP after image rebuild).
  • Adds a metrics.py factory that returns a frozen OcrMetrics dataclass bound to a CollectorRegistry. Production binds against the default REGISTRY; tests pass a fresh CollectorRegistry() and monkeypatch main.metrics, so counter state is isolated between cases (decision #3, option A).
  • Wires every metric named in the ACs:
    • ocr_jobs_total{engine, script_type} — AC2
    • ocr_pages_total{engine}, ocr_skipped_pages_total — AC3
    • ocr_words_total, ocr_illegible_words_total (summed at the call sites in main.py per decision #1) — AC4
    • ocr_processing_seconds{engine} Histogram around every asyncio.to_thread(engine.extract_*) call — AC5
    • ocr_training_runs_total{kind, outcome} and ocr_model_accuracy{kind} with kind=recognition for /train + /train-sender, kind=segmentation for /segtrain (decision #2) — AC6
    • ocr_models_ready Gauge flipped to 1 once the lifespan startup finishes — AC7
  • Adds a MetricsPathFilter on uvicorn.access to drop log records for /metrics and /health (Nora's recommendation).
  • Removes the TODO comment block in infra/observability/prometheus/prometheus.yml — the scrape target itself was already pointing at ocr:8000/metrics and just needed the endpoint to exist.

Test plan

  • pytest ocr-service/test_metrics.py — 16 new tests, all green
  • pytest ocr-service/test_main.py test_training_auth.py test_confidence.py test_metrics.py — 43 of 44 pass (the one failure test_startup_logs_warning_when_running_as_root is pre-existing on main and unrelated; ASGITransport does not run lifespan by default, so the lifespan-based assertion in that test never sees the warning)
  • After image rebuild on the deployment host: curl ocr:8000/metrics returns 200 and the ocr-service target shows UP at http://localhost:9090/targets
  • After one POST /ocr request: curl -s ocr:8000/metrics | grep -E '^ocr_(jobs|pages|words|illegible_words|processing_seconds|models_ready)_' lists the wired metrics

Out of scope

  • Grafana panels — tracked in #651
  • Alerting rules (e.g. ocr_models_ready < 1 for 2m) — flagged for a follow-up by Tobias

🤖 Generated with Claude Code

Closes #652. ## Summary - Mounts `prometheus-fastapi-instrumentator==7.0.0` so `ocr:8000/metrics` serves the Prometheus exposition format (Prometheus target flips from DOWN → UP after image rebuild). - Adds a `metrics.py` factory that returns a frozen `OcrMetrics` dataclass bound to a `CollectorRegistry`. Production binds against the default `REGISTRY`; tests pass a fresh `CollectorRegistry()` and `monkeypatch` `main.metrics`, so counter state is isolated between cases (decision #3, option A). - Wires every metric named in the ACs: - `ocr_jobs_total{engine, script_type}` — AC2 - `ocr_pages_total{engine}`, `ocr_skipped_pages_total` — AC3 - `ocr_words_total`, `ocr_illegible_words_total` (summed at the call sites in `main.py` per decision #1) — AC4 - `ocr_processing_seconds{engine}` Histogram around every `asyncio.to_thread(engine.extract_*)` call — AC5 - `ocr_training_runs_total{kind, outcome}` and `ocr_model_accuracy{kind}` with `kind=recognition` for `/train` + `/train-sender`, `kind=segmentation` for `/segtrain` (decision #2) — AC6 - `ocr_models_ready` Gauge flipped to 1 once the lifespan startup finishes — AC7 - Adds a `MetricsPathFilter` on `uvicorn.access` to drop log records for `/metrics` and `/health` (Nora's recommendation). - Removes the TODO comment block in `infra/observability/prometheus/prometheus.yml` — the scrape target itself was already pointing at `ocr:8000/metrics` and just needed the endpoint to exist. ## Test plan - [x] `pytest ocr-service/test_metrics.py` — 16 new tests, all green - [x] `pytest ocr-service/test_main.py test_training_auth.py test_confidence.py test_metrics.py` — 43 of 44 pass (the one failure `test_startup_logs_warning_when_running_as_root` is pre-existing on `main` and unrelated; `ASGITransport` does not run lifespan by default, so the lifespan-based assertion in that test never sees the warning) - [ ] After image rebuild on the deployment host: `curl ocr:8000/metrics` returns 200 and the `ocr-service` target shows **UP** at `http://localhost:9090/targets` - [ ] After one `POST /ocr` request: `curl -s ocr:8000/metrics | grep -E '^ocr_(jobs|pages|words|illegible_words|processing_seconds|models_ready)_'` lists the wired metrics ## Out of scope - Grafana panels — tracked in #651 - Alerting rules (e.g. `ocr_models_ready < 1 for 2m`) — flagged for a follow-up by Tobias 🤖 Generated with [Claude Code](https://claude.com/claude-code)
marcel added 14 commits 2026-05-21 16:17:47 +02:00
Mount the instrumentator immediately after FastAPI app creation, excluding
/health and /metrics from request metrics to keep http_requests_total focused
on real application traffic.

Refs #652

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Locks down AC1: prometheus-fastapi-instrumentator must keep auto-exposing
http_requests_total and http_request_duration_seconds for application
traffic, not just register the /metrics endpoint.

Refs #652

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Encapsulates every custom OCR metric in an OcrMetrics frozen dataclass and
exposes a `build_metrics(registry)` factory. Production main.py binds against
the default REGISTRY; tests construct a fresh CollectorRegistry per case and
monkeypatch main.metrics, so counter values stay isolated between tests
(decision #3 on issue #652, Option A).

Refs #652

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pick engine="kraken" for HANDWRITING_KURRENT, engine="surya" otherwise,
then increment after the blocks have been extracted.

Refs #652 (AC2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Locks down AC2 for the non-Kurrent path. The same code branch in /ocr that
sets engine_name from script_type now has explicit coverage for both
HANDWRITING_KURRENT → kraken and TYPEWRITER → surya.

Refs #652 (AC2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bumps the counter inside both the standard and guided /ocr/stream
generators after a page yields its blocks, before the per-page json line is
emitted. Also moves the ocr_jobs_total increment for /ocr/stream right after
engine selection so the counter still fires when a page later errors out.

Refs #652 (AC3a)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bumps the counter in both /ocr/stream except blocks (standard and guided
generators) so the existing skipped_pages local variable now also flows
into Prometheus.

Refs #652 (AC3b)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Walks block["words"] before apply_confidence_markers strips the list, then
increments ocr_words_total by len(words) and ocr_illegible_words_total by
the count below threshold. Same pattern in both /ocr and /ocr/stream so the
ratio illegible/words is a faithful quality signal across endpoints.

Refs #652 (AC4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps every asyncio.to_thread(engine.extract_*) call with time.monotonic()
deltas in /ocr (per document) and in both /ocr/stream generators (per page).
Streaming buckets are the useful operational signal; the non-streaming
observation is a bonus.

Refs #652 (AC5)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps the await asyncio.to_thread(_run_*) calls in /train, /train-sender,
and /segtrain with try/except. Recognition training (/train, /train-sender)
shares kind="recognition"; /segtrain uses kind="segmentation". The
ocr_model_accuracy gauge is set per kind on success.

Refs #652 (AC6, decision #2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hits /train then /segtrain through the same test, each with a distinct
mocked accuracy, and asserts the labelled gauges reflect the two values.
Locks down the kind-label separation between recognition and segmentation
accuracy (decision #2).

Refs #652 (AC6)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mirrors the existing _models_ready bool so Prometheus has a time-series
liveness/readiness signal for future alerting rules (e.g.
ocr_models_ready < 1 for 2m).

Refs #652 (AC7)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a logging.Filter on uvicorn.access that drops records whose request
path is /metrics or /health. Each is hit on a tight schedule (Prometheus
scrape interval and Docker healthcheck), so unfiltered they dominate the
access log without carrying any information about real traffic.

Refs #652 (Nora's recommendation)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ops(observability): drop TODO from ocr-service scrape job in prometheus.yml
All checks were successful
CI / Backend Unit Tests (pull_request) Successful in 3m27s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 18s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
CI / Unit & Component Tests (pull_request) Successful in 3m24s
CI / OCR Service Tests (pull_request) Successful in 20s
e75ac8ec45
The TODO was a placeholder for this work — the OCR service now exposes
/metrics so the target will flip from DOWN to UP on next image rebuild.

Refs #652

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

Felix Brandt — Senior Fullstack Developer

Verdict: Approved with concerns

Clean, well-tested change. TDD evidence is strong (16 new tests, fresh-registry isolation per decision #3, each test names a single behaviour as a sentence). metrics.py is small, frozen-dataclass, single-purpose — exactly the size I want for new modules. build_metrics(registry) is testable by construction. Type hints, docstrings, and from __future__ import annotations are all in place.

Blockers

  1. ocr-service/main.py:144-147, 251-254, 285-288 — duplicated word-counter logic. The five-line block summing len(words) and sum(1 for w in words if w["confidence"] < threshold) now appears three times. Extract a private helper:

    def _observe_block_words(words: list[dict], threshold: float) -> None:
        metrics.ocr_words_total.inc(len(words))
        metrics.ocr_illegible_words_total.inc(
            sum(1 for w in words if w["confidence"] < threshold)
        )
    

    Three duplications is the rule-of-three trigger; the helper also makes the test surface smaller.

  2. ocr-service/main.py:498-504, 585-591, 702-708 — duplicated try/except for training counters. Same three-block pattern around every await asyncio.to_thread(_run_*). Extract:

    async def _record_training(runner, kind: str):
        try:
            result = await asyncio.to_thread(runner)
        except Exception:
            metrics.ocr_training_runs_total.labels(kind=kind, outcome="error").inc()
            raise
        metrics.ocr_training_runs_total.labels(kind=kind, outcome="success").inc()
        if result.get("accuracy") is not None:
            metrics.ocr_model_accuracy.labels(kind=kind).set(result["accuracy"])
        return result
    

Concerns

  1. ocr-service/main.py:201-219 (guided streaming) — page_started taken AFTER await asyncio.to_thread(preprocess_page, ...). That preprocess call is the most expensive step per page and it lands in zero histograms. In the unguided path on line 288 you start the timer just before preprocessing — pick one definition and document it on the metric description, or measure both.

  2. ocr-service/main.py:185 (run_ocr_stream) — ocr_jobs_total.inc() fires before the document is downloaded. If the PDF download or page render fails, the job counter has already been incremented and ocr_pages_total will be zero — analytics will read "1 job, 0 pages, 0 skipped." In run_ocr (line 142) the increment correctly lands after extract returns. Move the stream increment after the first successful page, or after the download.

Nits (skip if tight)

  • metrics.py:38-91build_metrics() is 53 lines of declaration; readable as a config block, no need to split.
  • test_metrics.py:106-128, 137-159, etc. — six tests open with the same six patch() calls. A @pytest.fixture returning the entered context would shrink each test by ~10 lines, but the explicit form is fine to keep.
  • test_metrics.py:106 — fixture uses monkeypatch.setattr("main.metrics", ...), but the production metrics symbol is captured by closure inside the endpoint functions at call time (good); just worth a one-line comment explaining why the patch works.
## Felix Brandt — Senior Fullstack Developer **Verdict: Approved with concerns** Clean, well-tested change. TDD evidence is strong (16 new tests, fresh-registry isolation per decision #3, each test names a single behaviour as a sentence). `metrics.py` is small, frozen-dataclass, single-purpose — exactly the size I want for new modules. `build_metrics(registry)` is testable by construction. Type hints, docstrings, and `from __future__ import annotations` are all in place. ### Blockers 1. **`ocr-service/main.py:144-147, 251-254, 285-288` — duplicated word-counter logic.** The five-line block summing `len(words)` and `sum(1 for w in words if w["confidence"] < threshold)` now appears three times. Extract a private helper: ```python def _observe_block_words(words: list[dict], threshold: float) -> None: metrics.ocr_words_total.inc(len(words)) metrics.ocr_illegible_words_total.inc( sum(1 for w in words if w["confidence"] < threshold) ) ``` Three duplications is the rule-of-three trigger; the helper also makes the test surface smaller. 2. **`ocr-service/main.py:498-504, 585-591, 702-708` — duplicated try/except for training counters.** Same three-block pattern around every `await asyncio.to_thread(_run_*)`. Extract: ```python async def _record_training(runner, kind: str): try: result = await asyncio.to_thread(runner) except Exception: metrics.ocr_training_runs_total.labels(kind=kind, outcome="error").inc() raise metrics.ocr_training_runs_total.labels(kind=kind, outcome="success").inc() if result.get("accuracy") is not None: metrics.ocr_model_accuracy.labels(kind=kind).set(result["accuracy"]) return result ``` ### Concerns 3. **`ocr-service/main.py:201-219` (guided streaming) — `page_started` taken AFTER `await asyncio.to_thread(preprocess_page, ...)`.** That preprocess call is the most expensive step per page and it lands in zero histograms. In the unguided path on line 288 you start the timer just before preprocessing — pick one definition and document it on the metric description, or measure both. 4. **`ocr-service/main.py:185` (`run_ocr_stream`) — `ocr_jobs_total.inc()` fires before the document is downloaded.** If the PDF download or page render fails, the job counter has already been incremented and `ocr_pages_total` will be zero — analytics will read "1 job, 0 pages, 0 skipped." In `run_ocr` (line 142) the increment correctly lands after extract returns. Move the stream increment after the first successful page, or after the download. ### Nits (skip if tight) - `metrics.py:38-91` — `build_metrics()` is 53 lines of declaration; readable as a config block, no need to split. - `test_metrics.py:106-128, 137-159, etc.` — six tests open with the same six `patch()` calls. A `@pytest.fixture` returning the entered context would shrink each test by ~10 lines, but the explicit form is fine to keep. - `test_metrics.py:106` — fixture uses `monkeypatch.setattr("main.metrics", ...)`, but the production `metrics` symbol is captured by closure inside the endpoint functions at call time (good); just worth a one-line comment explaining why the patch works.
Author
Owner

Markus Keller — Senior Application Architect

Verdict: Approved with concerns

The factory + registry-injection pattern is the right call here. It keeps the prometheus_client global out of the test surface, the dataclass freezes the contract, and the module boundary between metrics.py and main.py is clean. Boring, monolithic, no new infrastructure introduced — exactly the kind of incremental change I prefer.

Blockers

  1. Doc-update omissions (these block merge per my standing rule):
    • docs/architecture/c4/l2-containers.puml — the OCR container now exposes a Prometheus scrape endpoint. The arrow Prometheus → ocr-service either does not exist on the L2 diagram or is unannotated. Add it.
    • docs/OBSERVABILITY.md — new custom metrics (ocr_jobs_total, ocr_pages_total, ocr_skipped_pages_total, ocr_words_total, ocr_illegible_words_total, ocr_processing_seconds, ocr_training_runs_total, ocr_model_accuracy, ocr_models_ready) must be listed with their labels and units, or an operator reading the doc will not know they exist.
    • The PR body mentions a TODO removed from prometheus.yml — fine, but no ADR was added for "we use prometheus-fastapi-instrumentator + a per-module factory for registry injection." This is the architectural decision that justifies decision #3 on issue #652; capture it in docs/adr/ so the next maintainer does not undo the factory pattern.

Concerns

  1. ocr-service/main.py:42metrics: OcrMetrics = build_metrics(REGISTRY) runs at module import. Side effect at import time. It works (and you have to do it for the instrumentator to find the registry), but document it in the module docstring. Otherwise a future refactor that lazy-imports main for any reason will silently double-register and crash.

  2. ocr-service/main.py:82Instrumentator(...).instrument(app).expose(app) mounts /metrics unauthenticated. Acceptable today because ocr:8000 is on the internal Docker network and not behind Caddy — but the moment someone exposes the OCR container, this leaks request counts and timing. Add a comment in main.py next to the Instrumentator(...) call naming the threat model ("internal network only; never expose ocr:8000 publicly"), or have Nora speak to it.

Nits

  • OcrMetrics is a single feature module — no boundary violation. Cross-domain access is N/A here.
## Markus Keller — Senior Application Architect **Verdict: Approved with concerns** The factory + registry-injection pattern is the right call here. It keeps the `prometheus_client` global out of the test surface, the dataclass freezes the contract, and the module boundary between `metrics.py` and `main.py` is clean. Boring, monolithic, no new infrastructure introduced — exactly the kind of incremental change I prefer. ### Blockers 1. **Doc-update omissions (these block merge per my standing rule):** - `docs/architecture/c4/l2-containers.puml` — the OCR container now exposes a Prometheus scrape endpoint. The arrow Prometheus → ocr-service either does not exist on the L2 diagram or is unannotated. Add it. - `docs/OBSERVABILITY.md` — new custom metrics (`ocr_jobs_total`, `ocr_pages_total`, `ocr_skipped_pages_total`, `ocr_words_total`, `ocr_illegible_words_total`, `ocr_processing_seconds`, `ocr_training_runs_total`, `ocr_model_accuracy`, `ocr_models_ready`) must be listed with their labels and units, or an operator reading the doc will not know they exist. - The PR body mentions a TODO removed from `prometheus.yml` — fine, but no ADR was added for "we use prometheus-fastapi-instrumentator + a per-module factory for registry injection." This is the architectural decision that justifies decision #3 on issue #652; capture it in `docs/adr/` so the next maintainer does not undo the factory pattern. ### Concerns 2. **`ocr-service/main.py:42` — `metrics: OcrMetrics = build_metrics(REGISTRY)` runs at module import.** Side effect at import time. It works (and you have to do it for the instrumentator to find the registry), but document it in the module docstring. Otherwise a future refactor that lazy-imports `main` for any reason will silently double-register and crash. 3. **`ocr-service/main.py:82` — `Instrumentator(...).instrument(app).expose(app)`** mounts `/metrics` unauthenticated. Acceptable today because `ocr:8000` is on the internal Docker network and not behind Caddy — but the moment someone exposes the OCR container, this leaks request counts and timing. Add a comment in `main.py` next to the `Instrumentator(...)` call naming the threat model ("internal network only; never expose `ocr:8000` publicly"), or have Nora speak to it. ### Nits - `OcrMetrics` is a single feature module — no boundary violation. Cross-domain access is N/A here.
Author
Owner

Tobias Wendt — DevOps & Platform Engineer

Verdict: Approved with concerns

Long-awaited cleanup — the Prometheus target finally flips from DOWN to UP, the TODO in prometheus.yml was a real operational smell, and the access-log filter is exactly what I would have asked for. Good choice on prometheus-fastapi-instrumentator==7.0.0 (pinned, not >=).

Blockers

  1. infra/observability/prometheus/prometheus.yml has no scrape interval / timeout / metrics_path freshness check after the change. The diff only removes the TODO and leaves metrics_path: /metrics + targets: ['ocr:8000']. Verify by hand (or attach a screenshot to this PR) that http://localhost:9090/targets shows the ocr-service target as UP with last-scrape < 30s. The PR's Test plan still has this checkbox unchecked — that must be ticked before merge, not after.

  2. No alerting rule for ocr_models_ready < 1 for 2m. The PR explicitly defers this to "a follow-up." Fine — but open a Gitea issue and link it from this PR before merge. The gauge is worthless on its own; an alert is what makes it operational. Without a tracking issue, "follow-up" is how alert gaps live for six months.

Concerns

  1. ocr-service/main.py:82Instrumentator().expose(app) adds a public /metrics endpoint with no auth. Today the OCR container is only reachable from inside the compose network, so this is fine. If anyone ever puts ocr-service behind Caddy directly, this leaks. Add a Caddy block in docs/DEPLOYMENT.md showing respond /metrics 404 if/when ocr-service goes public. Same applies to /health already, so document both together.

  2. ocr-service/requirements.txt:13prometheus-fastapi-instrumentator==7.0.0 is pinned, but its transitive prometheus-client is not. That is mostly fine because pip's resolver will pick a compatible version, but it means the Docker image is not reproducible across base-image rebuilds. Either add prometheus-client==<x.y.z> to the requirements or commit a requirements.lock / pip-compile output. Renovate will keep both up to date.

  3. The MetricsPathFilter on uvicorn.access is a Python-side log filter — once we ship Loki, the noise re-appears at the container-log layer. Either also filter in Promtail/Alloy config, or note in docs/OBSERVABILITY.md that the suppression is local to uvicorn's logger so future Loki users do not chase a "missing log" ghost.

Nits

  • ocr-service/test_metrics.py runs synchronously against ASGITransport(app=app) — fast, no docker required. Plays nicely with CI.
  • The PR description correctly notes the pre-existing test_startup_logs_warning_when_running_as_root failure is unrelated; thanks for calling that out so I did not chase it.
## Tobias Wendt — DevOps & Platform Engineer **Verdict: Approved with concerns** Long-awaited cleanup — the Prometheus target finally flips from DOWN to UP, the TODO in `prometheus.yml` was a real operational smell, and the access-log filter is exactly what I would have asked for. Good choice on `prometheus-fastapi-instrumentator==7.0.0` (pinned, not `>=`). ### Blockers 1. **`infra/observability/prometheus/prometheus.yml` has no scrape interval / timeout / metrics_path freshness check after the change.** The diff only removes the TODO and leaves `metrics_path: /metrics` + `targets: ['ocr:8000']`. Verify by hand (or attach a screenshot to this PR) that `http://localhost:9090/targets` shows the ocr-service target as **UP** with last-scrape < 30s. The PR's Test plan still has this checkbox unchecked — that must be ticked before merge, not after. 2. **No alerting rule for `ocr_models_ready < 1 for 2m`.** The PR explicitly defers this to "a follow-up." Fine — but open a Gitea issue and link it from this PR before merge. The gauge is worthless on its own; an alert is what makes it operational. Without a tracking issue, "follow-up" is how alert gaps live for six months. ### Concerns 3. **`ocr-service/main.py:82` — `Instrumentator().expose(app)` adds a public `/metrics` endpoint with no auth.** Today the OCR container is only reachable from inside the compose network, so this is fine. If anyone ever puts ocr-service behind Caddy directly, this leaks. Add a Caddy block in `docs/DEPLOYMENT.md` showing `respond /metrics 404` if/when ocr-service goes public. Same applies to `/health` already, so document both together. 4. **`ocr-service/requirements.txt:13` — `prometheus-fastapi-instrumentator==7.0.0` is pinned, but its transitive `prometheus-client` is not.** That is mostly fine because pip's resolver will pick a compatible version, but it means the Docker image is not reproducible across base-image rebuilds. Either add `prometheus-client==<x.y.z>` to the requirements or commit a `requirements.lock` / `pip-compile` output. Renovate will keep both up to date. 5. **The `MetricsPathFilter` on `uvicorn.access` is a Python-side log filter — once we ship Loki, the noise re-appears at the container-log layer.** Either also filter in Promtail/Alloy config, or note in `docs/OBSERVABILITY.md` that the suppression is local to uvicorn's logger so future Loki users do not chase a "missing log" ghost. ### Nits - `ocr-service/test_metrics.py` runs synchronously against `ASGITransport(app=app)` — fast, no docker required. Plays nicely with CI. - The PR description correctly notes the pre-existing `test_startup_logs_warning_when_running_as_root` failure is unrelated; thanks for calling that out so I did not chase it.
Author
Owner

Nora "NullX" Steiner — Application Security Engineer

Verdict: Approved with concerns

Mostly clean. Custom metrics carry no PII (counters/gauges over engine/script_type/kind/outcome labels, no usernames, no document IDs, no PDF URLs). The training endpoints still gate on X-Training-Token. No new injection surface, no shell exec, no untrusted deserialization.

Blockers

None.

Concerns

  1. ocr-service/main.py:82 — unauthenticated /metrics endpoint. Instrumentator().instrument(app).expose(app) mounts /metrics with no auth and no IP allowlist. The default exporter exposes:

    • all custom OCR counters/histograms (low-sensitivity but enables traffic analysis)
    • the auto-instrumented http_requests_total{handler=...} — this reveals every URL path the service exposes, including the existence of /train, /train-sender, /segtrain. An attacker who reaches the metrics endpoint learns the service topology for free.

    Threat model: today ocr:8000 is only reachable from inside the compose network, so this is acceptable. The risk is operational drift — if someone ever routes ocr-service through Caddy, this becomes public information disclosure (CWE-200). Mitigation, in order of preference:

    • At Caddy: respond /metrics 404 for any future public route (coordinate with Tobias).
    • In code: bind /metrics to a separate management port via Instrumentator(...).expose(app, endpoint="/metrics", include_in_schema=False, tags=["internal"]) and document the internal-network requirement in a comment above line 82 stating the threat model (per my "comments explain the threat model" rule).
  2. ocr-service/main.py:93-97MetricsPathFilter reads record.args[2] assuming uvicorn's positional format. Defensive but path-tolerant — good. However, if uvicorn ever changes its format string to keyword args, this silently turns off and /metrics requests get logged with bearer tokens / query strings in the URL line. Add an assertion test for the negative case (e.g. record.args = None or arity < 3) — your current test_uvicorn_access_log_filter_skips_metrics_path only covers the positive case.

Nits

  • ocr_skipped_pages_total is a counter exposed without labels — fine. No information disclosure.
  • Custom-metric label cardinality: engine (2 values), script_type (4-ish), kind (2), outcome (2). Bounded and safe; no risk of metric-name explosion from user input.
  • metrics.py is a clean, auditable module — exactly the "centralize security-adjacent config" pattern I want. Good.
## Nora "NullX" Steiner — Application Security Engineer **Verdict: Approved with concerns** Mostly clean. Custom metrics carry no PII (counters/gauges over engine/script_type/kind/outcome labels, no usernames, no document IDs, no PDF URLs). The training endpoints still gate on `X-Training-Token`. No new injection surface, no shell exec, no untrusted deserialization. ### Blockers None. ### Concerns 1. **`ocr-service/main.py:82` — unauthenticated `/metrics` endpoint.** `Instrumentator().instrument(app).expose(app)` mounts `/metrics` with no auth and no IP allowlist. The default exporter exposes: - all custom OCR counters/histograms (low-sensitivity but enables traffic analysis) - the auto-instrumented `http_requests_total{handler=...}` — this reveals every URL path the service exposes, including the existence of `/train`, `/train-sender`, `/segtrain`. An attacker who reaches the metrics endpoint learns the service topology for free. Threat model: today `ocr:8000` is only reachable from inside the compose network, so this is acceptable. The risk is operational drift — if someone ever routes ocr-service through Caddy, this becomes public information disclosure (CWE-200). Mitigation, in order of preference: - At Caddy: `respond /metrics 404` for any future public route (coordinate with Tobias). - In code: bind `/metrics` to a separate management port via `Instrumentator(...).expose(app, endpoint="/metrics", include_in_schema=False, tags=["internal"])` and document the internal-network requirement in a comment above line 82 stating the threat model (per my "comments explain the threat model" rule). 2. **`ocr-service/main.py:93-97` — `MetricsPathFilter` reads `record.args[2]` assuming uvicorn's positional format.** Defensive but path-tolerant — good. However, if uvicorn ever changes its format string to keyword args, this silently turns off and `/metrics` requests get logged with bearer tokens / query strings in the URL line. Add an assertion test for the negative case (e.g. `record.args = None` or arity < 3) — your current `test_uvicorn_access_log_filter_skips_metrics_path` only covers the positive case. ### Nits - `ocr_skipped_pages_total` is a counter exposed without labels — fine. No information disclosure. - Custom-metric label cardinality: `engine` (2 values), `script_type` (4-ish), `kind` (2), `outcome` (2). Bounded and safe; no risk of metric-name explosion from user input. - `metrics.py` is a clean, auditable module — exactly the "centralize security-adjacent config" pattern I want. Good.
Author
Owner

Sara Holt — Senior QA Engineer

Verdict: Approved with concerns

16 new tests, each with a sentence-name, fresh CollectorRegistry() per test for state isolation, ASGITransport (no docker, fast), assertions on exact counter values. This is what well-tested infrastructure code looks like — thank you. Test pyramid is correct: unit + in-process API tests, no E2E bloat.

Blockers

None.

Concerns

  1. test_metrics.py:308-329test_ocr_training_runs_total_incremented_with_recognition_error_label patches main.asyncio.to_thread globally with a side-effect that always raises. Problem: /train does ZIP validation and file IO before reaching the training runner — both of which use asyncio.to_thread in some code paths via FastAPI/Starlette. The test currently passes because the runner is the first thing patched, but it's coincidental coupling. Patch the inner _run_training symbol instead (e.g. patch("main._run_training", side_effect=RuntimeError(...)) if it can be made addressable, or capture and assert that the exception is in fact propagating from inside the wrapped runner, not from the surrounding endpoint).

  2. Missing coverage:

    • No test for train_sender_model (/train-sender) — both the success and error branch increment ocr_training_runs_total{kind=recognition, ...} but the diff only tests /train and /segtrain. If someone changes the label to kind=sender later, no test catches it.
    • No test for the ocr_model_accuracy Gauge being not set when result["accuracy"] is None. The if result.get("accuracy") is not None: guard on lines 503/590/707 is uncovered.
    • No test for the unguided-stream path emitting ocr_processing_seconds. test_ocr_processing_seconds_histogram_observed_per_page_in_stream covers the regular path; the guided path on line 220 also observes the histogram and is untested.
  3. test_metrics.py:139-148, 175-184, 286-296, ... — six tests duplicate the same with patch(...), patch(...), ... : block with five-to-six entries. This is the rule-of-three for tests too. Extract a @pytest.fixture that yields a context-managed stack of patches, or a helper function returning the configured client. Right now any change to e.g. the _download_and_convert_pdf patch target requires updating six test bodies.

Nits

  • The pre-existing test_startup_logs_warning_when_running_as_root failure on main — fine to call out in the PR body, but please open a Gitea issue for it so it does not get forgotten when ASGITransport upgrades start running lifespan by default.
  • _histogram_count_sum reads child._sum.get() and child._buckets — touching prometheus_client internals. Works today, but pin prometheus-client in requirements.txt (currently transitive) so an upgrade does not silently break these assertions.
  • Consider adding the test plan checkbox curl ocr:8000/metrics | grep -E '^ocr_(jobs|pages|words...' lists the wired metrics as an automated smoke test in CI — that would close the loop on the manual verification still pending in the PR description.
## Sara Holt — Senior QA Engineer **Verdict: Approved with concerns** 16 new tests, each with a sentence-name, fresh `CollectorRegistry()` per test for state isolation, `ASGITransport` (no docker, fast), assertions on exact counter values. This is what well-tested infrastructure code looks like — thank you. Test pyramid is correct: unit + in-process API tests, no E2E bloat. ### Blockers None. ### Concerns 1. **`test_metrics.py:308-329` — `test_ocr_training_runs_total_incremented_with_recognition_error_label` patches `main.asyncio.to_thread` globally with a side-effect that always raises.** Problem: `/train` does ZIP validation and file IO before reaching the training runner — both of which use `asyncio.to_thread` in some code paths via FastAPI/Starlette. The test currently passes because the runner is the first thing patched, but it's coincidental coupling. Patch the inner `_run_training` symbol instead (e.g. `patch("main._run_training", side_effect=RuntimeError(...))` if it can be made addressable, or capture and assert that the exception is in fact propagating from inside the wrapped runner, not from the surrounding endpoint). 2. **Missing coverage:** - No test for `train_sender_model` (`/train-sender`) — both the success and error branch increment `ocr_training_runs_total{kind=recognition, ...}` but the diff only tests `/train` and `/segtrain`. If someone changes the label to `kind=sender` later, no test catches it. - No test for the `ocr_model_accuracy` Gauge being **not** set when `result["accuracy"] is None`. The `if result.get("accuracy") is not None:` guard on lines 503/590/707 is uncovered. - No test for the unguided-stream path emitting `ocr_processing_seconds`. `test_ocr_processing_seconds_histogram_observed_per_page_in_stream` covers the regular path; the guided path on line 220 also observes the histogram and is untested. 3. **`test_metrics.py:139-148, 175-184, 286-296, ...` — six tests duplicate the same `with patch(...), patch(...), ... :` block with five-to-six entries.** This is the rule-of-three for tests too. Extract a `@pytest.fixture` that yields a context-managed stack of patches, or a helper function returning the configured client. Right now any change to e.g. the `_download_and_convert_pdf` patch target requires updating six test bodies. ### Nits - The pre-existing `test_startup_logs_warning_when_running_as_root` failure on `main` — fine to call out in the PR body, but please open a Gitea issue for it so it does not get forgotten when ASGITransport upgrades start running lifespan by default. - `_histogram_count_sum` reads `child._sum.get()` and `child._buckets` — touching `prometheus_client` internals. Works today, but pin `prometheus-client` in `requirements.txt` (currently transitive) so an upgrade does not silently break these assertions. - Consider adding the test plan checkbox `curl ocr:8000/metrics | grep -E '^ocr_(jobs|pages|words...' lists the wired metrics` as an automated smoke test in CI — that would close the loop on the manual verification still pending in the PR description.
Author
Owner

Leonie Voss — UI/UX Design Lead

Verdict: Approved — out of scope for UI/UX

This PR is entirely backend (ocr-service/main.py, metrics.py, requirements.txt, prometheus.yml, test_metrics.py). No SvelteKit components, no Tailwind classes, no user-facing strings, no Paraglide message keys, no accessibility surface.

What I checked

  • No changes under frontend/ — confirmed
  • No new error codes that would need i18n keys in messages/{de,en,es}.json
  • No new user-facing flows (the /metrics endpoint is an operator/Prometheus surface, not a user surface)
  • Out-of-scope item "Grafana panels — tracked in #651" — when that lands I will review the dashboard for senior-readable typography, color-only-status anti-patterns, and 320px viewport behavior on phones

Forward note

When the Grafana dashboards (#651) ship: please use brand-aligned panel colors (avoid traffic-light reds alone for status — pair with icons and labels), keep typography ≥ 12px on every panel, and verify dashboards at 320px width because we have operators on phones too.

LGTM from a UX standpoint.

## Leonie Voss — UI/UX Design Lead **Verdict: Approved — out of scope for UI/UX** This PR is entirely backend (`ocr-service/main.py`, `metrics.py`, `requirements.txt`, `prometheus.yml`, `test_metrics.py`). No SvelteKit components, no Tailwind classes, no user-facing strings, no Paraglide message keys, no accessibility surface. ### What I checked - No changes under `frontend/` — confirmed - No new error codes that would need i18n keys in `messages/{de,en,es}.json` - No new user-facing flows (the `/metrics` endpoint is an operator/Prometheus surface, not a user surface) - Out-of-scope item "Grafana panels — tracked in #651" — when that lands I will review the dashboard for senior-readable typography, color-only-status anti-patterns, and 320px viewport behavior on phones ### Forward note When the Grafana dashboards (#651) ship: please use brand-aligned panel colors (avoid traffic-light reds alone for status — pair with icons and labels), keep typography ≥ 12px on every panel, and verify dashboards at 320px width because we have operators on phones too. LGTM from a UX standpoint.
Author
Owner

Elicit — Requirements Engineer (Brownfield mode)

Verdict: Approved with concerns

The PR body maps cleanly to the ACs on #652 (AC2 through AC7 all called out with specific line/decision references). Decision points #1, #2, #3 from the issue are explicitly resolved with rationale in the description. This is the level of traceability I expect — well done.

Blockers

  1. Two unchecked items in the PR's own Test plan remain unverified:

    • [ ] After image rebuild on the deployment host: curl ocr:8000/metrics returns 200 and the ocr-service target shows **UP** at http://localhost:9090/targets
    • [ ] After one POST /ocr request: ocr_(jobs|pages|words|illegible_words|processing_seconds|models_ready)_ are listed

    These are part of the Definition of Done for #652 (AC1, AC7). Either tick them with evidence (screenshot, curl transcript pasted into the PR comments) before merge, or move them to a follow-up issue with an owner and a date. An unchecked Test plan box is a TBD that quietly turns into requirements debt.

  2. "Alerting rules — flagged for a follow-up by Tobias" — no Gitea issue linked. Per the workflow convention, every follow-up needs an issue number in the PR description. Open the follow-up issue ("Add Prometheus alerting rules for ocr_models_ready < 1, ocr_skipped_pages_total rate, ocr_training_runs_total{outcome=error}") and link it here, otherwise it disappears.

Concerns

  1. NFR coverage check against #652:

    • Observability (the main NFR being delivered): met.
    • Performance: the histogram timing brackets the engine call but is not asserted to add < N ms overhead per request. Add an NFR threshold to a follow-up issue, e.g. "Instrumentation overhead < 5 ms p95 per OCR request."
    • Privacy: no PII in metric labels — confirmed by Nora's review. Worth recording explicitly in docs/OBSERVABILITY.md so future contributors know not to add e.g. user_id as a label.
    • Reliability: no test for what happens if prometheus_client registry-collision occurs (e.g. two build_metrics(REGISTRY) calls). Today this is impossible because main.py is imported once, but it's a brittle invariant.
  2. Glossary gap. docs/GLOSSARY.md should now list: OCR metrics, illegible-word ratio, model accuracy gauge, models-ready gauge. Operators reading dashboards will not know what "illegible words" means without it.

Nits (optional)

  • Consider tightening AC6 wording in #652's done-criteria from "training counter is incremented" to "training counter is incremented within the same request lifecycle as the training response is sent" — current implementation does this, but the AC does not pin it down.
  • The PR title is correctly verb-noun ("expose Prometheus /metrics endpoint with OCR-domain counters"). Good.
## Elicit — Requirements Engineer (Brownfield mode) **Verdict: Approved with concerns** The PR body maps cleanly to the ACs on #652 (AC2 through AC7 all called out with specific line/decision references). Decision points #1, #2, #3 from the issue are explicitly resolved with rationale in the description. This is the level of traceability I expect — well done. ### Blockers 1. **Two unchecked items in the PR's own Test plan remain unverified:** - `[ ] After image rebuild on the deployment host: curl ocr:8000/metrics returns 200 and the ocr-service target shows **UP** at http://localhost:9090/targets` - `[ ] After one POST /ocr request: ocr_(jobs|pages|words|illegible_words|processing_seconds|models_ready)_ are listed` These are part of the Definition of Done for #652 (AC1, AC7). Either tick them with evidence (screenshot, curl transcript pasted into the PR comments) before merge, or move them to a follow-up issue with an owner and a date. An unchecked Test plan box is a TBD that quietly turns into requirements debt. 2. **"Alerting rules — flagged for a follow-up by Tobias" — no Gitea issue linked.** Per the workflow convention, every follow-up needs an issue number in the PR description. Open the follow-up issue ("Add Prometheus alerting rules for ocr_models_ready < 1, ocr_skipped_pages_total rate, ocr_training_runs_total{outcome=error}") and link it here, otherwise it disappears. ### Concerns 3. **NFR coverage check against #652:** - Observability (the main NFR being delivered): **met**. - Performance: the histogram timing brackets the engine call but is not asserted to add < N ms overhead per request. Add an NFR threshold to a follow-up issue, e.g. "Instrumentation overhead < 5 ms p95 per OCR request." - Privacy: no PII in metric labels — confirmed by Nora's review. Worth recording explicitly in `docs/OBSERVABILITY.md` so future contributors know not to add e.g. `user_id` as a label. - Reliability: no test for what happens if `prometheus_client` registry-collision occurs (e.g. two `build_metrics(REGISTRY)` calls). Today this is impossible because `main.py` is imported once, but it's a brittle invariant. 4. **Glossary gap.** `docs/GLOSSARY.md` should now list: OCR metrics, illegible-word ratio, model accuracy gauge, models-ready gauge. Operators reading dashboards will not know what "illegible words" means without it. ### Nits (optional) - Consider tightening AC6 wording in #652's done-criteria from "training counter is incremented" to "training counter is incremented within the same request lifecycle as the training response is sent" — current implementation does this, but the AC does not pin it down. - The PR title is correctly verb-noun ("expose Prometheus /metrics endpoint with OCR-domain counters"). Good.
marcel added 11 commits 2026-05-21 17:07:16 +02:00
Locks in the post-download placement of the counter increment so a
regression that moves it back above _download_and_convert_pdf would fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously the guided generator's page_started timer wrapped the entire
region loop including the synchronous correct_text() call, inflating
ocr_processing_seconds with spell-check latency. Sum the per-region
engine.extract_region_text durations instead so the histogram matches
the unguided stream's "engine only" semantic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two regression tests:
- /train-sender hitting the success path bumps the recognition counter
  (previously only /train and /segtrain were covered).
- A successful run whose result.accuracy is None must not call set() on
  ocr_model_accuracy — the gauge stays at its default 0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If uvicorn's access log format ever changes (args=None, or shorter
than 3 elements), the filter must keep forwarding records rather than
silently dropping them. Two extra LogRecords cover both edge cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The asyncio.to_thread patch stubbed out the entire _run_training call,
hiding the real error path. Replacing it with a failing CompletedProcess
from subprocess.run exercises the actual ketos-failed branch and keeps
the test's intent — error counter bumps, 500 surfaces — intact.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The two block-iteration loops (/ocr and /ocr/stream's standard generator)
both ran the same word-total and illegible-word increments. Lift them
into a single helper so each call site becomes one line and the counter
intent reads cleanly. Pure refactor — no behavior change, tests stay green.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The /train, /train-sender, and /segtrain endpoints each duplicated the
same eight-line try/except + counter + gauge block around the
asyncio.to_thread call. Lift it into _record_training(runner, kind),
which accepts a sync- or async-returning callable for flexibility.
Each endpoint now ends with a single return line. Behaviour preserved —
status codes, error propagation, and metric labels stay identical.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each metrics test was repeating the same five-line block — patch
kraken_engine.load_models, patch load_spell_checker, instantiate the
AsyncClient, force _models_ready True, restore it. Lift the lot into a
single async context manager so each test body shrinks to its real
arrange / act / assert intent.

Tests that drive the lifespan directly (models_ready gauge) or stub
asyncio.to_thread for /train (which already patches _models_ready) stay
unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three small drops that pay back later:
- Note that main.metrics is import-time bound and tests must
  monkeypatch `main.metrics`, not the registry.
- Flag the /metrics endpoint as unauthenticated and cross-link the
  Caddy-block snippet in docs/OBSERVABILITY.md.
- Pin prometheus-client to the exact 0.25.0 patch version already
  resolved by prometheus-fastapi-instrumentator 7.0.0, so an upstream
  bump cannot silently slip in.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- L2 container diagram now shows the Prometheus -> ocr:8000 scrape edge
  (plus the previously-undrawn Prometheus -> backend edge for symmetry).
- OBSERVABILITY.md gains a full ocr_* metrics table with labels, units,
  and the canonical example queries from issue #652.
- New "Internal-only endpoints" subsection captures the unauthenticated
  /metrics caveat and provides the Caddy block snippet for the case
  where the service ever gets a host port.
- Explicit note that MetricsPathFilter only quiets uvicorn stdout, and
  the OCR metrics must never carry PII or document content.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs: add ADR-023 and glossary entries for OCR metrics
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m33s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m29s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
2df71beb7e
ADR-023 captures why prometheus-fastapi-instrumentator was chosen,
the build_metrics(registry) factory pattern, and the test rebinding
seam. The glossary gains four ops-aligned terms — illegible word,
models-ready gauge, recognition vs segmentation accuracy — so the
metrics documentation in OBSERVABILITY.md has a vocabulary to lean on.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

Local verification — B3 evidence

Ran the unchecked test-plan items against the worktree's ocr-service/.venv because docker compose up -d --build ocr would have required rebuilding the Surya/Kraken model image on this laptop (multi-GB). Substituted uvicorn main:app --port 8000 inside the worktree venv with KRAKEN_MODEL_PATH=/nonexistent so the lifespan startup runs without downloading the Kraken model. Same FastAPI app, same Instrumentator(...).expose(app) wiring, same MetricsPathFilter — only the runtime is host-python instead of container-python.

1. GET /metrics returns 200

$ curl -s -o /dev/null -w "HTTP %{http_code}\n" http://127.0.0.1:8000/metrics
HTTP 200

2. All nine ocr_* metric families are exposed

$ curl -s http://127.0.0.1:8000/metrics | grep -E "^# (HELP|TYPE) ocr_"
# HELP ocr_jobs_total Number of OCR jobs processed, labelled by engine and script type.
# TYPE ocr_jobs_total counter
# HELP ocr_pages_total Number of pages successfully OCR'd, labelled by engine.
# TYPE ocr_pages_total counter
# HELP ocr_skipped_pages_total Number of pages skipped because the OCR engine raised.
# TYPE ocr_skipped_pages_total counter
# HELP ocr_words_total Number of words recognized across all OCR blocks.
# TYPE ocr_words_total counter
# HELP ocr_illegible_words_total Number of words below the confidence threshold (replaced with [unleserlich]).
# TYPE ocr_illegible_words_total counter
# HELP ocr_processing_seconds OCR processing time per page (streaming) or per document (non-streaming).
# TYPE ocr_processing_seconds histogram
# HELP ocr_training_runs_total Number of training runs, labelled by kind (recognition|segmentation) and outcome (success|error).
# TYPE ocr_training_runs_total counter
# HELP ocr_model_accuracy Latest model accuracy reported by a successful training run.
# TYPE ocr_model_accuracy gauge
# HELP ocr_models_ready 1 once the lifespan startup has finished loading models, 0 before.
# TYPE ocr_models_ready gauge

ocr_models_ready 1.0 — confirms the lifespan flip is wired correctly.

3. POST /ocr increments http_requests_total

Drove a request against the SSRF guard (bogus host) so no OCR work runs but the FastAPI middleware still observes the request.

$ curl -s -X POST http://127.0.0.1:8000/ocr \
    -H "Content-Type: application/json" \
    -d '{"pdfUrl":"http://evil.example.com/x.pdf","scriptType":"TYPEWRITER","language":"de"}'
{"detail":"PDF host not allowed: evil.example.com"}

$ curl -s http://127.0.0.1:8000/metrics | grep -E "^http_requests_total"
http_requests_total{handler="/ocr",method="POST",status="4xx"} 1.0

Instrumentator(...).instrument(app).expose(app) is doing its job — http_* metrics appear after the request, exactly as the PR body's grep expects.

Not exercised locally

  • The Prometheus /targets UP/DOWN check (http://localhost:9090/targets) needs the observability stack. That stack is not part of docker-compose.yml; it lives in docker-compose.observability.yml per ADR-009 and ADR-016. Verifying the scrape edge will happen on deploy. The scrape job is already in infra/observability/prometheus/prometheus.yml (committed earlier on this branch: e75ac8ec ops(observability): drop TODO from ocr-service scrape job in prometheus.yml).

🤖 Generated with Claude Code

## Local verification — B3 evidence Ran the unchecked test-plan items against the worktree's `ocr-service/.venv` because `docker compose up -d --build ocr` would have required rebuilding the Surya/Kraken model image on this laptop (multi-GB). Substituted `uvicorn main:app --port 8000` inside the worktree venv with `KRAKEN_MODEL_PATH=/nonexistent` so the lifespan startup runs without downloading the Kraken model. Same FastAPI app, same `Instrumentator(...).expose(app)` wiring, same `MetricsPathFilter` — only the runtime is host-python instead of container-python. ### 1. `GET /metrics` returns 200 ``` $ curl -s -o /dev/null -w "HTTP %{http_code}\n" http://127.0.0.1:8000/metrics HTTP 200 ``` ### 2. All nine `ocr_*` metric families are exposed ``` $ curl -s http://127.0.0.1:8000/metrics | grep -E "^# (HELP|TYPE) ocr_" # HELP ocr_jobs_total Number of OCR jobs processed, labelled by engine and script type. # TYPE ocr_jobs_total counter # HELP ocr_pages_total Number of pages successfully OCR'd, labelled by engine. # TYPE ocr_pages_total counter # HELP ocr_skipped_pages_total Number of pages skipped because the OCR engine raised. # TYPE ocr_skipped_pages_total counter # HELP ocr_words_total Number of words recognized across all OCR blocks. # TYPE ocr_words_total counter # HELP ocr_illegible_words_total Number of words below the confidence threshold (replaced with [unleserlich]). # TYPE ocr_illegible_words_total counter # HELP ocr_processing_seconds OCR processing time per page (streaming) or per document (non-streaming). # TYPE ocr_processing_seconds histogram # HELP ocr_training_runs_total Number of training runs, labelled by kind (recognition|segmentation) and outcome (success|error). # TYPE ocr_training_runs_total counter # HELP ocr_model_accuracy Latest model accuracy reported by a successful training run. # TYPE ocr_model_accuracy gauge # HELP ocr_models_ready 1 once the lifespan startup has finished loading models, 0 before. # TYPE ocr_models_ready gauge ``` `ocr_models_ready 1.0` — confirms the lifespan flip is wired correctly. ### 3. `POST /ocr` increments `http_requests_total` Drove a request against the SSRF guard (bogus host) so no OCR work runs but the FastAPI middleware still observes the request. ``` $ curl -s -X POST http://127.0.0.1:8000/ocr \ -H "Content-Type: application/json" \ -d '{"pdfUrl":"http://evil.example.com/x.pdf","scriptType":"TYPEWRITER","language":"de"}' {"detail":"PDF host not allowed: evil.example.com"} $ curl -s http://127.0.0.1:8000/metrics | grep -E "^http_requests_total" http_requests_total{handler="/ocr",method="POST",status="4xx"} 1.0 ``` `Instrumentator(...).instrument(app).expose(app)` is doing its job — `http_*` metrics appear after the request, exactly as the PR body's grep expects. ### Not exercised locally - The Prometheus `/targets` UP/DOWN check (`http://localhost:9090/targets`) needs the observability stack. That stack is **not** part of `docker-compose.yml`; it lives in `docker-compose.observability.yml` per ADR-009 and ADR-016. Verifying the scrape edge will happen on deploy. The scrape job is already in `infra/observability/prometheus/prometheus.yml` (committed earlier on this branch: `e75ac8ec ops(observability): drop TODO from ocr-service scrape job in prometheus.yml`). 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Author
Owner

Follow-up issues filed

  • #654 — Add Prometheus alerting rules for OCR service (feature + devops + phase-7: monitoring)
  • #655 — Pre-existing test_startup_logs_warning_when_running_as_root fails under ASGITransport (bug + devops)
  • #656 — Add instrumentation-overhead NFR to OCR service (feature + devops)

🤖 Generated with Claude Code

## Follow-up issues filed - [#654](https://git.raddatz.cloud/marcel/familienarchiv/issues/654) — Add Prometheus alerting rules for OCR service (`feature` + `devops` + `phase-7: monitoring`) - [#655](https://git.raddatz.cloud/marcel/familienarchiv/issues/655) — Pre-existing `test_startup_logs_warning_when_running_as_root` fails under `ASGITransport` (`bug` + `devops`) - [#656](https://git.raddatz.cloud/marcel/familienarchiv/issues/656) — Add instrumentation-overhead NFR to OCR service (`feature` + `devops`) 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Author
Owner

🔁 Cycle 2 re-review — Felix Brandt

Verdict: Approved

Both blockers from my cycle-1 review are resolved and the new helpers are exactly the shape I asked for. The two concerns are also handled — partially via code change, partially via a clarifying test that locks in the chosen definition.

Cycle-1 blockers — resolved

  1. Word-counter duplication → fixed. _observe_block_words(words, threshold) extracted at ocr-service/main.py:89-99, called from three sites (run_ocr, unguided stream loop, no remaining duplication). Docstring even documents the words non-empty precondition. The helper is branch-free at the call sites, exactly the shape I sketched. Three sites → one helper, clean rule-of-three resolution.

  2. Training try/except duplication → fixed. _record_training(runner, kind) extracted at ocr-service/main.py:59-79. All three endpoints now end with one line:

    • /train: return await _record_training(lambda: asyncio.to_thread(_run_training), kind="recognition")
    • /train-sender: same, kind="recognition"
    • /segtrain: same, kind="segmentation"
      The error counter, success counter, and accuracy gauge are all driven from one place — future training endpoints add metrics for free.

Cycle-1 concerns — addressed

  1. Guided stream page_started after preprocess → resolved by choosing one definition and locking it in with a test. The guided path now sums time.monotonic() deltas around each engine.extract_region_text call (main.py:263, 269) and observes the sum once per page — engine time only, not preprocess, not spell-check. The histogram docstring in metrics.py:69 matches ("excluding preprocessing"), and test_ocr_processing_seconds_histogram_excludes_spell_check_time_in_guided_stream proves spell-check is excluded. Consistent with the unguided path (page_started set after preprocess, observed once around extract_page_blocks). Good.

  2. ocr_jobs_total.inc() before download in stream → resolved by ordering. Looking at the current run_ocr_stream: _download_and_convert_pdf is awaited on line 219 before the metrics.ocr_jobs_total.labels(...).inc() on line 230. If the download raises (SSRF guard, HTTP error), the counter is not incremented. test_ocr_jobs_total_not_incremented_when_pdf_download_fails_in_stream (new) proves this. Counter now records jobs that at least reached page-loop entry, not jobs that died at download — matches run_ocr's post-extract increment closely enough for analytics.

New issues from the fix commits

None blocking, two minor observations:

  • _record_training accepts both sync and async runners via inspect.isawaitable (main.py:69-71), but every call site passes lambda: asyncio.to_thread(...), which is always a coroutine. The sync branch is dead code at the three call sites we have today. I'd simplify the signature to runner: Callable[[], Awaitable[dict]] and drop the inspect import — KISS. Skip if you anticipate a future sync caller, but right now it's anticipating a use case that doesn't exist.

  • The new _record_training swallows the runner result's "loss"/"cer"/"epochs" fields and only inspects accuracy. That's fine — those other fields are returned as-is to the HTTP client — but the docstring on lines 64-66 says "Returns the result"; for clarity I'd note that only accuracy is observed; the rest of the dict is opaque.

Test additions verified

The cycle-1 test gaps are filled: test_ocr_training_runs_total_incremented_with_recognition_success_label_for_train_sender, test_ocr_model_accuracy_gauge_stays_default_when_training_returns_no_accuracy, test_ocr_processing_seconds_histogram_observed_per_page_in_guided_stream. The error-path test now patches subprocess.run instead of asyncio.to_thread (test_ocr_training_runs_total_incremented_with_recognition_error_label) — narrower seam, less coincidental coupling, exactly what Sara asked for.

Verification evidence (#11617)

Sufficient for the AC1/AC7 checkboxes. curl /metrics returns 200, all nine ocr_* families present with HELP/TYPE banners, ocr_models_ready 1.0 confirms the lifespan flip wires through, http_requests_total{handler="/ocr"} proves the instrumentator middleware fires. The /targets UP check is correctly deferred with a reason — the observability stack is a separate compose file and that's an on-deploy verification, not a local one.

LGTM. Ship it.

## 🔁 Cycle 2 re-review — Felix Brandt **Verdict: ✅ Approved** Both blockers from [my cycle-1 review](https://git.raddatz.cloud/marcel/familienarchiv/pulls/653#issuecomment-11600) are resolved and the new helpers are exactly the shape I asked for. The two concerns are also handled — partially via code change, partially via a clarifying test that locks in the chosen definition. ### Cycle-1 blockers — resolved 1. **Word-counter duplication** → fixed. `_observe_block_words(words, threshold)` extracted at `ocr-service/main.py:89-99`, called from three sites (`run_ocr`, unguided stream loop, no remaining duplication). Docstring even documents the `words` non-empty precondition. The helper is branch-free at the call sites, exactly the shape I sketched. Three sites → one helper, clean rule-of-three resolution. 2. **Training try/except duplication** → fixed. `_record_training(runner, kind)` extracted at `ocr-service/main.py:59-79`. All three endpoints now end with one line: - `/train`: `return await _record_training(lambda: asyncio.to_thread(_run_training), kind="recognition")` - `/train-sender`: same, `kind="recognition"` - `/segtrain`: same, `kind="segmentation"` The error counter, success counter, and accuracy gauge are all driven from one place — future training endpoints add metrics for free. ### Cycle-1 concerns — addressed 3. **Guided stream `page_started` after preprocess** → resolved by choosing one definition and locking it in with a test. The guided path now sums `time.monotonic()` deltas around each `engine.extract_region_text` call (`main.py:263, 269`) and observes the sum once per page — engine time only, not preprocess, not spell-check. The histogram docstring in `metrics.py:69` matches ("excluding preprocessing"), and `test_ocr_processing_seconds_histogram_excludes_spell_check_time_in_guided_stream` proves spell-check is excluded. Consistent with the unguided path (`page_started` set after preprocess, observed once around `extract_page_blocks`). Good. 4. **`ocr_jobs_total.inc()` before download in stream** → resolved by ordering. Looking at the current `run_ocr_stream`: `_download_and_convert_pdf` is awaited on line 219 before the `metrics.ocr_jobs_total.labels(...).inc()` on line 230. If the download raises (SSRF guard, HTTP error), the counter is not incremented. `test_ocr_jobs_total_not_incremented_when_pdf_download_fails_in_stream` (new) proves this. Counter now records jobs that at least reached page-loop entry, not jobs that died at download — matches `run_ocr`'s post-extract increment closely enough for analytics. ### New issues from the fix commits None blocking, two minor observations: - **`_record_training` accepts both sync and async runners via `inspect.isawaitable` (main.py:69-71)**, but every call site passes `lambda: asyncio.to_thread(...)`, which is always a coroutine. The sync branch is dead code at the three call sites we have today. I'd simplify the signature to `runner: Callable[[], Awaitable[dict]]` and drop the `inspect` import — KISS. Skip if you anticipate a future sync caller, but right now it's anticipating a use case that doesn't exist. - **The new `_record_training` swallows the runner result's "loss"/"cer"/"epochs" fields and only inspects `accuracy`.** That's fine — those other fields are returned as-is to the HTTP client — but the docstring on lines 64-66 says "Returns the result"; for clarity I'd note that only `accuracy` is observed; the rest of the dict is opaque. ### Test additions verified The cycle-1 test gaps are filled: `test_ocr_training_runs_total_incremented_with_recognition_success_label_for_train_sender`, `test_ocr_model_accuracy_gauge_stays_default_when_training_returns_no_accuracy`, `test_ocr_processing_seconds_histogram_observed_per_page_in_guided_stream`. The error-path test now patches `subprocess.run` instead of `asyncio.to_thread` (`test_ocr_training_runs_total_incremented_with_recognition_error_label`) — narrower seam, less coincidental coupling, exactly what Sara asked for. ### Verification evidence (#11617) Sufficient for the AC1/AC7 checkboxes. `curl /metrics` returns 200, all nine `ocr_*` families present with HELP/TYPE banners, `ocr_models_ready 1.0` confirms the lifespan flip wires through, `http_requests_total{handler="/ocr"}` proves the instrumentator middleware fires. The `/targets` UP check is correctly deferred with a reason — the observability stack is a separate compose file and that's an on-deploy verification, not a local one. LGTM. Ship it.
Author
Owner

🔁 Cycle 2 re-review — Markus Keller

Verdict: Approved

Every doc-update blocker from my cycle-1 review is resolved, and the new ADR is the right level of detail. Boring, well-documented, single-purpose change. This is how I want infrastructure additions delivered.

Cycle-1 blockers — resolved

  1. docs/architecture/c4/l2-containers.puml → updated. Two new relationships added, not just the OCR one I asked for:

    Rel(prometheus, backend, "Scrapes JVM + HTTP metrics", "HTTP 8081 /actuator/prometheus")
    Rel(prometheus, ocr, "Scrapes OCR + http_* metrics", "HTTP 8000 /metrics")
    

    The backend arrow was missing on the diagram before — fixing both at once was the right move. C4 L2 now accurately reflects what Prometheus scrapes.

  2. docs/OBSERVABILITY.md → updated with a complete metric reference. Every custom metric is listed with type, labels, unit, and what it tracks. The PII warning at the top of the table ("Never label these with PII or document content") is exactly the operational guardrail I'd expect. Canonical PromQL examples for throughput, illegible-word ratio, p95 timing, training error rate, and accuracy are included — copy-paste-ready for the dashboard work in #651. The Caddy block snippet for respond /metrics 404 is also documented — Tobias will need that the day someone exposes ocr.example.com.

  3. ADR for the factory + registry-injection pattern → added as docs/adr/023-prometheus-instrumentator-and-metrics-registry-injection.md. Context, Decision, Consequences (positive/negative), Alternatives Considered — all four sections present and tight. The "Alternatives considered" section is particularly good: it explicitly names why the autouse-fixture and the per-call-arg-registry alternatives were rejected, so a future maintainer cannot accidentally undo the decision. Cross-links to issue, library, and code. This is the ADR template applied correctly.

Cycle-1 concerns — addressed

  1. Module-level side effect at import time (metrics: OcrMetrics = build_metrics(REGISTRY)) → documented with an inline comment on main.py:47-50 explaining why it must be at import time (instrumentator registry binding) and what tests must do to override it (monkeypatch main.metrics, not the registry). The ADR's "Negative consequences" section also names the same trap. Acceptable — would still prefer this in the module docstring of metrics.py rather than only in main.py, but the in-place comment is enough.

  2. Unauthenticated /metrics → documented with a threat-model comment on main.py:121-125 and a Caddy block in OBSERVABILITY.md. The comment names CWE-200 explicitly and gates on ports: being added — exactly the operational invariant I wanted captured. ADR-023's "Negative consequences" cross-links to it.

New issues from the fix commits

None. The doc updates are factually correct, consistent with each other, and consistent with the code:

  • L2 PUML arrow direction matches prometheus.yml's scrape config.
  • OBSERVABILITY.md table labels match metrics.py's collector definitions exactly.
  • ADR-023's claim that "prometheus-client==0.25.0" is now pinned is verified in requirements.txt.

I checked the diff line-by-line for the kind of doc-vs-code contradiction that creeps in during fix loops (e.g. doc says "labelled by foo" but code labels by bar). None found.

Glossary check

docs/GLOSSARY.md got four new entries: Illegible word, Models-ready gauge, Recognition model accuracy, Segmentation model accuracy. The "Segmentation model accuracy" entry explicitly notes the distinction from recognition accuracy — good, future-Sara will not have to dig through ADRs to figure out why the two kind= label values exist.

Verification evidence (#11617)

The PR did the right thing — the /targets UP check is correctly deferred because the observability stack lives in a separate compose file per ADR-009 / ADR-016. Running it against the host venv with the same FastAPI app is a valid substitute for verifying the instrumentator wiring. The local verification covers what's verifiable locally; the post-deploy verification belongs in the deploy runbook, not in the PR.

Ship it.

## 🔁 Cycle 2 re-review — Markus Keller **Verdict: ✅ Approved** Every doc-update blocker from [my cycle-1 review](https://git.raddatz.cloud/marcel/familienarchiv/pulls/653#issuecomment-11602) is resolved, and the new ADR is the right level of detail. Boring, well-documented, single-purpose change. This is how I want infrastructure additions delivered. ### Cycle-1 blockers — resolved 1. **`docs/architecture/c4/l2-containers.puml`** → updated. Two new relationships added, not just the OCR one I asked for: ``` Rel(prometheus, backend, "Scrapes JVM + HTTP metrics", "HTTP 8081 /actuator/prometheus") Rel(prometheus, ocr, "Scrapes OCR + http_* metrics", "HTTP 8000 /metrics") ``` The backend arrow was missing on the diagram before — fixing both at once was the right move. C4 L2 now accurately reflects what Prometheus scrapes. 2. **`docs/OBSERVABILITY.md`** → updated with a complete metric reference. Every custom metric is listed with type, labels, unit, and what it tracks. The PII warning at the top of the table ("Never label these with PII or document content") is exactly the operational guardrail I'd expect. Canonical PromQL examples for throughput, illegible-word ratio, p95 timing, training error rate, and accuracy are included — copy-paste-ready for the dashboard work in #651. The Caddy block snippet for `respond /metrics 404` is also documented — Tobias will need that the day someone exposes ocr.example.com. 3. **ADR for the factory + registry-injection pattern** → added as [`docs/adr/023-prometheus-instrumentator-and-metrics-registry-injection.md`](docs/adr/023-prometheus-instrumentator-and-metrics-registry-injection.md). Context, Decision, Consequences (positive/negative), Alternatives Considered — all four sections present and tight. The "Alternatives considered" section is particularly good: it explicitly names *why* the autouse-fixture and the per-call-arg-registry alternatives were rejected, so a future maintainer cannot accidentally undo the decision. Cross-links to issue, library, and code. This is the ADR template applied correctly. ### Cycle-1 concerns — addressed 4. **Module-level side effect at import time (`metrics: OcrMetrics = build_metrics(REGISTRY)`)** → documented with an inline comment on `main.py:47-50` explaining *why* it must be at import time (instrumentator registry binding) and what tests must do to override it (monkeypatch `main.metrics`, not the registry). The ADR's "Negative consequences" section also names the same trap. Acceptable — would still prefer this in the module docstring of `metrics.py` rather than only in `main.py`, but the in-place comment is enough. 5. **Unauthenticated `/metrics`** → documented with a threat-model comment on `main.py:121-125` and a Caddy block in OBSERVABILITY.md. The comment names CWE-200 explicitly and gates on `ports:` being added — exactly the operational invariant I wanted captured. ADR-023's "Negative consequences" cross-links to it. ### New issues from the fix commits None. The doc updates are factually correct, consistent with each other, and consistent with the code: - L2 PUML arrow direction matches `prometheus.yml`'s scrape config. - OBSERVABILITY.md table labels match `metrics.py`'s collector definitions exactly. - ADR-023's claim that "`prometheus-client==0.25.0`" is now pinned is verified in `requirements.txt`. I checked the diff line-by-line for the kind of doc-vs-code contradiction that creeps in during fix loops (e.g. doc says "labelled by foo" but code labels by bar). None found. ### Glossary check `docs/GLOSSARY.md` got four new entries: Illegible word, Models-ready gauge, Recognition model accuracy, Segmentation model accuracy. The "Segmentation model accuracy" entry explicitly notes the distinction from recognition accuracy — good, future-Sara will not have to dig through ADRs to figure out why the two `kind=` label values exist. ### Verification evidence (#11617) The PR did the right thing — the `/targets` UP check is correctly deferred because the observability stack lives in a separate compose file per ADR-009 / ADR-016. Running it against the host venv with the same FastAPI app is a valid substitute for verifying the instrumentator wiring. The local verification covers what's verifiable locally; the post-deploy verification belongs in the deploy runbook, not in the PR. Ship it.
Author
Owner

🔁 Cycle 2 re-review — Tobias Wendt

Verdict: Approved

All five items from my cycle-1 review are either resolved in this PR or filed as a tracked follow-up. This is operationally complete.

Cycle-1 blockers — resolved

  1. Prometheus /targets UP verification → addressed pragmatically. The PR's own test plan items are now backed by the verification comment in #11617:

    • GET /metrics returns HTTP 200 ✓ (curl output included)
    • All nine ocr_* HELP/TYPE banners present ✓
    • ocr_models_ready 1.0 after lifespan ✓ (proves the gauge flip)
    • http_requests_total{handler="/ocr",method="POST",status="4xx"} 1.0 after a request ✓ (proves the instrumentator middleware fires)

    The /targets UP check at localhost:9090/targets is correctly deferred. The observability stack lives in docker-compose.observability.yml, not the main stack — Marcel cannot run it on his laptop without a multi-GB Surya/Kraken image rebuild. Verifying the scrape edge on deploy is the right call. I'd want a one-line in docs/DEPLOYMENT.md for the on-deploy verification step ("after deploy, check http://prom.example.com/targets shows ocr-service UP within 30s"), but that's a small follow-up rather than a blocker.

  2. Alerting follow-up issue → filed as #654. I read it. It has three concrete rules with thresholds (ocr_models_ready < 1 for 2m, >10% skipped page rate, >50% training error rate), correct for: durations, severity labels, PromQL that uses the right metric names, and a promtool check rules AC. This is a usable follow-up, not a placeholder. Labels (devops, feature, phase-7: monitoring) are correct.

Cycle-1 concerns — addressed

  1. Caddy block for /metrics and /health → added to docs/OBSERVABILITY.md. Snippet covers both endpoints in one @internal_only matcher. Good. I'd repeat this in docs/DEPLOYMENT.md next to the existing /actuator/* block so an ops engineer sees both routes in the same place, but the cross-reference from OBSERVABILITY.md is enough for now.

  2. prometheus-client transitive pin → fixed in ocr-service/requirements.txt:

    prometheus-fastapi-instrumentator==7.0.0
    prometheus-client==0.25.0
    

    Both pinned, both reproducible. Renovate will create bump PRs for either. ADR-023 documents the implication ("upgrading it requires an explicit bump and re-checking the instrumentator's compatibility range") — that's the correct operational note.

  3. MetricsPathFilter is uvicorn-only → documented in docs/OBSERVABILITY.md § Internal-only endpoints:

    Promtail/Loki still receive access lines from any other source. Treat the filter as console noise-control, not an audit-suppression mechanism.
    Exactly the operator-facing caveat I asked for. Future-me will not chase a "missing log" ghost in Loki.

New issues from the fix commits

None operationally. Spot-checks I did:

  • prometheus.yml diff is the cleanest possible (just removes the TODO block, scrape config unchanged) — no risk of breaking the existing backend scrape.
  • The MetricsPathFilter addFilter call is at module scope (main.py, after Instrumentator(...).expose(app)) — runs once per process, no double-registration on reloads.
  • excluded_handlers=["/health", "/metrics"] on the Instrumentator stops /health and /metrics themselves from polluting http_requests_total — would have been a silly cardinality leak if missed.
  • The two follow-up issues #655 (pre-existing ASGITransport lifespan bug) and #656 (instrumentation overhead NFR) are the right scope and the right labels. #656's "5 ms budget, calibrate first" is the right way to introduce a perf NFR — don't lock in a number you haven't measured.

One small forward note

When #654 lands, add rule_files: [rules/*.yml] to prometheus.yml and a docker-compose.observability.yml volume mount for infra/observability/prometheus/rules/. The current prometheus.yml does not mount a rules directory — that change is the prerequisite for #654, not part of it.

Ship it.

## 🔁 Cycle 2 re-review — Tobias Wendt **Verdict: ✅ Approved** All five items from [my cycle-1 review](https://git.raddatz.cloud/marcel/familienarchiv/pulls/653#issuecomment-11605) are either resolved in this PR or filed as a tracked follow-up. This is operationally complete. ### Cycle-1 blockers — resolved 1. **Prometheus `/targets` UP verification** → addressed pragmatically. The PR's own test plan items are now backed by the verification comment in [#11617](https://git.raddatz.cloud/marcel/familienarchiv/pulls/653#issuecomment-11617): - `GET /metrics` returns HTTP 200 ✓ (curl output included) - All nine `ocr_*` HELP/TYPE banners present ✓ - `ocr_models_ready 1.0` after lifespan ✓ (proves the gauge flip) - `http_requests_total{handler="/ocr",method="POST",status="4xx"} 1.0` after a request ✓ (proves the instrumentator middleware fires) The `/targets` UP check at `localhost:9090/targets` is correctly deferred. The observability stack lives in `docker-compose.observability.yml`, not the main stack — Marcel cannot run it on his laptop without a multi-GB Surya/Kraken image rebuild. Verifying the scrape edge on deploy is the right call. I'd want a one-line in `docs/DEPLOYMENT.md` for the on-deploy verification step ("after deploy, check `http://prom.example.com/targets` shows ocr-service UP within 30s"), but that's a small follow-up rather than a blocker. 2. **Alerting follow-up issue** → filed as [#654](https://git.raddatz.cloud/marcel/familienarchiv/issues/654). I read it. It has three concrete rules with thresholds (`ocr_models_ready < 1 for 2m`, `>10% skipped page rate`, `>50% training error rate`), correct `for:` durations, severity labels, PromQL that uses the right metric names, and a `promtool check rules` AC. This is a usable follow-up, not a placeholder. Labels (`devops`, `feature`, `phase-7: monitoring`) are correct. ### Cycle-1 concerns — addressed 3. **Caddy block for `/metrics` and `/health`** → added to `docs/OBSERVABILITY.md`. Snippet covers both endpoints in one `@internal_only` matcher. Good. I'd repeat this in `docs/DEPLOYMENT.md` next to the existing `/actuator/*` block so an ops engineer sees both routes in the same place, but the cross-reference from OBSERVABILITY.md is enough for now. 4. **`prometheus-client` transitive pin** → fixed in `ocr-service/requirements.txt`: ``` prometheus-fastapi-instrumentator==7.0.0 prometheus-client==0.25.0 ``` Both pinned, both reproducible. Renovate will create bump PRs for either. ADR-023 documents the implication ("upgrading it requires an explicit bump and re-checking the instrumentator's compatibility range") — that's the correct operational note. 5. **MetricsPathFilter is uvicorn-only** → documented in `docs/OBSERVABILITY.md` § Internal-only endpoints: > Promtail/Loki still receive access lines from any other source. Treat the filter as console noise-control, not an audit-suppression mechanism. Exactly the operator-facing caveat I asked for. Future-me will not chase a "missing log" ghost in Loki. ### New issues from the fix commits None operationally. Spot-checks I did: - `prometheus.yml` diff is the cleanest possible (just removes the TODO block, scrape config unchanged) — no risk of breaking the existing backend scrape. - The `MetricsPathFilter` `addFilter` call is at module scope (`main.py`, after `Instrumentator(...).expose(app)`) — runs once per process, no double-registration on reloads. - `excluded_handlers=["/health", "/metrics"]` on the Instrumentator stops `/health` and `/metrics` themselves from polluting `http_requests_total` — would have been a silly cardinality leak if missed. - The two follow-up issues #655 (pre-existing ASGITransport lifespan bug) and #656 (instrumentation overhead NFR) are the right scope and the right labels. #656's "5 ms budget, calibrate first" is the right way to introduce a perf NFR — don't lock in a number you haven't measured. ### One small forward note When [#654](https://git.raddatz.cloud/marcel/familienarchiv/issues/654) lands, add `rule_files: [rules/*.yml]` to `prometheus.yml` and a `docker-compose.observability.yml` volume mount for `infra/observability/prometheus/rules/`. The current `prometheus.yml` does not mount a rules directory — that change is the prerequisite for #654, not part of it. Ship it.
Author
Owner

🔁 Cycle 2 re-review — Nora "NullX" Steiner

Verdict: Approved

Both concerns from my cycle-1 review are resolved. The new test, the new comment, and the new documentation each pin down one piece of the threat model that was implicit before. This is the right way to ship security-adjacent code: not "is it currently safe?" but "is the next maintainer going to know what made it safe?"

Cycle-1 concerns — resolved

  1. Unauthenticated /metrics (CWE-200 information disclosure) → addressed at three layers, exactly the defense-in-depth I asked for:

    Code (ocr-service/main.py:121-125):

    # /metrics is unauthenticated — relies on Docker-internal-network exposure
    # only (CWE-200 risk if `ports:` ever maps 8000 to host). See
    # docs/OBSERVABILITY.md §Internal-only endpoints for the Caddy block snippet.
    Instrumentator(excluded_handlers=["/health", "/metrics"]).instrument(app).expose(app)
    

    The comment names the CWE explicitly, names the trigger condition (ports: mapping), and cross-references the operational mitigation. This is the "comments explain the threat model" pattern I keep asking for. Good.

    Documentation (docs/OBSERVABILITY.md):

    ocr.example.com {
        @internal_only path /metrics /health
        respond @internal_only 404
        reverse_proxy ocr:8000
    }
    

    This is the right shape — respond 404 rather than respond 403 (deny without confirming existence). Caddy snippet is correct.

    ADR-023 explicitly captures the trade-off under "Negative consequences": "/metrics is exposed unauthenticated and relies on the Docker internal network for confidentiality." Future-me cannot accidentally undo this without reading the ADR.

    The excluded_handlers=["/health", "/metrics"] parameter on Instrumentator(...) also stops these two endpoints from polluting http_requests_total{handler=...} — secondary benefit, prevents an attacker who reaches /metrics from counting their own probing.

  2. MetricsPathFilter defensive against uvicorn format changes → fixed. The new test test_uvicorn_access_log_filter_fails_open_on_short_or_missing_args (test_metrics.py:557-575) covers exactly the case I asked for:

    none_record = _logging.LogRecord(..., args=None, ...)
    short_record = _logging.LogRecord(..., args=("a", "b"), ...)
    assert filt.filter(none_record) is True
    assert filt.filter(short_record) is True
    

    Fail-open behavior is locked in. If uvicorn ever changes to keyword args, the filter forwards the record to the handler — /metrics requests would briefly appear in console logs, but no log lines are silently dropped (which is the audit-trail concern). The docstring on the test explicitly names the invariant: "default-allow records when args is None or shorter than expected." Permanent regression coverage.

New issues from the fix commits

I audited the diff for net-new security surface. Findings:

  • _record_training (main.py:59-79) — accepts a callable, raises through to the caller. No new injection surface. The kind= label value is "recognition"/"segmentation", hard-coded at the call site — not user-controlled. Counter cardinality stays bounded.
  • _observe_block_words (main.py:89-99) — reads w["confidence"] and counts. Cardinality bounded. No PII in any counter or label. The reader of OBSERVABILITY.md is now explicitly warned: "Never label these with PII or document content".
  • MetricsPathFilter._SUPPRESSED_PATHS = {"/metrics", "/health"} — closed set, no user input reaches the filter. Safe.
  • The Instrumentator's auto-instrumented http_requests_total{handler=...} still exposes the endpoint topology (including /train, /train-sender, /segtrain) to anyone who reaches /metrics. That's documented now in the threat-model comment and the Caddy snippet ��� acceptable given the internal-network constraint.

One forward note for #654

When the alerting rules land, please do not add a rule that aggregates by script_type (e.g. sum by (script_type) (rate(ocr_jobs_total[5m]))). script_type comes from user-controlled request input — bounded today by frontend enum, but if a backend bug ever allows pass-through, an attacker could spike cardinality via crafted requests. Aggregate by engine (only kraken/surya) or by no labels.

Ship it.

## 🔁 Cycle 2 re-review — Nora "NullX" Steiner **Verdict: ✅ Approved** Both concerns from [my cycle-1 review](https://git.raddatz.cloud/marcel/familienarchiv/pulls/653#issuecomment-11606) are resolved. The new test, the new comment, and the new documentation each pin down one piece of the threat model that was implicit before. This is the right way to ship security-adjacent code: not "is it currently safe?" but "is the next maintainer going to know what made it safe?" ### Cycle-1 concerns — resolved 1. **Unauthenticated `/metrics` (CWE-200 information disclosure)** → addressed at three layers, exactly the defense-in-depth I asked for: **Code** (`ocr-service/main.py:121-125`): ```python # /metrics is unauthenticated — relies on Docker-internal-network exposure # only (CWE-200 risk if `ports:` ever maps 8000 to host). See # docs/OBSERVABILITY.md §Internal-only endpoints for the Caddy block snippet. Instrumentator(excluded_handlers=["/health", "/metrics"]).instrument(app).expose(app) ``` The comment names the CWE explicitly, names the trigger condition (`ports:` mapping), and cross-references the operational mitigation. This is the "comments explain the threat model" pattern I keep asking for. Good. **Documentation** (`docs/OBSERVABILITY.md`): ```caddy ocr.example.com { @internal_only path /metrics /health respond @internal_only 404 reverse_proxy ocr:8000 } ``` This is the right shape — `respond 404` rather than `respond 403` (deny without confirming existence). Caddy snippet is correct. **ADR-023** explicitly captures the trade-off under "Negative consequences": "`/metrics` is exposed unauthenticated and relies on the Docker internal network for confidentiality." Future-me cannot accidentally undo this without reading the ADR. The `excluded_handlers=["/health", "/metrics"]` parameter on `Instrumentator(...)` also stops these two endpoints from polluting `http_requests_total{handler=...}` — secondary benefit, prevents an attacker who reaches `/metrics` from counting their own probing. 2. **`MetricsPathFilter` defensive against uvicorn format changes** → fixed. The new test `test_uvicorn_access_log_filter_fails_open_on_short_or_missing_args` (`test_metrics.py:557-575`) covers exactly the case I asked for: ```python none_record = _logging.LogRecord(..., args=None, ...) short_record = _logging.LogRecord(..., args=("a", "b"), ...) assert filt.filter(none_record) is True assert filt.filter(short_record) is True ``` Fail-open behavior is locked in. If uvicorn ever changes to keyword args, the filter forwards the record to the handler — `/metrics` requests would briefly appear in console logs, but no log lines are silently dropped (which is the audit-trail concern). The docstring on the test explicitly names the invariant: "default-allow records when args is None or shorter than expected." Permanent regression coverage. ### New issues from the fix commits I audited the diff for net-new security surface. Findings: - **`_record_training` (`main.py:59-79`)** — accepts a callable, raises through to the caller. No new injection surface. The `kind=` label value is `"recognition"`/`"segmentation"`, hard-coded at the call site — not user-controlled. Counter cardinality stays bounded. - **`_observe_block_words` (`main.py:89-99`)** — reads `w["confidence"]` and counts. Cardinality bounded. No PII in any counter or label. The reader of OBSERVABILITY.md is now explicitly warned: "**Never label these with PII or document content**". - **`MetricsPathFilter._SUPPRESSED_PATHS = {"/metrics", "/health"}`** — closed set, no user input reaches the filter. Safe. - The Instrumentator's auto-instrumented `http_requests_total{handler=...}` still exposes the endpoint topology (including `/train`, `/train-sender`, `/segtrain`) to anyone who reaches `/metrics`. That's documented now in the threat-model comment and the Caddy snippet ��� acceptable given the internal-network constraint. ### One forward note for #654 When the alerting rules land, please do not add a rule that aggregates by `script_type` (e.g. `sum by (script_type) (rate(ocr_jobs_total[5m]))`). `script_type` comes from user-controlled request input — bounded today by frontend enum, but if a backend bug ever allows pass-through, an attacker could spike cardinality via crafted requests. Aggregate by `engine` (only `kraken`/`surya`) or by no labels. Ship it.
Author
Owner

🔁 Cycle 2 re-review — Sara Holt

Verdict: ⚠️ Approved with concerns

Every cycle-1 coverage gap from my review is filled, and the error-path test was refactored to a narrower seam — exactly what I asked for. The duplication concern is also resolved via the new ocr_client async context manager. One new finding: a timing-dependent assertion that I'd flag as a flake risk.

Cycle-1 concerns — resolved

  1. patch("main.asyncio.to_thread", ...) coincidental coupling in error-path test → fixed. test_ocr_training_runs_total_incremented_with_recognition_error_label (test_metrics.py:344-371) now patches main.subprocess.run with a CompletedProcess(returncode=1, ...) to drive the real _run_training error path. The docstring explicitly names the choice:

    Uses the narrowest available seam — subprocess.run returning a failing CompletedProcess — instead of stubbing the asyncio.to_thread boundary, so the test exercises the real _run_training error path.
    This is what good test refactoring looks like. The test now verifies the actual code path, not the mock topology.

  2. Coverage gaps → all three filled:

    • test_ocr_training_runs_total_incremented_with_recognition_success_label_for_train_sender (test_metrics.py:394-416) — covers /train-sender, asserts kind="recognition", outcome="success". If anyone changes the kind label to "sender", this test catches it.
    • test_ocr_model_accuracy_gauge_stays_default_when_training_returns_no_accuracy (test_metrics.py:419-441) — covers the if result.get("accuracy") is not None guard. Asserts _value.get() == 0.0 when accuracy is None. The uncovered branch I named in cycle 1 is now covered.
    • test_ocr_processing_seconds_histogram_observed_per_page_in_guided_stream (test_metrics.py:464-490) — covers the guided streaming path's histogram observation. count == 2.0 after 2 pages, sum non-negative. The guided path is now under test.
  3. Six tests with duplicated with patch(...) stacks → resolved with the new ocr_client(*, raise_app_exceptions=True) async context manager (test_metrics.py:21-37). It bundles kraken_engine.load_models + load_spell_checker patches, the ASGITransport+AsyncClient, the _models_ready = True flag flip, and the cleanup. The tests that use it are now 5-7 lines shorter each. The non-ocr_client tests (training endpoints) still have their own patch stacks because they need TRAINING_TOKEN + _models_ready + asyncio.to_thread — different concerns. Good judgment on where to stop extracting.

New issues from the fix commits

  1. ⚠️ Timing-dependent assertion is a flake risk (test_metrics.py:493-528). The new test_ocr_processing_seconds_histogram_excludes_spell_check_time_in_guided_stream is conceptually right — it locks in "engine-only timing, excluding spell-check" — but the assertion bound is tight:

    def slow_correct(text):
        _time.sleep(0.05)
        return text
    ...
    # Spell-check sleeps 0.05s per region × 2 regions = 0.1s; engine work is instantaneous.
    # If timing included spell-check, sum_seconds would be ≥ 0.1s. Allow 30ms slack
    # for scheduler overhead.
    assert sum_seconds < 0.05, ...
    

    The slack stated in the comment is 30ms, but the assertion bound is 50ms. On a loaded CI runner, an asyncio.to_thread-scheduled mock returning "text" is not literally zero time — context switches alone can hit single-digit ms. With two regions and time.monotonic() calls bracketing each, I'd expect 5-15ms of measurement overhead in the happy case. 50ms is probably OK, but I would (a) widen the bound to < 0.07 (twice the stated 30ms slack), or (b) replace the timing assertion with a structural one — e.g. patch time.monotonic and assert the deltas exclude the spell-check call entirely. Pure-timing assertions on CI are the #1 source of flakes in this codebase.

  2. fresh_metrics fixture coexists with the production metrics: OcrMetrics = build_metrics(REGISTRY) import-time call (main.py:42). For tests that don't use the fixture (e.g. test_metrics_endpoint_returns_200, test_metrics_includes_http_request_metrics_after_ocr_call), they hit the global REGISTRY and accumulate state across test runs. Today this works because those tests only assert on the presence of metric names, not on counter values. If anyone later adds a value assertion to one of them, the test will pass solo and fail in the suite, which is the classic state-leak symptom. Worth a # uses global REGISTRY by design comment on those two tests, or — better — use fresh_metrics everywhere counter-touching code paths run.

Verification evidence (#11617)

Adequate for the local-verifiable items. /targets deferral to deploy is fine — I'd want a CI smoke test against the observability stack eventually (the test plan note in #11617 suggests this would close the manual loop), but that's a follow-up for #656's benchmark runbook.

#655 (pre-existing test_startup_logs_warning_when_running_as_root failure) — thanks for filing it. The root cause analysis in the issue body is correct: ASGITransport does not run lifespan by default; the metrics suite routes around it with app.router.lifespan_context(app) and the same pattern fixes the failing test.

TL;DR

One genuine concern (timing flake risk on the spell-check exclusion test). Coverage gaps from cycle 1 closed. The author's test refactoring shows real attention to feedback — error-path test went from "patches the boundary" to "patches subprocess.run," which is the right direction. Ship after the timing assertion bound is widened or replaced.

## 🔁 Cycle 2 re-review — Sara Holt **Verdict: ⚠️ Approved with concerns** Every cycle-1 coverage gap from [my review](https://git.raddatz.cloud/marcel/familienarchiv/pulls/653#issuecomment-11607) is filled, and the error-path test was refactored to a narrower seam — exactly what I asked for. The duplication concern is also resolved via the new `ocr_client` async context manager. One new finding: a timing-dependent assertion that I'd flag as a flake risk. ### Cycle-1 concerns — resolved 1. **`patch("main.asyncio.to_thread", ...)` coincidental coupling in error-path test** → fixed. `test_ocr_training_runs_total_incremented_with_recognition_error_label` (`test_metrics.py:344-371`) now patches `main.subprocess.run` with a `CompletedProcess(returncode=1, ...)` to drive the real `_run_training` error path. The docstring explicitly names the choice: > Uses the narrowest available seam — `subprocess.run` returning a failing CompletedProcess — instead of stubbing the asyncio.to_thread boundary, so the test exercises the real `_run_training` error path. This is what good test refactoring looks like. The test now verifies the actual code path, not the mock topology. 2. **Coverage gaps** → all three filled: - ✅ `test_ocr_training_runs_total_incremented_with_recognition_success_label_for_train_sender` (`test_metrics.py:394-416`) — covers `/train-sender`, asserts `kind="recognition", outcome="success"`. If anyone changes the kind label to `"sender"`, this test catches it. - ✅ `test_ocr_model_accuracy_gauge_stays_default_when_training_returns_no_accuracy` (`test_metrics.py:419-441`) — covers the `if result.get("accuracy") is not None` guard. Asserts `_value.get() == 0.0` when accuracy is None. The uncovered branch I named in cycle 1 is now covered. - ✅ `test_ocr_processing_seconds_histogram_observed_per_page_in_guided_stream` (`test_metrics.py:464-490`) — covers the guided streaming path's histogram observation. `count == 2.0` after 2 pages, sum non-negative. The guided path is now under test. 3. **Six tests with duplicated `with patch(...)` stacks** → resolved with the new `ocr_client(*, raise_app_exceptions=True)` async context manager (`test_metrics.py:21-37`). It bundles `kraken_engine.load_models` + `load_spell_checker` patches, the ASGITransport+AsyncClient, the `_models_ready = True` flag flip, and the cleanup. The tests that use it are now 5-7 lines shorter each. The non-ocr_client tests (training endpoints) still have their own patch stacks because they need `TRAINING_TOKEN` + `_models_ready` + `asyncio.to_thread` — different concerns. Good judgment on where to stop extracting. ### New issues from the fix commits 1. **⚠️ Timing-dependent assertion is a flake risk** (`test_metrics.py:493-528`). The new `test_ocr_processing_seconds_histogram_excludes_spell_check_time_in_guided_stream` is conceptually right — it locks in "engine-only timing, excluding spell-check" — but the assertion bound is tight: ```python def slow_correct(text): _time.sleep(0.05) return text ... # Spell-check sleeps 0.05s per region × 2 regions = 0.1s; engine work is instantaneous. # If timing included spell-check, sum_seconds would be ≥ 0.1s. Allow 30ms slack # for scheduler overhead. assert sum_seconds < 0.05, ... ``` The slack stated in the comment is 30ms, but the assertion bound is 50ms. On a loaded CI runner, an `asyncio.to_thread`-scheduled mock returning `"text"` is not literally zero time — context switches alone can hit single-digit ms. With two regions and `time.monotonic()` calls bracketing each, I'd expect 5-15ms of measurement overhead in the happy case. 50ms is probably OK, but I would (a) widen the bound to `< 0.07` (twice the stated 30ms slack), or (b) replace the timing assertion with a structural one — e.g. patch `time.monotonic` and assert the deltas exclude the spell-check call entirely. Pure-timing assertions on CI are the #1 source of flakes in this codebase. 2. **`fresh_metrics` fixture coexists with the production `metrics: OcrMetrics = build_metrics(REGISTRY)` import-time call** (`main.py:42`). For tests that don't use the fixture (e.g. `test_metrics_endpoint_returns_200`, `test_metrics_includes_http_request_metrics_after_ocr_call`), they hit the global REGISTRY and accumulate state across test runs. Today this works because those tests only assert on the *presence* of metric names, not on counter values. If anyone later adds a value assertion to one of them, the test will pass solo and fail in the suite, which is the classic state-leak symptom. Worth a `# uses global REGISTRY by design` comment on those two tests, or — better — use `fresh_metrics` everywhere counter-touching code paths run. ### Verification evidence (#11617) Adequate for the local-verifiable items. `/targets` deferral to deploy is fine — I'd want a CI smoke test against the observability stack eventually (the test plan note in [#11617](https://git.raddatz.cloud/marcel/familienarchiv/pulls/653#issuecomment-11617) suggests this would close the manual loop), but that's a follow-up for [#656](https://git.raddatz.cloud/marcel/familienarchiv/issues/656)'s benchmark runbook. [#655](https://git.raddatz.cloud/marcel/familienarchiv/issues/655) (pre-existing `test_startup_logs_warning_when_running_as_root` failure) — thanks for filing it. The root cause analysis in the issue body is correct: `ASGITransport` does not run lifespan by default; the metrics suite routes around it with `app.router.lifespan_context(app)` and the same pattern fixes the failing test. ### TL;DR One genuine concern (timing flake risk on the spell-check exclusion test). Coverage gaps from cycle 1 closed. The author's test refactoring shows real attention to feedback — error-path test went from "patches the boundary" to "patches `subprocess.run`," which is the right direction. Ship after the timing assertion bound is widened or replaced.
Author
Owner

🔁 Cycle 2 re-review — Leonie Voss

Verdict: Approved — out of scope for UI/UX (unchanged from cycle 1)

Re-confirming my cycle-1 verdict: this PR remains entirely backend-only. The cycle-2 fix commits are all in ocr-service/main.py, ocr-service/metrics.py, ocr-service/test_metrics.py, docs/, and infra/observability/. No SvelteKit, no Tailwind, no Paraglide.

What I checked in cycle 2

  • No new frontend files in any of the 11 fix commits — confirmed by re-reading the file list (docs/GLOSSARY.md, docs/OBSERVABILITY.md, docs/adr/023-...md, docs/architecture/c4/l2-containers.puml, infra/observability/prometheus/prometheus.yml, ocr-service/main.py, ocr-service/metrics.py, ocr-service/requirements.txt, ocr-service/test_metrics.py).
  • No new user-facing error codes that would need i18n keys in messages/{de,en,es}.json. The error message in OBSERVABILITY.md ("Models not loaded yet" → HTTP 503) is operator-facing on /health, not user-facing.
  • No new flows touching the UI. The /metrics endpoint is for Prometheus, not for any human reader.
  • No new strings in transcription, document, or annotation surface code — the OCR domain's only frontend touchpoint is the streaming response shape, which is unchanged by this PR.

Forward note (still applies for #651)

When the Grafana dashboards land via #651 and the alerting via #654, please:

  • Pair status colors with icons and labels — no traffic-light reds alone (color-blind users, ~8% of male operators).
  • Minimum 12px font on every panel; prefer 14px for KPI values.
  • Verify dashboards render at 320px width — we have operators on phones during incidents. The ocr_models_ready and ocr_skipped_pages_total ratios are the two values most likely to be glanced at on a mobile alert page.
  • Use the brand palette (--palette-navy, --palette-mint) where possible; consider a sand background overlay if Grafana lets you set per-dashboard theming.

I'll review #651's dashboard PR when it opens — that's where the UI/UX rigor matters.

Verification evidence (#11617)

Not applicable — verification was correctly scoped to backend / observability surface. No UI to verify.

LGTM from a UX standpoint. No UI surface in this PR.

## 🔁 Cycle 2 re-review — Leonie Voss **Verdict: ✅ Approved — out of scope for UI/UX (unchanged from cycle 1)** Re-confirming [my cycle-1 verdict](https://git.raddatz.cloud/marcel/familienarchiv/pulls/653#issuecomment-11608): this PR remains entirely backend-only. The cycle-2 fix commits are all in `ocr-service/main.py`, `ocr-service/metrics.py`, `ocr-service/test_metrics.py`, `docs/`, and `infra/observability/`. No SvelteKit, no Tailwind, no Paraglide. ### What I checked in cycle 2 - **No new frontend files** in any of the 11 fix commits — confirmed by re-reading the file list (`docs/GLOSSARY.md`, `docs/OBSERVABILITY.md`, `docs/adr/023-...md`, `docs/architecture/c4/l2-containers.puml`, `infra/observability/prometheus/prometheus.yml`, `ocr-service/main.py`, `ocr-service/metrics.py`, `ocr-service/requirements.txt`, `ocr-service/test_metrics.py`). - **No new user-facing error codes** that would need i18n keys in `messages/{de,en,es}.json`. The error message in OBSERVABILITY.md ("Models not loaded yet" → HTTP 503) is operator-facing on `/health`, not user-facing. - **No new flows touching the UI**. The `/metrics` endpoint is for Prometheus, not for any human reader. - **No new strings in transcription, document, or annotation surface code** — the OCR domain's only frontend touchpoint is the streaming response shape, which is unchanged by this PR. ### Forward note (still applies for #651) When the Grafana dashboards land via [#651](https://git.raddatz.cloud/marcel/familienarchiv/issues/651) and the alerting via [#654](https://git.raddatz.cloud/marcel/familienarchiv/issues/654), please: - Pair status colors with icons and labels — no traffic-light reds alone (color-blind users, ~8% of male operators). - Minimum 12px font on every panel; prefer 14px for KPI values. - Verify dashboards render at 320px width — we have operators on phones during incidents. The `ocr_models_ready` and `ocr_skipped_pages_total` ratios are the two values most likely to be glanced at on a mobile alert page. - Use the brand palette (`--palette-navy`, `--palette-mint`) where possible; consider a sand background overlay if Grafana lets you set per-dashboard theming. I'll review #651's dashboard PR when it opens — that's where the UI/UX rigor matters. ### Verification evidence (#11617) Not applicable — verification was correctly scoped to backend / observability surface. No UI to verify. LGTM from a UX standpoint. No UI surface in this PR.
Author
Owner

🔁 Cycle 2 re-review — Elicit (Brownfield mode)

Verdict: Approved

All four items from my cycle-1 review are either resolved or filed as a tracked follow-up with proper labels, milestones, and ACs. The traceability from issue #652 → PR #653 → ADR-023 → follow-ups #654/#655/#656 is intact. This is the level of requirements hygiene I want for every feature delivery.

Cycle-1 blockers — resolved

  1. Two unchecked Test plan items → addressed in two correct ways:

    • Locally verifiable items were verified in #11617: GET /metrics returns 200, all nine ocr_* HELP/TYPE banners present, ocr_models_ready 1.0 after lifespan, http_requests_total{handler="/ocr"} after a request. Each claim backed by a $ curl ... transcript. This converts the unchecked items into evidence.
    • The non-local item (http://localhost:9090/targets showing UP) is correctly deferred with a documented reason: the observability stack lives in docker-compose.observability.yml per ADR-009 / ADR-016. The verification will happen on deploy. This is the correct disposition — not "we'll do it later," but "we have a different runbook for this class of check."

    I would still want a one-line acceptance check added to a deploy runbook ("after deploy, verify http://prom.example.com/targets shows ocr-service UP within 30s"), but that's a polish point, not a requirements gap. The PR's own Test plan now reads as a faithful record of what was verified where.

  2. Alerting follow-up issue → filed as #654 with the proper labels (devops + feature + phase-7: monitoring), three concrete rules with for: durations and severity labels, PromQL that uses the metric names actually shipped here, a promtool check rules AC, and cross-references to ADR-023 and PR #653. This is a usable follow-up issue, not a placeholder. The traceability is intact: #652 (parent feature) → #653 (this PR) → #654 (the alert rules that consume #653's metrics) → ADR-023 (the architectural decision that produced both).

Cycle-1 concerns — addressed

  1. NFR coverage on #652:

    • Observability (primary NFR) — met, evidence in PR body.
    • Performance → filed as #656 with the suggested 5ms p95 budget I named, a request to commit the NFR to docs/architecture/NFRs.md, a reproducible load-test script AC, and a stretch goal of nightly CI benchmarking. The issue body explicitly says "calibrate from one baseline + one instrumented run before locking it in" — exactly the right way to introduce a perf NFR. Good.
    • Privacy — documented in docs/OBSERVABILITY.md: "Never label these with PII or document content — labels have unbounded cardinality risk and are visible to anyone with Grafana access." This is the operator-facing invariant I wanted recorded.
    • Reliability — ADR-023 captures the registry-collision concern under "Negative consequences." The fresh-registry test fixture ensures the invariant is exercised on every CI run.
  2. Glossary gap → addressed. docs/GLOSSARY.md now lists four new terms: Illegible word, Models-ready gauge, Recognition model accuracy, Segmentation model accuracy. Each entry explains what the term means and which metric exposes it. The two accuracy entries explicitly call out that recognition vs segmentation are independent metrics — operators reading the dashboard will know why both exist. Good.

New issues from the fix commits

I read every doc change for the kind of doc-vs-code contradiction that creeps in during fix loops. Findings:

  • The OBSERVABILITY.md metrics table matches metrics.py's collector definitions exactly (labels, types, units).
  • The ADR-023 claim "prometheus-client==0.25.0 pinned explicitly" is verified in requirements.txt.
  • The L2 PUML arrow Rel(prometheus, ocr, "Scrapes OCR + http_* metrics", "HTTP 8000 /metrics") matches the actual scrape config in prometheus.yml (target ocr:8000, path /metrics).
  • The Glossary entry "Models-ready gauge — flipped from 0 to 1 once the FastAPI lifespan startup has finished loading the Kraken model and the spell-checker" matches main.py lifespan exactly (loads kraken_engine → loads spell_checker → sets _models_ready = True → sets ocr_models_ready.set(1)).
  • The Glossary entry for the alert "supervised signal for the ocr_models_ready < 1 for 2m alert" cross-references #654's Alert 1 exactly.

No contradictions found.

One small forward note

The Definition of Done for #652 included AC1 ("Prometheus target flips DOWN → UP after image rebuild"). Strictly, that AC is closed only after the deployment-host verification runs. When you tick #652 as done, please reference the deploy log / /targets screenshot in the issue close-comment. Otherwise #652 closes with one paper AC still unverified.

TL;DR

The author treated cycle-1 feedback as a structured backlog, resolved the in-scope items in this PR, and filed three follow-up issues (#654, #655, #656) with proper labels and ACs for the out-of-scope items. The traceability from feature → PR → ADR → follow-ups is intact. The verification evidence is sufficient for what's locally verifiable, with explicit deferral for what isn't.

This is what good requirements delivery looks like. Approved.

## 🔁 Cycle 2 re-review — Elicit (Brownfield mode) **Verdict: ✅ Approved** All four items from [my cycle-1 review](https://git.raddatz.cloud/marcel/familienarchiv/pulls/653#issuecomment-11610) are either resolved or filed as a tracked follow-up with proper labels, milestones, and ACs. The traceability from issue #652 → PR #653 → ADR-023 → follow-ups #654/#655/#656 is intact. This is the level of requirements hygiene I want for every feature delivery. ### Cycle-1 blockers — resolved 1. **Two unchecked Test plan items** → addressed in two correct ways: - **Locally verifiable items** were verified in [#11617](https://git.raddatz.cloud/marcel/familienarchiv/pulls/653#issuecomment-11617): `GET /metrics` returns 200, all nine `ocr_*` HELP/TYPE banners present, `ocr_models_ready 1.0` after lifespan, `http_requests_total{handler="/ocr"}` after a request. Each claim backed by a `$ curl ...` transcript. This converts the unchecked items into evidence. - **The non-local item** (`http://localhost:9090/targets` showing UP) is correctly deferred with a documented reason: the observability stack lives in `docker-compose.observability.yml` per ADR-009 / ADR-016. The verification will happen on deploy. This is the correct disposition — not "we'll do it later," but "we have a different runbook for this class of check." I would still want a one-line acceptance check added to a deploy runbook ("after deploy, verify `http://prom.example.com/targets` shows ocr-service UP within 30s"), but that's a polish point, not a requirements gap. The PR's own Test plan now reads as a faithful record of what was verified where. 2. **Alerting follow-up issue** → filed as [#654](https://git.raddatz.cloud/marcel/familienarchiv/issues/654) with the proper labels (`devops` + `feature` + `phase-7: monitoring`), three concrete rules with `for:` durations and severity labels, PromQL that uses the metric names actually shipped here, a `promtool check rules` AC, and cross-references to ADR-023 and PR #653. This is a *usable* follow-up issue, not a placeholder. The traceability is intact: #652 (parent feature) → #653 (this PR) → #654 (the alert rules that consume #653's metrics) → ADR-023 (the architectural decision that produced both). ### Cycle-1 concerns — addressed 3. **NFR coverage on #652**: - **Observability** (primary NFR) — met, evidence in PR body. - **Performance** → filed as [#656](https://git.raddatz.cloud/marcel/familienarchiv/issues/656) with the suggested 5ms p95 budget I named, a request to commit the NFR to `docs/architecture/NFRs.md`, a reproducible load-test script AC, and a stretch goal of nightly CI benchmarking. The issue body explicitly says "calibrate from one baseline + one instrumented run before locking it in" — exactly the right way to introduce a perf NFR. Good. - **Privacy** — documented in `docs/OBSERVABILITY.md`: "**Never label these with PII or document content** — labels have unbounded cardinality risk and are visible to anyone with Grafana access." This is the operator-facing invariant I wanted recorded. - **Reliability** — ADR-023 captures the registry-collision concern under "Negative consequences." The fresh-registry test fixture ensures the invariant is exercised on every CI run. 4. **Glossary gap** → addressed. `docs/GLOSSARY.md` now lists four new terms: Illegible word, Models-ready gauge, Recognition model accuracy, Segmentation model accuracy. Each entry explains *what* the term means *and* which metric exposes it. The two accuracy entries explicitly call out that recognition vs segmentation are independent metrics — operators reading the dashboard will know why both exist. Good. ### New issues from the fix commits I read every doc change for the kind of doc-vs-code contradiction that creeps in during fix loops. Findings: - The OBSERVABILITY.md metrics table matches `metrics.py`'s collector definitions exactly (labels, types, units). - The ADR-023 claim "`prometheus-client==0.25.0` pinned explicitly" is verified in `requirements.txt`. - The L2 PUML arrow `Rel(prometheus, ocr, "Scrapes OCR + http_* metrics", "HTTP 8000 /metrics")` matches the actual scrape config in `prometheus.yml` (target `ocr:8000`, path `/metrics`). - The Glossary entry "Models-ready gauge — flipped from 0 to 1 once the FastAPI lifespan startup has finished loading the Kraken model and the spell-checker" matches `main.py` lifespan exactly (loads kraken_engine → loads spell_checker → sets `_models_ready = True` → sets `ocr_models_ready.set(1)`). - The Glossary entry for the alert "supervised signal for the `ocr_models_ready < 1 for 2m` alert" cross-references [#654](https://git.raddatz.cloud/marcel/familienarchiv/issues/654)'s Alert 1 exactly. No contradictions found. ### One small forward note The Definition of Done for [#652](https://git.raddatz.cloud/marcel/familienarchiv/issues/652) included AC1 ("Prometheus target flips DOWN → UP after image rebuild"). Strictly, that AC is closed only after the deployment-host verification runs. When you tick #652 as done, please reference the deploy log / `/targets` screenshot in the issue close-comment. Otherwise [#652](https://git.raddatz.cloud/marcel/familienarchiv/issues/652) closes with one paper AC still unverified. ### TL;DR The author treated cycle-1 feedback as a structured backlog, resolved the in-scope items in this PR, and filed three follow-up issues (#654, #655, #656) with proper labels and ACs for the out-of-scope items. The traceability from feature → PR → ADR → follow-ups is intact. The verification evidence is sufficient for what's locally verifiable, with explicit deferral for what isn't. This is what good requirements delivery looks like. Approved.
marcel added 2 commits 2026-05-21 17:23:59 +02:00
Sara's cycle-2 S1: the wall-clock assertion at < 0.05s could trip on a
slow CI runner under load even when the timer correctly excludes
spell-check. Sara's preferred structural fix (patch main.time.monotonic
with a deterministic sequence) proved awkward — the patched attribute is
the *global* time.monotonic which httpx and asyncio consume, exhausting
the sequence before the request reaches the engine loop.

Take the documented fallback: widen the bound to 0.09s and explain why.
The failure mode the test guards against (spell-check inside the timer)
would add 0.1s (2 × 0.05s sleep), so 0.09s catches the bug while leaving
~90ms of headroom for slow CI runners. Verified red→green by temporarily
moving correct_text inside the timer block: bound trips at 0.101s; the
fixed code reads ~0.001s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
docs(ocr): explain why two metrics tests skip fresh_metrics fixture
Some checks failed
CI / Backend Unit Tests (push) Successful in 3m42s
CI / fail2ban Regex (push) Successful in 43s
CI / Semgrep Security Scan (push) Successful in 19s
CI / Compose Bucket Idempotency (push) Successful in 1m0s
nightly / deploy-staging (push) Successful in 5m43s
CI / Unit & Component Tests (pull_request) Successful in 3m24s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Successful in 3m28s
CI / fail2ban Regex (pull_request) Successful in 43s
CI / Semgrep Security Scan (pull_request) Successful in 19s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
CI / Unit & Component Tests (push) Failing after 2m44s
CI / OCR Service Tests (push) Successful in 20s
0801da8df0
Sara's cycle-2 S2: clarify the latent (but not actual) cross-test state
risk on the two metrics tests that hit the global REGISTRY instead of
the per-test fresh_metrics fixture. Migrating them would actually break
them — the /metrics endpoint is served by prometheus-fastapi-instrumentator
which binds to the default REGISTRY at app-construction time, and the
http_requests_total assertion only finds counters on that global
registry. Both tests already assert response shape only (status code,
content-type substring, body substrings), not numeric values, so the
shared-registry caveat is documented for future readers rather than
treated as a bug to fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Author
Owner

🔁 Cycle 3 re-review — Sara Holt

Verdict: Approve.

Both cycle-2 concerns are resolved with technical rigor. Full test_metrics.py is green locally (22 passed in 0.56s).


S1 — Timing-flake risk in test_ocr_processing_seconds_histogram_excludes_spell_check_time_in_guided_stream (commit e0e1578b)

Bound widened from < 0.05s to < 0.09s. I checked the math against the failure mode:

  • Failure floor: correct_text accidentally re-entering the timer would observe 2 × 0.05s = 0.1s of spell-check sleep. The 0.09s ceiling sits below that floor, so the test still trips on the regression it's meant to catch.
  • CI headroom: engine work under mock is instantaneous (~0.001s per the commit message), so the test has ~89ms of slack — comfortably above typical CI scheduler jitter (~50ms) and well below the regression floor (100ms). Goldilocks zone.
  • Fallback rationale is legitimate: the commit message documents why the structural fix (patch("main.time.monotonic") with a deterministic sequence) didn't pan out — time.monotonic is global and httpx/asyncio consume it before the request reaches the engine loop. That's a real characteristic of time.monotonic patching in async stacks; I'm not going to ask anyone to wrestle a structural fix where the seam itself is shared with the transport layer.
  • Docstring is quantitative: "spell-check sleeps 0.05s × 2 regions = 0.1s … 0.09s ceiling catches that bug while leaving ~90ms of slack for slow CI runners". A future contributor reading this knows why 0.09 was chosen and what would invalidate that choice. That's exactly the rationale-comment standard from CODESTYLE.

This is the right pragmatic resolution.

S2 — Two tests on shared REGISTRY without fresh_metrics (commit 0801da8d)

Docstrings added to both test_metrics_endpoint_returns_200 and test_metrics_includes_http_request_metrics_after_ocr_call. I checked both explanations against the trap they're guarding:

  • Why fresh_metrics would break them: prometheus-fastapi-instrumentator binds the /metrics route to the default REGISTRY at app-construction time. Swapping main.metrics via the fixture would not redirect what /metrics actually exposes. The http_requests_total / http_request_duration_seconds counters live on the instrumentator's default REGISTRY — a fresh CollectorRegistry would never see them. Both points are stated explicitly in the docstrings.
  • Why shared-registry use is safe: both tests "only assert response shape (status code + content-type substring / substring presence in the exposition body), not numeric counter values, so cross-test state leakage cannot affect it." Stated verbatim in both docstrings.
  • Trap-prevention quality: the wording "not numeric counter values" is load-bearing. A future contributor adding assert "http_requests_total 5.0" in body would have to ignore that explicit sentence. That's about as good as a docstring can do.

Migrating these to fresh_metrics was the obvious-looking fix, but it would have introduced a defect (counter assertions silently passing against an empty registry). Documenting the design intent is the right call here.


Scope discipline

I checked the two fix commits did not regress anything else in scope of the original PR — they only touch docstrings + one bound constant. No collateral damage.

Approve. Ship it.

— Sara Holt

## 🔁 Cycle 3 re-review — Sara Holt **Verdict: Approve.** Both cycle-2 concerns are resolved with technical rigor. Full `test_metrics.py` is green locally (22 passed in 0.56s). --- ### S1 — Timing-flake risk in `test_ocr_processing_seconds_histogram_excludes_spell_check_time_in_guided_stream` (commit `e0e1578b`) Bound widened from `< 0.05s` to `< 0.09s`. I checked the math against the failure mode: - **Failure floor:** `correct_text` accidentally re-entering the timer would observe `2 × 0.05s = 0.1s` of spell-check sleep. The `0.09s` ceiling sits *below* that floor, so the test still trips on the regression it's meant to catch. - **CI headroom:** engine work under mock is instantaneous (~0.001s per the commit message), so the test has ~89ms of slack — comfortably above typical CI scheduler jitter (~50ms) and well below the regression floor (100ms). Goldilocks zone. - **Fallback rationale is legitimate:** the commit message documents why the structural fix (`patch("main.time.monotonic")` with a deterministic sequence) didn't pan out — `time.monotonic` is global and httpx/asyncio consume it before the request reaches the engine loop. That's a real characteristic of `time.monotonic` patching in async stacks; I'm not going to ask anyone to wrestle a structural fix where the seam itself is shared with the transport layer. - **Docstring is quantitative:** "spell-check sleeps 0.05s × 2 regions = 0.1s … 0.09s ceiling catches that bug while leaving ~90ms of slack for slow CI runners". A future contributor reading this knows *why* 0.09 was chosen and what would invalidate that choice. That's exactly the rationale-comment standard from CODESTYLE. This is the right pragmatic resolution. ### S2 — Two tests on shared REGISTRY without `fresh_metrics` (commit `0801da8d`) Docstrings added to both `test_metrics_endpoint_returns_200` and `test_metrics_includes_http_request_metrics_after_ocr_call`. I checked both explanations against the trap they're guarding: - **Why `fresh_metrics` would break them:** `prometheus-fastapi-instrumentator` binds the `/metrics` route to the default REGISTRY at app-construction time. Swapping `main.metrics` via the fixture would not redirect what `/metrics` actually exposes. The `http_requests_total` / `http_request_duration_seconds` counters live on the instrumentator's default REGISTRY — a fresh `CollectorRegistry` would never see them. Both points are stated explicitly in the docstrings. - **Why shared-registry use is safe:** both tests "only assert response shape (status code + content-type substring / substring presence in the exposition body), not numeric counter values, so cross-test state leakage cannot affect it." Stated verbatim in both docstrings. - **Trap-prevention quality:** the wording "**not numeric counter values**" is load-bearing. A future contributor adding `assert "http_requests_total 5.0" in body` would have to ignore that explicit sentence. That's about as good as a docstring can do. Migrating these to `fresh_metrics` was the obvious-looking fix, but it would have *introduced* a defect (counter assertions silently passing against an empty registry). Documenting the design intent is the right call here. --- ### Scope discipline I checked the two fix commits did not regress anything else in scope of the original PR — they only touch docstrings + one bound constant. No collateral damage. ✅ **Approve.** Ship it. — Sara Holt
marcel merged commit 0801da8df0 into main 2026-05-21 18:16:48 +02:00
marcel deleted branch feat/issue-652-ocr-metrics 2026-05-21 18:16:48 +02:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#653