As a developer I want the OCR service to expose a /metrics endpoint so Prometheus can scrape OCR throughput, error rate, and model quality #652


marcel · 2026-05-21T15:15:37+02:00

marcel commented

2026-05-21 15:15:37 +02:00

Context

prometheus.yml already declares a job_name: ocr-service scrape target at ocr:8000/metrics, but the endpoint does not exist — the target shows as DOWN in Prometheus. No code changes to the observability stack are needed; only the OCR service itself needs to be updated.

The OCR service is a FastAPI (Python) app. Adding metrics requires three changes:

One new dependency in requirements.txt
Two lines to mount the /metrics endpoint in main.py
Manual counters/gauges at the relevant call sites already in the code

User Story

As a developer, I want the OCR service to expose standard Prometheus metrics, so that I can monitor OCR throughput, page error rate, model quality, and training activity in Grafana without having to grep logs.

Acceptance Criteria

AC1 — `/metrics` endpoint exists and is scraped

Given the observability stack is running,
When Prometheus scrapes ocr:8000/metrics,
Then the ocr-service target shows as UP in Prometheus (currently DOWN).

Given a POST /ocr or POST /ocr/stream request is handled,
When /metrics is scraped,
Then http_requests_total and http_request_duration_seconds are present for those endpoints (auto-instrumented by prometheus-fastapi-instrumentator).

AC2 — OCR job counters

Given POST /ocr or POST /ocr/stream completes (success or error),
When /metrics is scraped,
Then ocr_jobs_total is incremented, labelled by:

engine: kraken or surya
script_type: HANDWRITING_KURRENT, HANDWRITING_LATIN, TYPEWRITER, UNKNOWN
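A minimal sketch of how AC2 could be satisfied, assuming the increment sits in a `finally` block so that both successful and failed jobs are counted. The `handle_job` wrapper and the `process` callable are hypothetical stand-ins for the real request handlers, and the private `CollectorRegistry` is used only to keep the sketch isolated (main.py would rely on the default registry).

```python
from prometheus_client import CollectorRegistry, Counter

# Private registry for isolation in this sketch only.
registry = CollectorRegistry()
ocr_jobs_total = Counter(
    "ocr_jobs_total", "OCR jobs completed", ["engine", "script_type"], registry=registry
)


def handle_job(process, engine_name: str, script_type: str):
    """Run one OCR job (hypothetical wrapper) and count it on success or error."""
    try:
        return process()
    finally:
        # The finally block runs whether process() returned or raised.
        ocr_jobs_total.labels(engine=engine_name, script_type=script_type).inc()


def failing_job():
    raise RuntimeError("engine crashed")


handle_job(lambda: "ok", "kraken", "TYPEWRITER")
try:
    handle_job(failing_job, "kraken", "TYPEWRITER")
except RuntimeError:
    pass  # the error still propagates; the job was counted anyway
```

Both label values come from application constants, which keeps the number of time series bounded.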

AC3 — Page-level counters

Given each page is processed inside the streaming generator,
When /metrics is scraped,
Then:

ocr_pages_total (labelled engine) is incremented per successfully processed page
ocr_skipped_pages_total is incremented when a page raises an exception (skipped_pages += 1 in both generate() and generate_guided())
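The page-level counting in AC3 might look roughly like the following. This is a sketch under assumptions: `extract_page` is a hypothetical stand-in for the real per-page engine call, the generator is simplified compared to `generate()` and `generate_guided()`, and a private registry is used only for isolation.

```python
import asyncio

from prometheus_client import CollectorRegistry, Counter

# Private registry for isolation in this sketch only.
registry = CollectorRegistry()
ocr_pages_total = Counter(
    "ocr_pages_total", "Pages processed successfully", ["engine"], registry=registry
)
ocr_skipped_pages_total = Counter(
    "ocr_skipped_pages_total", "Pages skipped after an exception", registry=registry
)


def extract_page(page: str) -> str:
    """Hypothetical stand-in for the per-page OCR call."""
    if page == "bad":
        raise ValueError("unreadable page")
    return page.upper()


async def generate(pages: list[str], engine_name: str):
    """Yield results page by page, counting successes and skips."""
    for page in pages:
        try:
            text = await asyncio.to_thread(extract_page, page)
        except Exception:
            # Would sit next to the existing skipped_pages += 1.
            ocr_skipped_pages_total.inc()
            continue
        ocr_pages_total.labels(engine=engine_name).inc()
        yield text


async def collect(pages: list[str], engine_name: str) -> list[str]:
    return [text async for text in generate(pages, engine_name)]


results = asyncio.run(collect(["a", "bad", "b"], "kraken"))
```

Because the counters are module-level objects, the closure can reach them without any extra plumbing.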

AC4 — Confidence / quality counters

Given apply_confidence_markers() processes a word list,
When /metrics is scraped,
Then:

ocr_words_total is incremented by the number of words in the block
ocr_illegible_words_total is incremented by the number of words below the confidence threshold (those replaced with [unleserlich])

This ratio (ocr_illegible_words_total / ocr_words_total) is the primary OCR quality signal.

AC5 — Processing duration

Given the OCR engine runs in asyncio.to_thread(engine.extract_blocks, ...),
When /metrics is scraped,
Then ocr_processing_seconds (Histogram, labelled engine) captures engine processing time: per document, measured around the thread call in run_ocr, and per page in run_ocr_stream.
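One way to time the awaited thread call is the histogram's context manager, sketched below. `extract_blocks` here is a hypothetical stand-in for `engine.extract_blocks`, and the private registry exists only to isolate the sketch.

```python
import asyncio
import time

from prometheus_client import CollectorRegistry, Histogram

# Private registry for isolation in this sketch only.
registry = CollectorRegistry()
ocr_processing_seconds = Histogram(
    "ocr_processing_seconds", "OCR engine processing time", ["engine"], registry=registry
)


def extract_blocks(pdf_bytes: bytes) -> list[str]:
    """Hypothetical stand-in for engine.extract_blocks."""
    time.sleep(0.01)
    return ["block"]


async def run_ocr(pdf_bytes: bytes, engine_name: str) -> list[str]:
    # The timer records wall-clock time between entering and leaving the
    # with-block, so it covers the awaited thread call.
    with ocr_processing_seconds.labels(engine=engine_name).time():
        return await asyncio.to_thread(extract_blocks, pdf_bytes)


blocks = asyncio.run(run_ocr(b"%PDF", "surya"))
```

In the streaming path the same `with` block would wrap each page's call, so one observation is recorded per page rather than per document.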

AC6 — Training metrics

Given POST /train or POST /train-sender completes,
When /metrics is scraped,
Then:

ocr_training_runs_total (labelled outcome: success / error) is incremented
ocr_model_accuracy (Gauge) is set to the accuracy value returned by _parse_best_checkpoint() after a successful run
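The training metrics in AC6 could be recorded along these lines. This is only a sketch: `run_training` and the `train` callable are hypothetical, and the accuracy is assumed to arrive as a plain float (in the service it would come from `_parse_best_checkpoint()`).

```python
from prometheus_client import CollectorRegistry, Counter, Gauge

# Private registry for isolation in this sketch only.
registry = CollectorRegistry()
ocr_training_runs_total = Counter(
    "ocr_training_runs_total", "Training runs by outcome", ["outcome"], registry=registry
)
ocr_model_accuracy = Gauge(
    "ocr_model_accuracy", "Accuracy of the last successful training run", registry=registry
)


def run_training(train) -> None:
    """Run a training callable (hypothetical) and record its outcome."""
    try:
        accuracy = train()
    except Exception:
        ocr_training_runs_total.labels(outcome="error").inc()
        raise
    ocr_training_runs_total.labels(outcome="success").inc()
    # The gauge is only touched after a successful run.
    ocr_model_accuracy.set(accuracy)


def failing_training() -> float:
    raise RuntimeError("training failed")


run_training(lambda: 0.97)
try:
    run_training(failing_training)
except RuntimeError:
    pass  # a failed run leaves the last known accuracy in place
```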

AC7 — Model ready gauge

Given the lifespan startup completes and _models_ready = True,
When /metrics is scraped,
Then ocr_models_ready (Gauge) is 1.0. Before startup completes it is 0.0.
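The readiness gauge can be set from the lifespan handler, as in the sketch below. `load_models` is a hypothetical stand-in for the real model loading, the `app` argument is unused here, and the private registry is only for isolation.

```python
import asyncio
from contextlib import asynccontextmanager

from prometheus_client import CollectorRegistry, Gauge

# Private registry for isolation in this sketch only.
registry = CollectorRegistry()
ocr_models_ready = Gauge(
    "ocr_models_ready", "1 once models are loaded, 0 before", registry=registry
)


def load_models() -> None:
    """Hypothetical stand-in for the real model loading."""


@asynccontextmanager
async def lifespan(app):
    # An unlabelled gauge starts at 0, which covers the "before startup" case.
    await asyncio.to_thread(load_models)
    ocr_models_ready.set(1)  # would sit next to _models_ready = True
    yield


async def check() -> tuple:
    before = registry.get_sample_value("ocr_models_ready")
    async with lifespan(None):
        during = registry.get_sample_value("ocr_models_ready")
    return before, during


before, during = asyncio.run(check())
```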

Implementation Notes

Dependency

Add to requirements.txt:

prometheus-fastapi-instrumentator==7.0.0

(prometheus-client is a transitive dependency — no separate entry needed.)

Mounting the endpoint

In main.py, directly after app = FastAPI(...):

from prometheus_fastapi_instrumentator import Instrumentator
Instrumentator().instrument(app).expose(app)

Custom metric declarations

Declare module-level in main.py (import Counter, Histogram, Gauge from prometheus_client):

ocr_jobs_total          = Counter("ocr_jobs_total", "...", ["engine", "script_type"])
ocr_pages_total         = Counter("ocr_pages_total", "...", ["engine"])
ocr_skipped_pages_total = Counter("ocr_skipped_pages_total", "...")
ocr_words_total         = Counter("ocr_words_total", "...")
ocr_illegible_words_total = Counter("ocr_illegible_words_total", "...")
ocr_processing_seconds  = Histogram("ocr_processing_seconds", "...", ["engine"])
ocr_training_runs_total = Counter("ocr_training_runs_total", "...", ["outcome"])
ocr_model_accuracy      = Gauge("ocr_model_accuracy", "...")
ocr_models_ready        = Gauge("ocr_models_ready", "...")

Increment these at the existing call sites — no structural changes to the request handling logic.

`apply_confidence_markers()` — word counting

confidence.py already iterates words; the counter increments can be added there or at the call site in main.py (caller has access to block["words"] before apply_confidence_markers strips them).
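A sketch of the call-site variant, assuming each entry in `block["words"]` is a dict with a `confidence` float. That shape, the `count_words` helper, and the threshold value are assumptions for illustration, not taken from the service code.

```python
from prometheus_client import CollectorRegistry, Counter

# Private registry for isolation in this sketch only.
registry = CollectorRegistry()
ocr_words_total = Counter(
    "ocr_words_total", "Words seen by confidence marking", registry=registry
)
ocr_illegible_words_total = Counter(
    "ocr_illegible_words_total", "Words below the confidence threshold", registry=registry
)

CONFIDENCE_THRESHOLD = 0.5  # hypothetical value


def count_words(block: dict, threshold: float = CONFIDENCE_THRESHOLD) -> None:
    """Record word totals for one block before the words are stripped."""
    words = block.get("words", [])
    illegible = sum(1 for word in words if word["confidence"] < threshold)
    ocr_words_total.inc(len(words))
    ocr_illegible_words_total.inc(illegible)


block = {
    "words": [
        {"text": "Lieber", "confidence": 0.93},
        {"text": "Vtr", "confidence": 0.21},
        {"text": "Gruss", "confidence": 0.88},
    ]
}
count_words(block)  # would be called just before apply_confidence_markers
```

Counting at the call site leaves `confidence.py` free of side effects, at the cost of repeating the threshold comparison outside it.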

Non-Functional Requirements

NFR-PERF-01: Scraping /metrics must not add measurable latency to OCR requests. prometheus_fastapi_instrumentator uses an async middleware, so this is expected to be safe.
NFR-OPS-01: No changes to prometheus.yml or docker-compose.observability.yml are needed — the scrape target is already configured.
NFR-TEST-01: An integration test, either in the existing test_main.py or in a new dedicated test, must verify that /metrics returns HTTP 200 and includes ocr_jobs_total after at least one OCR request.

Out of Scope

Grafana dashboard panels for OCR metrics — tracked in #651
Per-sender model accuracy tracking
Alerting rules based on OCR error rate


marcel added the P2-medium devops feature phase-7: monitoring labels 2026-05-21 15:15:41 +02:00

marcel referenced this issue

2026-05-21 15:17:12 +02:00

As a product owner I want a Grafana overview dashboard so I can check system health and archive progress at a weekly glance #651

marcel commented

2026-05-21 15:17:22 +02:00

Once this is merged, #651 (PO overview dashboard) can be extended with Row 4 — OCR Health using the metrics exposed here. The four panels planned for that row are:

| Panel | Query |
|-------|-------|
| OCR jobs this week | `sum(increase(ocr_jobs_total[$__range]))` |
| OCR page error rate | `sum(increase(ocr_skipped_pages_total[$__range])) / sum(increase(ocr_pages_total[$__range]))` |
| Illegible word rate | `sum(increase(ocr_illegible_words_total[$__range])) / sum(increase(ocr_words_total[$__range]))` |
| OCR service status | `ocr_models_ready` |


marcel commented

2026-05-21 15:19:21 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Observations

The instrumentation mount (Instrumentator().instrument(app).expose(app)) must run at module level: after app = FastAPI(...) and before the application starts. Looking at main.py, the app object is created at module level and the lifespan is passed to the constructor, so the two-liner belongs between line 73 (app = FastAPI(...)) and @app.get("/health"). Placing it elsewhere at module level would also work, but calling instrument() from inside the lifespan would not, because it adds middleware and Starlette refuses to add middleware once the application has started.
apply_confidence_markers() in confidence.py iterates words internally. The issue proposes counting ocr_words_total and ocr_illegible_words_total either inside confidence.py or at the call site in main.py where block["words"] is still present. The call-site approach in main.py is cleaner: the caller has access to block["words"] before apply_confidence_markers strips them, and confidence.py remains a pure function with no side effects. Recommendation: keep counters in main.py at the call site.
The two generator functions (generate() and generate_guided()) are closures inside run_ocr_stream. Module-level counters are accessible in closures — no scope issue. The ocr_skipped_pages_total increment should go in both except blocks (lines 259 and 205 of main.py). The issue already calls this out correctly.
Type hints: new Counter, Histogram, Gauge objects at module level should be typed (e.g., ocr_jobs_total: Counter) for IDE support and static analysis clarity.

Recommendations

Declare metric objects with type annotations at module level: ocr_jobs_total: Counter = Counter(...). This follows the existing _models_ready: bool and ALLOWED_PDF_HOSTS: set pattern.
The ocr_processing_seconds Histogram should wrap the asyncio.to_thread(...) call in run_ocr (non-streaming) and each page's asyncio.to_thread(...) call in run_ocr_stream. A context manager works here: with ocr_processing_seconds.labels(engine=engine_name).time(): around the awaited call measures wall-clock time across the await. Alternatively, capture start = time.monotonic() before the call and pass the elapsed time to .observe() afterwards.
Write the NFR-TEST-01 test first (red), then add instrumentation (green). The test should use AsyncClient(ASGITransport(app=app)) — consistent with existing test_main.py pattern — and call /ocr with a mocked PDF download and engine, then assert /metrics returns 200 and contains ocr_jobs_total.
Name the test descriptively: test_metrics_endpoint_returns_200_and_includes_ocr_jobs_total_after_ocr_request.

Open Decisions

Counter placement for ocr_words_total: call site in main.py vs. inside confidence.py. Call site is cleaner (pure function stays pure), but confidence.py is the only place that sees every word. The issue correctly identifies both options. Call site wins on principle — recommend it, but either works.


marcel commented

2026-05-21 15:19:37 +02:00

🏗️ Markus Keller — Application Architect

Observations

This is a self-contained change to the Python microservice with no Spring Boot side. The justification for the OCR service as a separate container already exists (different resource requirements, different deployment cadence). Adding observability to it is correct and well-scoped.
The issue correctly identifies that no changes to prometheus.yml or docker-compose.observability.yml are needed — the scrape target ocr:8000/metrics is already configured. This is the right starting point: fix the thing that's broken (the service not exposing /metrics), not the observer.
prometheus-fastapi-instrumentator==7.0.0 pins the version — good. prometheus-client is correctly identified as a transitive dependency. No separate requirements.txt entry needed.
The C4 container diagram (docs/architecture/c4/l2-containers.puml) should already show the ocr-service container. No new infrastructure component is being added — the /metrics endpoint is a new capability on an existing container, not a new service. This does not require a diagram update, which is correct.
One potential concern: Instrumentator().instrument(app).expose(app) auto-instruments all FastAPI routes. This will also instrument /train, /train-sender, and /segtrain with HTTP duration metrics. That's fine and useful, but worth knowing — training requests can take minutes and will appear as outliers in http_request_duration_seconds. This is expected behavior, not a bug.

Recommendations

The /metrics endpoint exposed by prometheus-fastapi-instrumentator defaults to unauthenticated access on the same port (8000). Since ocr:8000 is on the internal Docker network and not exposed via Caddy, this is acceptable. No auth needed on /metrics for internal scraping.
The ocr_models_ready Gauge (AC7) is a clean health signal for Prometheus alerting rules — more useful than /health for time-series. Good addition.
The ocr_processing_seconds Histogram (AC5) is the most valuable metric here — it answers "is OCR getting slower over time?" which grepping logs cannot. Make sure the label is engine only (not script_type) to keep cardinality low; the issue spec has this right.
No ADR is needed — this is instrumentation of an existing service, not an architectural decision with lasting structural consequences.

Open Decisions

None. This is a well-bounded instrumentation task with a clear implementation path.


marcel referenced this issue

2026-05-21 15:19:42 +02:00

As a product owner I want a Grafana overview dashboard so I can check system health and archive progress at a weekly glance #651

marcel commented

2026-05-21 15:19:51 +02:00

🔒 Nora "NullX" Steiner — Application Security Engineer

Observations

Unauthenticated /metrics endpoint: prometheus-fastapi-instrumentator exposes /metrics without authentication by default. On this stack, ocr:8000 is on the internal Docker network and is not routed through Caddy to the public internet. That makes unauthenticated access acceptable — only containers on the familienarchiv_default network can reach it. No action needed, but this should be confirmed before any future port exposure.
No new attack surface from counters: the custom Counter, Histogram, and Gauge metrics are write-only from application code. They expose aggregate numeric data (counts, durations, accuracy scores) — no user input, no PII, no document content. The /metrics output is safe to expose internally.
ocr_model_accuracy Gauge (AC6): this value comes from _parse_best_checkpoint(), which parses filenames from a training temp directory. The parsed float is bounded by the regex [0-9.]+ and cast to float(). No injection vector here.
Training token: the existing TRAINING_TOKEN check on /train and /train-sender is unchanged. The new metrics will show ocr_training_runs_total but reveal only outcome counts, not training data or model weights. No security regression.
Label cardinality: the engine label takes values kraken or surya; script_type takes one of four known values; outcome takes success or error. All are application-controlled constants, not user-supplied strings. No label injection risk.

Recommendations

Confirm via a Compose networks: audit that ocr:8000 has no ports: mapping in the production Compose file. If the port is ever mapped to the host, protect the /metrics endpoint with authentication, for example at the reverse proxy. Note that Instrumentator(..., should_respect_env_var=True) only switches instrumentation on or off through an environment variable; it does not add authentication.
The /metrics endpoint should be excluded from the access log to avoid noise. FastAPI/Uvicorn logs every request by default — prometheus_fastapi_instrumentator does not suppress this. Consider a log filter in main.py: logging.getLogger("uvicorn.access").addFilter(MetricsPathFilter()).
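A possible shape for the `MetricsPathFilter` mentioned above, using only the standard library. It assumes Uvicorn's access log message contains the request line in quotes (method, path, HTTP version); the `make_record` helper is hypothetical and exists only to exercise the filter.

```python
import logging


class MetricsPathFilter(logging.Filter):
    """Drop access-log records for GET requests to the metrics path."""

    def __init__(self, path: str = "/metrics") -> None:
        super().__init__()
        # Assumed message shape: '<client> - "GET /metrics HTTP/1.1" 200'
        self._needle = f'"GET {path} '

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False suppresses the record.
        return self._needle not in record.getMessage()


logging.getLogger("uvicorn.access").addFilter(MetricsPathFilter())


def make_record(path: str) -> logging.LogRecord:
    """Build a record resembling an access-log entry (hypothetical helper)."""
    return logging.LogRecord(
        name="uvicorn.access",
        level=logging.INFO,
        pathname="example.py",
        lineno=0,
        msg='%s - "%s %s HTTP/%s" %d',
        args=("127.0.0.1:5000", "GET", path, "1.1", 200),
        exc_info=None,
    )


metrics_filter = MetricsPathFilter()
kept = metrics_filter.filter(make_record("/ocr"))
dropped = not metrics_filter.filter(make_record("/metrics"))
```

Matching on the formatted message rather than on `record.args` keeps the filter independent of how the record's arguments are laid out.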

Open Decisions

None. The threat model for this change is low-risk given the internal-network-only exposure.


marcel commented

2026-05-21 15:20:09 +02:00

🧪 Sara Holt — QA Engineer & Test Strategist

Observations

NFR-TEST-01 is under-specified: "return HTTP 200 and include ocr_jobs_total after at least one OCR request" is a start, but the test strategy should cover more than the happy path. See recommendations below.
The existing test_main.py uses AsyncClient(ASGITransport(app=app)) with patch() for heavy dependencies — this is the right pattern and should be followed for the new metrics tests.
Missing counter tests: the ACs specify 7+ distinct metric increments across multiple code paths (run_ocr, run_ocr_stream, generate_guided, apply_confidence_markers call sites, /train, lifespan). Each one needs at least one test that proves the increment actually fires.
Counter isolation between tests: Prometheus counters are module-level global singletons in prometheus_client. In Python tests, counters accumulate across test runs in the same process. This is a flakiness risk: if test_A increments ocr_jobs_total and test_B asserts on its value, the assertion depends on run order. Mitigation: use REGISTRY.unregister() + re-register in a pytest.fixture with autouse=True, or assert on _value.get() relative increments rather than absolute values.
AC7 (ocr_models_ready): the lifespan test pattern in test_main.py already patches kraken_engine.load_models and load_spell_checker. The new test should assert ocr_models_ready._value.get() == 1.0 after the lifespan context completes.

Recommendations

Write one test per AC, not one test for all of them. Suggested test names:
- test_metrics_endpoint_returns_200
- test_ocr_jobs_total_incremented_after_successful_ocr
- test_ocr_jobs_total_incremented_with_correct_engine_and_script_type_labels
- test_ocr_pages_total_incremented_per_page_in_stream
- test_ocr_skipped_pages_total_incremented_on_page_exception
- test_ocr_words_and_illegible_words_total_reflect_confidence_markers
- test_ocr_processing_seconds_histogram_observed_after_ocr
- test_ocr_training_runs_total_incremented_on_success_and_error
- test_ocr_model_accuracy_gauge_set_after_successful_training
- test_ocr_models_ready_gauge_is_1_after_startup
Use a pytest.fixture that calls prometheus_client.REGISTRY.unregister(counter) before each test that asserts counter values, then re-registers — or use the prometheus_client.CollectorRegistry() pattern to create test-scoped registries and pass them to Counter(...).
Add the metrics test file as test_metrics.py (separate from test_main.py) to keep test files focused.
NFR-PERF-01 ("scraping must not add measurable latency") is untestable at the unit level. Accept the library's claim and document it. No test needed.

Open Decisions

Counter isolation strategy: test-scoped registry vs. relative-increment assertions. A test-scoped registry is cleaner but requires passing registry= to every Counter(...) declaration — which means the module-level declarations in main.py would need to accept an injectable registry. Relative-increment assertions are simpler but more fragile. Given this is a solo project, relative-increment assertions are pragmatic enough for now.


marcel referenced this issue

2026-05-21 15:20:24 +02:00

As a product owner I want a Grafana overview dashboard so I can check system health and archive progress at a weekly glance #651

marcel commented

2026-05-21 15:20:26 +02:00

🚀 Tobias Wendt — DevOps & Platform Engineer

Observations

The prometheus.yml scrape target job_name: ocr-service at ocr:8000/metrics is already configured and showing as DOWN. This issue closes that gap without any infra changes — ideal.
prometheus-fastapi-instrumentator==7.0.0 is pinned with an exact version — correct. prometheus-client as a transitive dep is fine; Renovate will pick up both.
The OCR service Dockerfile: adding prometheus-fastapi-instrumentator to requirements.txt means the Docker image must be rebuilt. The prometheus-client wheel is small (~100 KB), but prometheus-fastapi-instrumentator itself may pull additional deps. Check pip show prometheus-fastapi-instrumentator for transitive deps before the PR — the OCR image is already large (~5 GB for Surya/Torch) so this addition is negligible.
Health check interaction: prometheus-fastapi-instrumentator auto-instruments /health too. Requests from the existing Docker Compose healthcheck (test: ["CMD", "curl", "-f", "http://localhost:8000/health"]) will show up in http_requests_total{handler="/health"}. This is noise in the metrics but harmless. Consider excluding /health from instrumentation: Instrumentator(excluded_handlers=["/health", "/metrics"]).
ocr_models_ready Gauge: once the service is healthy (models loaded), this Gauge = 1.0. If the service restarts and hasn't finished loading yet, it will briefly be 0.0. This is a useful alerting signal: ocr_models_ready < 1 for > 90s → alert. Pairs well with the Prometheus scrape target status already shown in issue comment.
No changes needed in docker-compose.observability.yml: confirmed by the issue. The ocr-service job in prometheus.yml is already there — just waiting for the endpoint to exist.

Recommendations

After the PR is merged and the image is rebuilt, verify the Prometheus target flips from DOWN to UP in the Prometheus UI (http://localhost:9090/targets) before closing the issue.
Add excluded_handlers=["/health", "/metrics"] to Instrumentator() to keep http_requests_total clean — /health and /metrics are infrastructure noise, not application traffic.
The ocr_models_ready Gauge should be the basis for a Prometheus alerting rule in a future issue: alert: OcrServiceNotReady, expr: ocr_models_ready < 1, for: 2m. Not in scope here, but flag it.
prometheus-fastapi-instrumentator version 7.0.0 should be added to the Renovate config (or verified it's already covered by the requirements.txt pattern) so version bumps create PRs automatically.

Open Decisions

None. The infra side of this issue is already done — the only remaining work is in the Python service.

marcel commented

2026-05-21 15:20:45 +02:00

📋 Elicit — Requirements Engineer

Observations

The issue is well-structured: user story, 7 ACs with Given/When/Then, implementation notes, and NFRs. This is above-average spec quality for a developer-facing feature.
AC2 label completeness: ocr_jobs_total is labelled by engine and script_type. In main.py, the engine is determined by use_kraken (bool) and mapped to kraken_engine or surya_engine. The engine label value ("kraken" or "surya") needs to be derived from this — the issue doesn't specify exactly which string constant to use. Recommend agreeing on "kraken" and "surya" as the canonical label values (lowercase, matching engine module names).
AC5 scope ambiguity: "per-document processing time in run_ocr" and "per-page in run_ocr_stream" are different granularities. The issue says to measure "around the thread call in run_ocr and per-page in run_ocr_stream". This is correct, but the two measurement points produce incomparable data: one measures full-document duration, the other per-page. The Grafana query for "average OCR duration" needs to know which one to use. Clarification: the per-page histogram in run_ocr_stream is more useful for operational insight. The run_ocr measurement is a bonus.
AC6 training metrics for /segtrain: the issue lists AC6 only for POST /train and POST /train-sender. Looking at main.py, there is a third training endpoint: /segtrain (segmentation model training). Should ocr_training_runs_total also be incremented for /segtrain? The issue is silent on this.
NFR-TEST-01 minimal: "HTTP 200 and includes ocr_jobs_total after at least one OCR request" covers only AC1. ACs 2–7 have no corresponding NFR-TEST entries. This is acceptable for an MVP but leaves significant metric behavior untested.
The ocr_models_ready Gauge (AC7) default value: before startup completes it should be 0.0. Unlabelled counters and gauges in prometheus_client are exported with a value of 0 as soon as they are declared, so ocr_models_ready will naturally be 0.0 until explicitly set to 1.0 in the lifespan. (Labelled metrics behave differently: no series is exported until a label combination is first used.) This should be called out in the implementation note.
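The default-value behaviour can be checked in isolation. The sketch below uses a private registry so it does not interfere with anything else; the labelled gauge and its kind label are included only for contrast and are an assumption for illustration.

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()

# Unlabelled gauge: a sample with value 0.0 exists from the moment of declaration.
ocr_models_ready = Gauge(
    "ocr_models_ready", "1 once the OCR models are loaded", registry=registry
)
value_before_startup = registry.get_sample_value("ocr_models_ready")

ocr_models_ready.set(1)
value_after_startup = registry.get_sample_value("ocr_models_ready")

# Labelled gauge (illustrative): nothing is exported until .labels(...) is used.
ocr_model_accuracy = Gauge(
    "ocr_model_accuracy", "Accuracy of the last trained model", ["kind"], registry=registry
)
accuracy_before_first_use = registry.get_sample_value(
    "ocr_model_accuracy", {"kind": "recognition"}
)
```

So no explicit set(0) is needed for ocr_models_ready, while a labelled metric shows up in /metrics only after its first use.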

Recommendations

Add one line to the implementation notes: "The ocr_models_ready Gauge initializes to 0.0 by default in prometheus_client; no explicit initialization to 0 is needed — only the set(1.0) call after startup completes."
Resolve the /segtrain gap: either add it to AC6 explicitly, or add a note "out of scope — tracked in a follow-up." Silence here will cause the implementer to make an undocumented decision.
Clarify the canonical engine label strings: "kraken" and "surya" (recommended), consistent with the Python module names in engines/.
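One way to pin the canonical label strings is a small helper next to the counter. A sketch under assumptions: engine_label and record_job are hypothetical names, and use_kraken is taken to be the bool described above.

```python
from prometheus_client import CollectorRegistry, Counter

# Canonical label values, lowercase to match the engine module names.
ENGINE_KRAKEN = "kraken"
ENGINE_SURYA = "surya"


def engine_label(use_kraken: bool) -> str:
    """Map the use_kraken flag to the engine label value (hypothetical helper)."""
    return ENGINE_KRAKEN if use_kraken else ENGINE_SURYA


registry = CollectorRegistry()  # private registry so the sketch is re-runnable
ocr_jobs_total = Counter(
    "ocr_jobs_total",
    "OCR jobs handled",
    ["engine", "script_type"],
    registry=registry,
)


def record_job(use_kraken: bool, script_type: str) -> None:
    ocr_jobs_total.labels(engine=engine_label(use_kraken), script_type=script_type).inc()


record_job(True, "HANDWRITING_KURRENT")
record_job(False, "TYPEWRITER")
```

Keeping the strings in constants means a Grafana query such as ocr_jobs_total{engine="surya"} cannot silently break because of a casing change at one call site.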

Open Decisions

/segtrain coverage in ocr_training_runs_total: include or exclude? Segmentation training is a separate model (blla), a different endpoint, and less frequently used. Including it adds completeness; excluding it keeps AC6 simpler. Either is valid — but the implementer needs an explicit answer to avoid a silent decision.

marcel commented

2026-05-21 15:20:52 +02:00

🎨 Leonie Voss — UX Designer & Accessibility Strategist

Observations

This is a pure developer-facing observability feature — no user-facing UI is introduced or modified. The /metrics endpoint, Prometheus counters, and Grafana panels are consumed by developers and operators, not by Familienarchiv end users (the 60+ Kurrent transcribers or the millennial readers). This issue has no UX or accessibility implications.

What I checked:

No new SvelteKit routes or components are added.
No new form fields, buttons, or interactive elements are introduced.
The Grafana dashboard panels referenced in the issue comment (#651 Row 4) are out of scope for this issue.
The /metrics endpoint response is plain text (Prometheus exposition format) — not rendered in any user-facing context.

No concerns from my angle. When #651 (Grafana dashboard) is implemented, I'll review the panel layout and information hierarchy at that point.

marcel referenced this issue

2026-05-21 15:21:04 +02:00

As a product owner I want a Grafana overview dashboard so I can check system health and archive progress at a weekly glance #651

marcel commented

2026-05-21 15:21:09 +02:00

🗳️ Decision Queue — Action Required

3 decisions need your input before implementation starts.

Implementation

Counter placement for ocr_words_total / ocr_illegible_words_total — call site in main.py (before apply_confidence_markers strips block["words"]) vs. inside confidence.py (where every word is always visible). Call site keeps confidence.py a pure function with no side effects; inside confidence.py ensures the counts are never missed if the function is called from a future code path. Recommended: call site in main.py. (Raised by: Felix)
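The call-site option could look roughly like this. Everything about the data shape is an assumption: the word dicts, the confidence field, and the ILLEGIBLE_BELOW threshold are hypothetical, and apply_confidence_markers is a stub that only mimics the stripping of block["words"].

```python
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()  # private registry so the sketch is re-runnable
ocr_words_total = Counter("ocr_words_total", "Words recognised", registry=registry)
ocr_illegible_words_total = Counter(
    "ocr_illegible_words_total", "Words below the confidence threshold", registry=registry
)

ILLEGIBLE_BELOW = 0.5  # hypothetical threshold, not taken from the service


def apply_confidence_markers(blocks):
    """Stub for the real function: stays pure and drops the per-word data."""
    return [{k: v for k, v in block.items() if k != "words"} for block in blocks]


def count_words(blocks) -> None:
    """Call-site helper: must run before the words are stripped."""
    for block in blocks:
        words = block.get("words", [])
        ocr_words_total.inc(len(words))
        illegible = sum(1 for word in words if word["confidence"] < ILLEGIBLE_BELOW)
        ocr_illegible_words_total.inc(illegible)


blocks = [
    {
        "text": "Liebe Mutter",
        "words": [
            {"text": "Liebe", "confidence": 0.93},
            {"text": "Mutter", "confidence": 0.31},
        ],
    }
]

count_words(blocks)                        # counters see every word
marked = apply_confidence_markers(blocks)  # per-word data is gone after this
```

The trade-off named above is visible here: the counting function has to be called at every site that invokes apply_confidence_markers, and the order of the two calls matters.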

Requirements

/segtrain coverage in ocr_training_runs_total (AC6) — AC6 specifies POST /train and POST /train-sender, but main.py also has POST /segtrain (segmentation model fine-tuning). Should ocr_training_runs_total be incremented there too? Including it adds completeness; excluding it keeps the first implementation simpler. Either choice is valid — but silence leaves the implementer guessing. (Raised by: Elicit)

Testing

Counter isolation strategy in test_metrics.py — Prometheus Counter/Gauge objects are module-level singletons; values accumulate across tests in the same process. Two approaches: (A) inject a test-scoped CollectorRegistry() into every metric declaration (cleaner, requires registry= param on each Counter(...) in main.py), or (B) assert on relative increments (value_after - value_before) rather than absolute values (simpler, slightly more fragile). Given solo project constraints, option B is pragmatic. (Raised by: Sara)

marcel referenced this issue

2026-05-21 15:21:22 +02:00

As a product owner I want a Grafana overview dashboard so I can check system health and archive progress at a weekly glance #651

marcel referenced this issue

2026-05-21 15:22:01 +02:00

As a product owner I want a Grafana overview dashboard so I can check system health and archive progress at a weekly glance #651

marcel commented

2026-05-21 15:45:42 +02:00

The call site is main.py. We should track training runs separately: segtrain versus real OCR training. We will go with option A (registry injection).
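One way to keep the two kinds of training runs apart is a label on a single counter rather than two metrics. The sketch below assumes label names kind and outcome with values "success" and "error"; record_training_run is a hypothetical wrapper, not code from the service.

```python
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()  # private registry so the sketch is re-runnable
ocr_training_runs_total = Counter(
    "ocr_training_runs_total",
    "Training runs by kind and outcome",
    ["kind", "outcome"],
    registry=registry,
)


def record_training_run(kind: str, train):
    """Run a training callable and count it as success or error for this kind."""
    try:
        result = train()
    except Exception:
        ocr_training_runs_total.labels(kind=kind, outcome="error").inc()
        raise
    ocr_training_runs_total.labels(kind=kind, outcome="success").inc()
    return result


def failing_segtrain():
    raise RuntimeError("simulated segmentation training failure")


accuracy = record_training_run("recognition", lambda: 0.93)

try:
    record_training_run("segmentation", failing_segtrain)
except RuntimeError:
    pass  # the error is still counted before it propagates
```

A dashboard can then sum over kind for a combined view or filter on kind="segmentation" when only segtrain is of interest.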

marcel referenced this issue from a commit

2026-05-21 16:17:26 +02:00

test(ocr): assert ocr_jobs_total label is engine=surya for typewriter

marcel referenced this issue from a commit

2026-05-21 16:17:26 +02:00

feat(ocr): increment ocr_pages_total per successful page in stream

marcel referenced this issue from a commit

2026-05-21 16:17:26 +02:00

feat(ocr): increment ocr_skipped_pages_total on per-page engine failure

marcel referenced this issue from a commit

2026-05-21 16:17:26 +02:00

feat(ocr): count words and illegible words at the OCR call sites

marcel referenced this issue from a commit

2026-05-21 16:17:26 +02:00

feat(ocr): observe ocr_processing_seconds around engine.to_thread calls

marcel referenced this issue from a commit

2026-05-21 16:17:26 +02:00

feat(ocr): record training runs in ocr_training_runs_total per kind and outcome

marcel referenced this issue from a commit

2026-05-21 16:17:26 +02:00

test(ocr): assert ocr_model_accuracy gauge is set per kind on success

marcel referenced this issue from a commit

2026-05-21 16:17:26 +02:00

feat(ocr): flip ocr_models_ready to 1 once the lifespan startup finishes

marcel referenced this issue from a commit

2026-05-21 16:17:26 +02:00

feat(ocr): suppress uvicorn access logs for /metrics and /health

marcel referenced this issue from a commit

2026-05-21 16:17:26 +02:00

ops(observability): drop TODO from ocr-service scrape job in prometheus.yml

marcel referenced a pull request that will close this issue

2026-05-21 16:17:47 +02:00

feat(ocr): expose Prometheus /metrics endpoint with OCR-domain counters #653

marcel commented

2026-05-21 16:18:09 +02:00

✅ Implementation complete — PR #653

Implemented via TDD on feat/issue-652-ocr-metrics. PR: https://git.raddatz.cloud/marcel/familienarchiv/pulls/653

Commits (red/green/refactor, atomic)

#	Commit	AC
1	`feat(ocr): expose /metrics endpoint via prometheus-fastapi-instrumentator` (`18c93d4e`)	AC1
2	`test(ocr): assert http_* metrics appear after an /ocr request` (`4bb6685e`)	AC1
3	`feat(ocr): add metrics.py factory with test-scoped CollectorRegistry support` (`f3e3545d`)	decision #3
4	`feat(ocr): increment ocr_jobs_total with engine and script_type labels` (`696b71da`)	AC2
5	`test(ocr): assert ocr_jobs_total label is engine=surya for typewriter` (`52d8dc2b`)	AC2
6	`feat(ocr): increment ocr_pages_total per successful page in stream` (`79edb945`)	AC3a
7	`feat(ocr): increment ocr_skipped_pages_total on per-page engine failure` (`3fa3460d`)	AC3b
8	`feat(ocr): count words and illegible words at the OCR call sites` (`131ed336`)	AC4
9	`feat(ocr): observe ocr_processing_seconds around engine.to_thread calls` (`2e3744d9`)	AC5
10	`feat(ocr): record training runs in ocr_training_runs_total per kind and outcome` (`6c2b9af1`)	AC6
11	`test(ocr): assert ocr_model_accuracy gauge is set per kind on success` (`77d59c5d`)	AC6
12	`feat(ocr): flip ocr_models_ready to 1 once the lifespan startup finishes` (`d6abf990`)	AC7
13	`feat(ocr): suppress uvicorn access logs for /metrics and /health` (`525f091b`)	Nora
14	`ops(observability): drop TODO from ocr-service scrape job in prometheus.yml` (`e75ac8ec`)	—
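For readers wondering what commit 13 amounts to, an access-log filter of this kind can be built with the standard logging module alone. This is a sketch of the general technique, not the code from the commit: QuietPathsFilter is a hypothetical name, and it relies on uvicorn passing (client address, method, path, HTTP version, status code) as the record args on the "uvicorn.access" logger.

```python
import logging


class QuietPathsFilter(logging.Filter):
    """Drop access-log records for paths that are pure infrastructure traffic."""

    def __init__(self, paths=("/metrics", "/health")):
        super().__init__()
        self.paths = frozenset(paths)

    def filter(self, record: logging.LogRecord) -> bool:
        args = record.args
        # Access records carry the request path as the third positional arg.
        if isinstance(args, tuple) and len(args) >= 3:
            path = str(args[2]).split("?", 1)[0]  # ignore any query string
            return path not in self.paths
        return True  # leave anything that does not look like an access record


logging.getLogger("uvicorn.access").addFilter(QuietPathsFilter())


def make_access_record(path: str) -> logging.LogRecord:
    """Build a record shaped like an access-log entry, for demonstration only."""
    return logging.LogRecord(
        "uvicorn.access", logging.INFO, __name__, 0,
        '%s - "%s %s HTTP/%s" %d',
        ("127.0.0.1:54321", "GET", path, "1.1", 200),
        None,
    )


log_filter = QuietPathsFilter()
metrics_logged = log_filter.filter(make_access_record("/metrics"))
ocr_logged = log_filter.filter(make_access_record("/ocr"))
```

With a 15-second scrape interval and a container healthcheck, this keeps two of the noisiest request sources out of the service logs while leaving application requests untouched.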

Decisions implemented as resolved

Word counters in main.py — increment at the call sites in both /ocr and /ocr/stream before apply_confidence_markers strips block["words"]; confidence.py stays a pure function.
/segtrain tracked separately via label-based separation: ocr_training_runs_total{kind, outcome} and ocr_model_accuracy{kind} with kind=recognition (/train + /train-sender) vs kind=segmentation (/segtrain).
Option A — test-scoped CollectorRegistry() via new metrics.py factory + fresh_metrics pytest fixture using monkeypatch.setattr("main.metrics", build_metrics(CollectorRegistry())).
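The factory approach from decision 3 can be illustrated as follows. Only the names build_metrics, CollectorRegistry, and the metric names come from this thread; the return type (a SimpleNamespace) and the set of metrics shown are assumptions, and the real metrics.py may be structured differently.

```python
from types import SimpleNamespace

from prometheus_client import REGISTRY, CollectorRegistry, Counter, Gauge


def build_metrics(registry: CollectorRegistry = REGISTRY) -> SimpleNamespace:
    """Declare the OCR metrics against the given registry (sketch only)."""
    return SimpleNamespace(
        registry=registry,
        ocr_jobs_total=Counter(
            "ocr_jobs_total",
            "OCR jobs handled",
            ["engine", "script_type"],
            registry=registry,
        ),
        ocr_models_ready=Gauge(
            "ocr_models_ready", "1 once the OCR models are loaded", registry=registry
        ),
    )


# Each test gets its own registry, so values never leak between tests.
first = build_metrics(CollectorRegistry())
second = build_metrics(CollectorRegistry())

labels = {"engine": "surya", "script_type": "TYPEWRITER"}
first.ocr_jobs_total.labels(**labels).inc()

first_value = first.registry.get_sample_value("ocr_jobs_total", labels)
second_value = second.registry.get_sample_value("ocr_jobs_total", labels)
```

In the test suite, the fresh_metrics fixture would swap the module attribute with monkeypatch.setattr("main.metrics", build_metrics(CollectorRegistry())), so production code keeps using the default registry while each test asserts on absolute values.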

Test results

16 new tests in ocr-service/test_metrics.py, all green
43 of 44 pass across test_metrics.py + test_main.py + test_training_auth.py + test_confidence.py. The one failure (test_startup_logs_warning_when_running_as_root) was already failing on main and is unrelated — ASGITransport does not auto-run the lifespan, so its caplog assertion never sees the warning. (My new ocr_models_ready lifespan test drives the lifespan via app.router.lifespan_context(app) to avoid the same trap.)

Next steps

Merge PR #653 → image rebuild → verify Prometheus target flips DOWN → UP at http://localhost:9090/targets.
Then #651 can pick up the ocr_* metric names for the Row 4 panels.

marcel referenced this issue

2026-05-21 16:45:37 +02:00

feat(ocr): expose Prometheus /metrics endpoint with OCR-domain counters #653

marcel referenced this issue

2026-05-21 16:47:00 +02:00

feat(ocr): expose Prometheus /metrics endpoint with OCR-domain counters #653

marcel referenced this issue from a commit

2026-05-21 17:07:16 +02:00

docs(observability): document ocr metrics, scrape edge, and access-log filter

marcel referenced this issue

2026-05-21 17:10:42 +02:00

Add Prometheus alerting rules for OCR service #654