ci(observability): inject GRAFANA_DB_PASSWORD from Gitea secrets

Wires the new GRAFANA_DB_PASSWORD secret through the deploy pipeline:

- docker-compose.prod.yml: backend env now passes GRAFANA_DB_PASSWORD
  through so Flyway V68 can resolve the ${grafanaDbPassword} placeholder in
  production and staging (it already worked in local dev via
  docker-compose.yml).
- release.yml + nightly.yml: declare GRAFANA_DB_PASSWORD as a required Gitea
  secret, write it into .env.production / .env.staging (consumed by
  archive-backend), and into /opt/familienarchiv/obs-secrets.env (consumed by
  obs-grafana's PostgreSQL datasource).

Operator action before the next deploy: add a GRAFANA_DB_PASSWORD value to
the Gitea repo secrets (openssl rand -hex 32).

Refs #651.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
docs(architecture): show Grafana→PostgreSQL link for PO Overview dashboard
2026-05-21 20:21:27 +02:00 · 2026-05-21 20:21:05 +02:00
22 changed files with 1839 additions and 14 deletions
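
The operator action named in the commit message (generating a value for the GRAFANA_DB_PASSWORD secret) can be sketched without openssl. This is a minimal illustration, assuming only Python's standard-library `secrets` module; the function name `generate_db_password` is hypothetical and not part of this repository. It yields the same shape of value as `openssl rand -hex 32`: 32 random bytes, hex-encoded to 64 characters.

```python
import secrets


def generate_db_password(num_bytes: int = 32) -> str:
    """Return a hex-encoded random secret (hypothetical helper, not in the repo).

    32 bytes gives a 64-character lowercase hex string, the same shape as
    the output of `openssl rand -hex 32`.
    """
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Print one candidate value to paste into the Gitea repo secrets.
    print(generate_db_password())
```

A hex-only value also avoids quoting problems when the secret is written into the env files and interpolated into the `${grafanaDbPassword}` placeholder, since it contains no quotes, spaces, or `$` characters.
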
--- a/.env.example
+++ b/.env.example
@@ -39,6 +39,12 @@ PORT_PROMETHEUS=9090
 # Grafana admin password — change this before exposing Grafana beyond localhost
 GRAFANA_ADMIN_PASSWORD=changeme

+# Password for the read-only grafana_reader PostgreSQL role used by the PO
+# Overview dashboard. Consumed by Flyway V68 (to set the role's password) and
+# by Grafana's PostgreSQL datasource (to connect). REQUIRED in production —
+# generate with: openssl rand -hex 32
+GRAFANA_DB_PASSWORD=changeme-generate-with-openssl-rand-hex-32
+
 # GlitchTip domain — production: use https://glitchtip.archiv.raddatz.cloud (must match Caddy vhost)
 GLITCHTIP_DOMAIN=http://localhost:3002

--- a/.gitea/workflows/nightly.yml
+++ b/.gitea/workflows/nightly.yml
@@ -31,6 +31,7 @@ name: nightly
 #   STAGING_APP_ADMIN_USERNAME
 #   STAGING_APP_ADMIN_PASSWORD
 #   GRAFANA_ADMIN_PASSWORD
+#   GRAFANA_DB_PASSWORD           (read-only grafana_reader DB role, issue #651)
 #   GLITCHTIP_SECRET_KEY
 #   SENTRY_DSN                  (set after GlitchTip first-run; empty = Sentry disabled)

@@ -80,6 +81,7 @@ jobs:
          POSTGRES_USER=archiv
          SENTRY_DSN=${{ secrets.SENTRY_DSN }}
          VITE_SENTRY_DSN=${{ secrets.VITE_SENTRY_DSN }}
+          GRAFANA_DB_PASSWORD=${{ secrets.GRAFANA_DB_PASSWORD }}
          EOF

      - name: Verify backend /import:ro mount is wired
@@ -143,6 +145,7 @@ jobs:
          cp docker-compose.observability.yml /opt/familienarchiv/
          cat > /opt/familienarchiv/obs-secrets.env <<'EOF'
          GRAFANA_ADMIN_PASSWORD=${{ secrets.GRAFANA_ADMIN_PASSWORD }}
+          GRAFANA_DB_PASSWORD=${{ secrets.GRAFANA_DB_PASSWORD }}
          GLITCHTIP_SECRET_KEY=${{ secrets.GLITCHTIP_SECRET_KEY }}
          POSTGRES_PASSWORD=${{ secrets.STAGING_POSTGRES_PASSWORD }}
          POSTGRES_HOST=archiv-staging-db-1
--- a/.gitea/workflows/release.yml
+++ b/.gitea/workflows/release.yml
@@ -35,6 +35,7 @@ name: release
 #   MAIL_USERNAME
 #   MAIL_PASSWORD
 #   GRAFANA_ADMIN_PASSWORD
+#   GRAFANA_DB_PASSWORD           (read-only grafana_reader DB role, issue #651)
 #   GLITCHTIP_SECRET_KEY
 #   SENTRY_DSN                    (set after GlitchTip first-run; empty = Sentry disabled)

@@ -77,6 +78,7 @@ jobs:
          IMPORT_HOST_DIR=/srv/familienarchiv-production/import
          POSTGRES_USER=archiv
          SENTRY_DSN=${{ secrets.SENTRY_DSN }}
+          GRAFANA_DB_PASSWORD=${{ secrets.GRAFANA_DB_PASSWORD }}
          EOF

      - name: Build images
@@ -110,6 +112,7 @@ jobs:
          cp docker-compose.observability.yml /opt/familienarchiv/
          cat > /opt/familienarchiv/obs-secrets.env <<'EOF'
          GRAFANA_ADMIN_PASSWORD=${{ secrets.GRAFANA_ADMIN_PASSWORD }}
+          GRAFANA_DB_PASSWORD=${{ secrets.GRAFANA_DB_PASSWORD }}
          GLITCHTIP_SECRET_KEY=${{ secrets.GLITCHTIP_SECRET_KEY }}
          POSTGRES_PASSWORD=${{ secrets.PROD_POSTGRES_PASSWORD }}
          POSTGRES_HOST=archiv-production-db-1
--- a/backend/src/main/java/org/raddatz/familienarchiv/config/FlywayConfig.java
+++ b/backend/src/main/java/org/raddatz/familienarchiv/config/FlywayConfig.java
@@ -7,12 +7,15 @@ import org.springframework.context.annotation.Bean;
 import org.springframework.context.annotation.Configuration;

 import javax.sql.DataSource;
+import java.util.Map;

 @Configuration
 @RequiredArgsConstructor
 @Slf4j
 public class FlywayConfig {

+    private static final String GRAFANA_DB_PASSWORD_FALLBACK = "changeme-grafana-db-password";
+
    private final DataSource dataSource;

    @Bean(name = "flyway")
@@ -21,6 +24,7 @@ public class FlywayConfig {
        Flyway flyway = Flyway.configure()
                .dataSource(dataSource)
                .locations("classpath:db/migration")
+                .placeholders(Map.of("grafanaDbPassword", resolveGrafanaDbPassword()))
                .baselineOnMigrate(true)
                .baselineVersion("4")
                .load();
@@ -28,4 +32,14 @@ public class FlywayConfig {
        log.info("Flyway: {} migration(s) applied.", result.migrationsExecuted);
        return flyway;
    }
+
+    private String resolveGrafanaDbPassword() {
+        String value = System.getenv("GRAFANA_DB_PASSWORD");
+        if (value == null || value.isBlank()) {
+            log.warn("GRAFANA_DB_PASSWORD is not set; the grafana_reader role will use a non-secret fallback. "
+                    + "Set GRAFANA_DB_PASSWORD in production to enable the Grafana PostgreSQL datasource.");
+            return GRAFANA_DB_PASSWORD_FALLBACK;
+        }
+        return value;
+    }
 }
--- a/backend/src/main/resources/db/migration/V68__add_grafana_reader_role.sql
+++ b/backend/src/main/resources/db/migration/V68__add_grafana_reader_role.sql
@@ -0,0 +1,17 @@
+-- Read-only role used by the Grafana PostgreSQL datasource for the PO Overview
+-- dashboard (issue #651). Password is injected at migration time via the Flyway
+-- placeholder ${grafanaDbPassword}, supplied by FlywayConfig from the
+-- GRAFANA_DB_PASSWORD environment variable.
+DO $$
+BEGIN
+    IF NOT EXISTS (SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = 'grafana_reader') THEN
+        EXECUTE format('CREATE ROLE grafana_reader WITH LOGIN PASSWORD %L', '${grafanaDbPassword}');
+    ELSE
+        EXECUTE format('ALTER ROLE grafana_reader WITH LOGIN PASSWORD %L', '${grafanaDbPassword}');
+    END IF;
+END
+$$;
+
+GRANT CONNECT ON DATABASE ${flyway:database} TO grafana_reader;
+GRANT USAGE  ON SCHEMA   public               TO grafana_reader;
+GRANT SELECT ON audit_log, documents, transcription_blocks TO grafana_reader;
--- a/backend/src/test/java/org/raddatz/familienarchiv/config/GrafanaReaderRoleIntegrationTest.java
+++ b/backend/src/test/java/org/raddatz/familienarchiv/config/GrafanaReaderRoleIntegrationTest.java
@@ -0,0 +1,47 @@
+package org.raddatz.familienarchiv.config;
+
+import org.junit.jupiter.api.Test;
+import org.raddatz.familienarchiv.PostgresContainerConfig;
+import org.springframework.beans.factory.annotation.Autowired;
+import org.springframework.boot.data.jpa.test.autoconfigure.DataJpaTest;
+import org.springframework.boot.jdbc.test.autoconfigure.AutoConfigureTestDatabase;
+import org.springframework.context.annotation.Import;
+import org.springframework.jdbc.core.JdbcTemplate;
+
+import static org.assertj.core.api.Assertions.assertThat;
+
+@DataJpaTest
+@AutoConfigureTestDatabase(replace = AutoConfigureTestDatabase.Replace.NONE)
+@Import({PostgresContainerConfig.class, FlywayConfig.class})
+class GrafanaReaderRoleIntegrationTest {
+
+    @Autowired JdbcTemplate jdbc;
+
+    @Test
+    void grafana_reader_has_select_on_audit_log() {
+        assertThat(hasSelect("audit_log")).isTrue();
+    }
+
+    @Test
+    void grafana_reader_has_select_on_documents() {
+        assertThat(hasSelect("documents")).isTrue();
+    }
+
+    @Test
+    void grafana_reader_has_select_on_transcription_blocks() {
+        assertThat(hasSelect("transcription_blocks")).isTrue();
+    }
+
+    @Test
+    void grafana_reader_has_no_select_on_app_users() {
+        assertThat(hasSelect("app_users")).isFalse();
+    }
+
+    private boolean hasSelect(String table) {
+        Boolean result = jdbc.queryForObject(
+                "SELECT has_table_privilege('grafana_reader', ?, 'SELECT')",
+                Boolean.class,
+                table);
+        return Boolean.TRUE.equals(result);
+    }
+}
--- a/docker-compose.observability.yml
+++ b/docker-compose.observability.yml
@@ -147,6 +147,9 @@ services:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-changeme}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_ROOT_URL: ${GF_SERVER_ROOT_URL:-http://localhost:3003}
+      # Read-only password for the grafana_reader PostgreSQL role; interpolated
+      # into the provisioned PostgreSQL datasource (see datasources.yml).
+      GRAFANA_DB_PASSWORD: ${GRAFANA_DB_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
      - ./infra/observability/grafana/provisioning:/etc/grafana/provisioning:ro
@@ -165,6 +168,7 @@ services:
        condition: service_healthy
    networks:
      - obs-net
+      - archiv-net   # PO Overview dashboard queries archive-db via the grafana_reader role

  # --- Error Tracking: GlitchTip ---

--- a/docker-compose.prod.yml
+++ b/docker-compose.prod.yml
@@ -227,6 +227,9 @@ services:
      SPRING_DATASOURCE_URL: jdbc:postgresql://db:5432/archiv
      SPRING_DATASOURCE_USERNAME: archiv
      SPRING_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD}
+      # Consumed by Flyway V68 via the ${grafanaDbPassword} placeholder to set
+      # the read-only grafana_reader role's password.
+      GRAFANA_DB_PASSWORD: ${GRAFANA_DB_PASSWORD}
      # Application uses the bucket-scoped service account, not MinIO root.
      S3_ENDPOINT: http://minio:9000
      S3_ACCESS_KEY: archiv-app
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -163,6 +163,9 @@ services:
      SPRING_DATASOURCE_URL: jdbc:postgresql://db:5432/${POSTGRES_DB}
      SPRING_DATASOURCE_USERNAME: ${POSTGRES_USER}
      SPRING_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD}
+      # Consumed by Flyway V68 via the ${grafanaDbPassword} placeholder to set
+      # the read-only grafana_reader role's password.
+      GRAFANA_DB_PASSWORD: ${GRAFANA_DB_PASSWORD}
      S3_ENDPOINT: http://minio:9000
      S3_ACCESS_KEY: ${MINIO_ROOT_USER}
      S3_SECRET_KEY: ${MINIO_ROOT_PASSWORD}
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -152,6 +152,7 @@ All vars are set in `.env` at the repo root (copy from `.env.example`). The back
 | `PORT_GRAFANA` | Host port for the Grafana UI (bound to `127.0.0.1` only) | `3003` | — | — |
 | `POSTGRES_HOST` | PostgreSQL hostname for GlitchTip's db-init job and workers. Override when only the staging stack is running and `archive-db` is not resolvable by that name. | `archive-db` | — | — |
 | `GRAFANA_ADMIN_PASSWORD` | Grafana `admin` user password | `changeme` | YES (prod) | YES |
+| `GRAFANA_DB_PASSWORD` | Password for the read-only `grafana_reader` PostgreSQL role used by the PO Overview dashboard (issue #651). Consumed by Flyway V68 and the Grafana PostgreSQL datasource. Generate with `openssl rand -hex 32`. | — | YES (prod) | YES |
 | `PORT_GLITCHTIP` | Host port for the GlitchTip UI (bound to `127.0.0.1` only) | `3002` | — | — |
 | `GLITCHTIP_DOMAIN` | Public-facing base URL for GlitchTip (used in email links and CORS) | `http://localhost:3002` | YES (prod) | — |
 | `GLITCHTIP_SECRET_KEY` | Django secret key for GlitchTip — generate with `python3 -c "import secrets; print(secrets.token_hex(32))"` | — | YES | YES |
@@ -256,6 +257,7 @@ git.raddatz.cloud      A   <server IP>
 | `MAIL_USERNAME` | release.yml | SMTP user |
 | `MAIL_PASSWORD` | release.yml | SMTP password |
 | `GRAFANA_ADMIN_PASSWORD` | both | Grafana `admin` login — generate a strong password |
+| `GRAFANA_DB_PASSWORD` | both | Read-only `grafana_reader` role password — `openssl rand -hex 32` |
 | `GLITCHTIP_SECRET_KEY` | both | Django secret key — `openssl rand -hex 32` |
 | `SENTRY_DSN` | both | GlitchTip project DSN — set after first-run (§4); leave empty to keep Sentry disabled |
 | `VITE_SENTRY_DSN` | both | GlitchTip frontend project DSN — set after first-run (§4); leave empty to keep Sentry disabled |
@@ -357,6 +359,7 @@ Both files are passed explicitly via `--env-file` to the compose command, so the
 | Gitea secret | Notes |
 |---|---|
 | `GRAFANA_ADMIN_PASSWORD` | Strong unique password; shared by nightly and release |
+| `GRAFANA_DB_PASSWORD` | `openssl rand -hex 32`; shared by nightly and release — read-only DB role for the PO Overview dashboard |
 | `GLITCHTIP_SECRET_KEY` | `openssl rand -hex 32`; shared by nightly and release |
 | `STAGING_POSTGRES_PASSWORD` / `PROD_POSTGRES_PASSWORD` | Must match the running PostgreSQL container |

--- a/docs/GLOSSARY.md
+++ b/docs/GLOSSARY.md
@@ -80,6 +80,14 @@ _See also [DocumentStatus lifecycle](#documentstatus-lifecycle)._

 **Sütterlin** — A specific standardized style of Kurrent taught in German schools from 1915 to 1941.

+**Illegible word** — a word whose recognition confidence falls below the configured threshold; replaced with the literal token `[unleserlich]` in the rendered block text and counted in the `ocr_illegible_words_total` Prometheus counter.
+
+**Models-ready gauge** — the `ocr_models_ready` Prometheus gauge, flipped from `0` to `1` once the FastAPI lifespan startup has finished loading the Kraken model and the spell-checker. Used both for the `/health` endpoint and as the supervised signal for the `ocr_models_ready < 1 for 2m` alert.
+
+**Recognition model accuracy** — the accuracy reported by `ketos train` for the recognition (text-line) model, exposed as `ocr_model_accuracy{kind="recognition"}`. Sourced from `_parse_best_checkpoint` on the highest-scoring checkpoint after training.
+
+**Segmentation model accuracy** — the accuracy reported by `ketos segtrain` for the baseline layout analysis (`blla`) model, exposed as `ocr_model_accuracy{kind="segmentation"}`. Distinct from recognition accuracy because the two models are trained and improved independently.
+
 ---

 ## Other Domain Terms
--- a/docs/OBSERVABILITY.md
+++ b/docs/OBSERVABILITY.md
@@ -118,11 +118,14 @@ To find a trace for a specific request in staging/production, either increase th

 ## Metrics (Prometheus → Grafana)

-Prometheus scrapes the backend management endpoint every 15 s:
+Prometheus scrapes two targets every 15 s:

 ```
 Target: backend:8081/actuator/prometheus
 Labels: job="spring-boot", application="Familienarchiv"
+
+Target: ocr:8000/metrics
+Labels: job="ocr-service"
 ```

 All Spring Boot metrics carry the `application="Familienarchiv"` tag, which is how the Grafana Spring Boot Observability dashboard (ID 17175) filters to this service.
@@ -146,6 +149,70 @@ jvm_memory_used_bytes{area="heap", application="Familienarchiv"}
 hikaricp_connections_active
 ```

+### OCR-service custom metrics
+
+Exposed at `ocr:8000/metrics` by `prometheus-fastapi-instrumentator`. The
+`http_*` metrics describe the FastAPI request layer; the `ocr_*` series are
+domain-specific. **Never label these with PII or document content** — labels
+have unbounded cardinality risk and are visible to anyone with Grafana access.
+
+| Metric | Type | Labels | Unit | What it tracks |
+|---|---|---|---|---|
+| `ocr_jobs_total` | Counter | `engine` (`surya`/`kraken`), `script_type` | jobs | OCR jobs that started after a successful PDF download |
+| `ocr_pages_total` | Counter | `engine` | pages | Successfully OCR'd pages in the streaming generator |
+| `ocr_skipped_pages_total` | Counter | — | pages | Pages skipped because the engine raised on them |
+| `ocr_words_total` | Counter | — | words | Recognized words summed across every block |
+| `ocr_illegible_words_total` | Counter | — | words | Words below the confidence threshold (rendered as `[unleserlich]`) |
+| `ocr_processing_seconds` | Histogram | `engine` | seconds | Per-page (stream) or per-document (`/ocr`) engine time, excluding preprocessing |
+| `ocr_training_runs_total` | Counter | `kind` (`recognition`/`segmentation`), `outcome` (`success`/`error`) | runs | Completed training runs |
+| `ocr_model_accuracy` | Gauge | `kind` | ratio (0–1) | Latest accuracy reported by a successful training run |
+| `ocr_models_ready` | Gauge | — | 0\|1 | 1 once the lifespan startup has finished loading models |
+
+Canonical example queries (the same ones referenced in issue #652):
+
+```promql
+# OCR throughput by engine
+sum by (engine) (rate(ocr_pages_total[5m]))
+
+# Share of words rendered as [unleserlich]
+sum(rate(ocr_illegible_words_total[5m]))
+  / sum(rate(ocr_words_total[5m]))
+
+# p95 page processing time per engine
+histogram_quantile(0.95, sum by (engine, le) (
+  rate(ocr_processing_seconds_bucket[5m])
+))
+
+# Training error rate
+sum(rate(ocr_training_runs_total{outcome="error"}[1h]))
+  / sum(rate(ocr_training_runs_total[1h]))
+
+# Latest recognition vs segmentation accuracy
+ocr_model_accuracy
+```
+
+### Internal-only endpoints
+
+`/metrics` is exposed by the OCR service over plain HTTP without
+authentication. The container is reachable only on the internal Docker
+network — Caddy never proxies to it directly. If the service is ever
+exposed (e.g. a `ports:` mapping is added), block the endpoint at the
+reverse proxy:
+
+```caddy
+ocr.example.com {
+    @internal_only path /metrics /health
+    respond @internal_only 404
+    reverse_proxy ocr:8000
+}
+```
+
+The `MetricsPathFilter` in `ocr-service/main.py` suppresses uvicorn's
+**stdout** access log lines for `/metrics` and `/health` so the container
+console stays focused on real OCR traffic. Promtail/Loki still receive
+access lines from any other source. Treat the filter as console
+noise-control, not an audit-suppression mechanism.
+
 ## Errors (GlitchTip)

 GlitchTip receives errors from both the backend (via Sentry Java SDK) and the frontend (via Sentry JavaScript SDK). It groups events by fingerprint, tracks first/last seen times, and links to the release that introduced the error.
--- a/docs/adr/023-prometheus-instrumentator-and-metrics-registry-injection.md
+++ b/docs/adr/023-prometheus-instrumentator-and-metrics-registry-injection.md
@@ -0,0 +1,94 @@
+# ADR-023: Prometheus Instrumentator and Metrics Registry Injection
+
+## Status
+
+Accepted
+
+## Context
+
+Until issue #652 the OCR service exposed no `/metrics` endpoint. The
+observability stack already scrapes the Spring Boot backend's actuator
+endpoint, but it had nothing to scrape on the Python side. Without HTTP-
+and domain-level metrics from `ocr-service` we cannot answer questions
+like "what is the share of words rendered as `[unleserlich]`" or
+"is the training error rate above its budget" from Grafana.
+
+Two implementation requirements influenced the design:
+
+1. **Counter / gauge isolation in tests.** `prometheus_client` collectors
+   are module-level singletons keyed by name on the global `REGISTRY`.
+   Re-importing or naively re-instantiating them raises a
+   duplicated-collector error, and state leaks across tests (a `.inc()`
+   in test A is still readable by test B). A test harness needs a way to
+   swap the active container for a fresh per-test instance.
+
+2. **Minimal blast radius on the request path.** We did not want to
+   hand-instrument every endpoint with FastAPI middleware. The
+   `prometheus-fastapi-instrumentator` library already provides
+   `http_requests_total`, `http_request_duration_seconds`, and the
+   `/metrics` exposition route, all idiomatic Prometheus names.
+
+## Decision
+
+- Add `prometheus-fastapi-instrumentator==7.0.0` and pin its transitive
+  dependency `prometheus-client==0.25.0` explicitly in
+  `ocr-service/requirements.txt`.
+- Mount the instrumentator once at module load:
+  `Instrumentator(excluded_handlers=["/health", "/metrics"]).instrument(app).expose(app)`.
+  This adds `/metrics` and an HTTP-level dashboard surface without
+  changing any endpoint code.
+- Define every domain metric (`ocr_jobs_total`, `ocr_pages_total`,
+  `ocr_processing_seconds`, …) inside a `build_metrics(registry)`
+  factory in `ocr-service/metrics.py` that returns a frozen `OcrMetrics`
+  dataclass. Production code binds the container to the default
+  `REGISTRY` once: `metrics: OcrMetrics = build_metrics(REGISTRY)`.
+- Tests use a `fresh_metrics` fixture that builds a new
+  `CollectorRegistry()` per test and monkeypatches `main.metrics` with
+  a container bound to it. The endpoint code keeps reading
+  `metrics.<name>` without knowing whether it is talking to the global
+  registry or a per-test one.
+
+## Consequences
+
+**Positive**
+
+- One reusable factory captures the metric definitions; future metrics
+  go in one place.
+- Tests run with full counter isolation. Cross-test state leakage is
+  impossible because each test sees its own dataclass instance.
+- The instrumentator gives us `http_*` metrics for free, including a
+  Grafana-ready histogram that pairs with the Spring Boot one.
+
+**Negative**
+
+- One extra level of indirection: any test that asserts on metric
+  values must remember to monkeypatch `main.metrics`, not the registry
+  directly. Rebinding through the registry is harmless but useless —
+  the dataclass holds references to the original collectors.
+- `prometheus-client` is now pinned. Upgrading it requires an explicit
+  bump and re-checking the instrumentator's compatibility range.
+- `/metrics` is exposed unauthenticated and relies on the Docker
+  internal network for confidentiality. See
+  [docs/OBSERVABILITY.md §Internal-only endpoints](../OBSERVABILITY.md)
+  for the Caddy snippet that must be added if the service ever gets a
+  host-side port mapping.
+
+## Alternatives considered
+
+- **Hand-roll the `/metrics` endpoint.** Rejected: would have meant
+  duplicating what `prometheus-fastapi-instrumentator` ships, plus
+  middleware for the HTTP histograms.
+- **Skip the factory; pass `registry` as a function argument
+  everywhere.** Rejected: clutters every endpoint signature and breaks
+  the symmetry with the Spring Boot side, which also relies on a
+  process-global Micrometer registry.
+- **Use a `pytest` autouse fixture that resets `REGISTRY` between
+  tests.** Rejected: `prometheus_client` does not expose a clean
+  "unregister all" hook, and we would be relying on private APIs.
+
+## References
+
+- Issue: [#652](https://git.raddatz.cloud/marcel/familienarchiv/issues/652)
+- Library: <https://github.com/trallnag/prometheus-fastapi-instrumentator>
+- Code: `ocr-service/metrics.py`, `ocr-service/main.py`,
+  `ocr-service/test_metrics.py`
--- a/docs/architecture/c4/l2-containers.puml
+++ b/docs/architecture/c4/l2-containers.puml
@@ -43,9 +43,12 @@ Rel(ocr, storage, "Fetches PDF via presigned URL", "HTTP / S3 presigned")
 Rel(mc, storage, "Bootstraps bucket + service account on startup", "MinIO Client CLI")
 Rel(promtail, loki, "Pushes log streams", "HTTP/Loki push API")
 Rel(backend, tempo, "Sends distributed traces via OTLP", "HTTP / OTLP / port 4318 (archiv-net)")
+Rel(prometheus, backend, "Scrapes JVM + HTTP metrics", "HTTP 8081 /actuator/prometheus")
+Rel(prometheus, ocr, "Scrapes OCR + http_* metrics", "HTTP 8000 /metrics")
 Rel(grafana, prometheus, "Queries metrics", "HTTP 9090")
 Rel(grafana, loki, "Queries logs", "HTTP 3100")
 Rel(grafana, tempo, "Queries traces", "HTTP 3200")
+Rel(grafana, db, "Read-only dashboard queries via grafana_reader role", "PostgreSQL / archiv-net")
 Rel(glitchtip, db, "Stores error events in glitchtip DB", "PostgreSQL / archiv-net")
 Rel(obs_glitchtip_worker, obs_redis, "Processes Celery tasks", "Redis / obs-net")

--- a/infra/observability/grafana/provisioning/dashboards/po-overview.json
+++ b/infra/observability/grafana/provisioning/dashboards/po-overview.json
@@ -0,0 +1,702 @@
+{
+  "annotations": {
+    "list": [
+      {
+        "builtIn": 1,
+        "datasource": { "type": "grafana", "uid": "grafana" },
+        "enable": true,
+        "hide": true,
+        "iconColor": "rgba(0, 211, 255, 1)",
+        "name": "Annotations & Alerts",
+        "type": "dashboard"
+      }
+    ]
+  },
+  "description": "Product owner overview — system health, user activity, archive progress, and OCR quality at a weekly glance.",
+  "editable": true,
+  "fiscalYearStartMonth": 0,
+  "graphTooltip": 0,
+  "id": null,
+  "links": [],
+  "liveNow": false,
+  "panels": [
+    {
+      "collapsed": false,
+      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 },
+      "id": 100,
+      "title": "System Health",
+      "type": "row",
+      "panels": []
+    },
+    {
+      "id": 1,
+      "title": "Backend Status",
+      "type": "stat",
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 1 },
+      "targets": [
+        {
+          "expr": "up{job=\"spring-boot\"}",
+          "instant": true,
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "mappings": [
+            { "type": "value", "options": { "0": { "text": "DOWN", "color": "red" } } },
+            { "type": "value", "options": { "1": { "text": "UP", "color": "green" } } }
+          ],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "red", "value": null },
+              { "color": "green", "value": 1 }
+            ]
+          },
+          "color": { "mode": "thresholds" }
+        }
+      },
+      "options": {
+        "colorMode": "background",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "textMode": "value"
+      }
+    },
+    {
+      "id": 2,
+      "title": "Server Errors (5xx)",
+      "type": "stat",
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "gridPos": { "h": 4, "w": 6, "x": 6, "y": 1 },
+      "targets": [
+        {
+          "expr": "sum(increase(http_server_requests_seconds_count{status=~\"5..\"}[$__range]))",
+          "instant": true,
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "decimals": 0,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "yellow", "value": 1 },
+              { "color": "red", "value": 6 }
+            ]
+          },
+          "color": { "mode": "thresholds" }
+        }
+      },
+      "options": {
+        "colorMode": "background",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
+      }
+    },
+    {
+      "id": 3,
+      "title": "Response Time (p95)",
+      "type": "stat",
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "gridPos": { "h": 4, "w": 6, "x": 12, "y": 1 },
+      "targets": [
+        {
+          "expr": "histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[$__range])) by (le))",
+          "instant": true,
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "s",
+          "decimals": 2,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "yellow", "value": 0.5 },
+              { "color": "red", "value": 2 }
+            ]
+          },
+          "color": { "mode": "thresholds" }
+        }
+      },
+      "options": {
+        "colorMode": "background",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
+      }
+    },
+    {
+      "id": 4,
+      "title": "Error Log Count",
+      "type": "stat",
+      "datasource": { "type": "loki", "uid": "loki" },
+      "gridPos": { "h": 4, "w": 6, "x": 18, "y": 1 },
+      "targets": [
+        {
+          "expr": "sum(count_over_time({compose_service=\"backend\"} | json | level=\"ERROR\" [$__range]))",
+          "queryType": "instant",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "decimals": 0,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "yellow", "value": 1 },
+              { "color": "red", "value": 10 }
+            ]
+          },
+          "color": { "mode": "thresholds" }
+        }
+      },
+      "options": {
+        "colorMode": "background",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
+      }
+    },
+    {
+      "id": 5,
+      "title": "CPU Usage",
+      "type": "bargauge",
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "gridPos": { "h": 5, "w": 8, "x": 0, "y": 5 },
+      "targets": [
+        {
+          "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
+          "instant": true,
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "max": 100,
+          "decimals": 0,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "yellow", "value": 70 },
+              { "color": "red", "value": 85 }
+            ]
+          },
+          "color": { "mode": "thresholds" }
+        }
+      },
+      "options": {
+        "displayMode": "gradient",
+        "orientation": "horizontal",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "showUnfilled": true
+      }
+    },
+    {
+      "id": 6,
+      "title": "Memory Usage",
+      "type": "bargauge",
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "gridPos": { "h": 5, "w": 8, "x": 8, "y": 5 },
+      "targets": [
+        {
+          "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
+          "instant": true,
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "max": 100,
+          "decimals": 0,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "yellow", "value": 70 },
+              { "color": "red", "value": 85 }
+            ]
+          },
+          "color": { "mode": "thresholds" }
+        }
+      },
+      "options": {
+        "displayMode": "gradient",
+        "orientation": "horizontal",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "showUnfilled": true
+      }
+    },
+    {
+      "id": 7,
+      "title": "Disk Usage",
+      "type": "bargauge",
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "gridPos": { "h": 5, "w": 8, "x": 16, "y": 5 },
+      "targets": [
+        {
+          "expr": "(1 - (node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"})) * 100",
+          "instant": true,
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "max": 100,
+          "decimals": 0,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "yellow", "value": 70 },
+              { "color": "red", "value": 80 }
+            ]
+          },
+          "color": { "mode": "thresholds" }
+        }
+      },
+      "options": {
+        "displayMode": "gradient",
+        "orientation": "horizontal",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "showUnfilled": true
+      }
+    },
+    {
+      "collapsed": false,
+      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 10 },
+      "id": 101,
+      "title": "User Activity",
+      "type": "row",
+      "panels": []
+    },
+    {
+      "id": 8,
+      "title": "Active Users",
+      "type": "stat",
+      "datasource": { "type": "postgres", "uid": "postgres" },
+      "gridPos": { "h": 4, "w": 8, "x": 0, "y": 11 },
+      "targets": [
+        {
+          "rawSql": "SELECT COUNT(DISTINCT actor_id) AS value FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'LOGIN_SUCCESS'",
+          "format": "table",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "decimals": 0,
+          "color": { "mode": "fixed", "fixedColor": "blue" }
+        }
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
+      }
+    },
+    {
+      "id": 9,
+      "title": "Total Logins",
+      "type": "stat",
+      "datasource": { "type": "postgres", "uid": "postgres" },
+      "gridPos": { "h": 4, "w": 8, "x": 8, "y": 11 },
+      "targets": [
+        {
+          "rawSql": "SELECT COUNT(*) AS value FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'LOGIN_SUCCESS'",
+          "format": "table",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "decimals": 0,
+          "color": { "mode": "fixed", "fixedColor": "blue" }
+        }
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
+      }
+    },
+    {
+      "id": 10,
+      "title": "Failed Login Attempts",
+      "type": "stat",
+      "datasource": { "type": "postgres", "uid": "postgres" },
+      "gridPos": { "h": 4, "w": 8, "x": 16, "y": 11 },
+      "targets": [
+        {
+          "rawSql": "SELECT COUNT(*) AS value FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind IN ('LOGIN_FAILED', 'LOGIN_RATE_LIMITED')",
+          "format": "table",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "decimals": 0,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "yellow", "value": 1 },
+              { "color": "red", "value": 4 }
+            ]
+          },
+          "color": { "mode": "thresholds" }
+        }
+      },
+      "options": {
+        "colorMode": "background",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
+      }
+    },
+    {
+      "id": 11,
+      "title": "Daily Logins (last 7 days)",
+      "type": "barchart",
+      "datasource": { "type": "postgres", "uid": "postgres" },
+      "gridPos": { "h": 7, "w": 24, "x": 0, "y": 15 },
+      "targets": [
+        {
+          "rawSql": "SELECT DATE_TRUNC('day', happened_at) AS time, COUNT(*) AS logins FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'LOGIN_SUCCESS' GROUP BY 1 ORDER BY 1",
+          "format": "time_series",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "decimals": 0,
+          "color": { "mode": "fixed", "fixedColor": "blue" }
+        }
+      },
+      "options": {
+        "legend": { "displayMode": "hidden" },
+        "orientation": "auto",
+        "showValue": "auto",
+        "stacking": "none",
+        "xTickLabelRotation": 0,
+        "xTickLabelSpacing": 0
+      }
+    },
+    {
+      "collapsed": false,
+      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 22 },
+      "id": 102,
+      "title": "Archive Progress",
+      "type": "row",
+      "panels": []
+    },
+    {
+      "id": 12,
+      "title": "Transcription Coverage",
+      "type": "bargauge",
+      "datasource": { "type": "postgres", "uid": "postgres" },
+      "gridPos": { "h": 5, "w": 24, "x": 0, "y": 23 },
+      "targets": [
+        {
+          "rawSql": "SELECT (COUNT(*) FILTER (WHERE text IS NOT NULL AND text <> ''))::float * 100.0 / NULLIF(COUNT(*), 0) AS percent_complete FROM transcription_blocks",
+          "format": "table",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "min": 0,
+          "max": 100,
+          "decimals": 1,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "red", "value": null },
+              { "color": "yellow", "value": 25 },
+              { "color": "green", "value": 75 }
+            ]
+          },
+          "color": { "mode": "thresholds" }
+        }
+      },
+      "options": {
+        "displayMode": "gradient",
+        "orientation": "horizontal",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "showUnfilled": true
+      }
+    },
+    {
+      "id": 13,
+      "title": "Total Documents",
+      "type": "stat",
+      "datasource": { "type": "postgres", "uid": "postgres" },
+      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 28 },
+      "targets": [
+        {
+          "rawSql": "SELECT COUNT(*) AS value FROM documents WHERE status <> 'PLACEHOLDER'",
+          "format": "table",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "decimals": 0,
+          "color": { "mode": "fixed", "fixedColor": "blue" }
+        }
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
+      }
+    },
+    {
+      "id": 14,
+      "title": "Uploads This Week",
+      "type": "stat",
+      "datasource": { "type": "postgres", "uid": "postgres" },
+      "gridPos": { "h": 4, "w": 6, "x": 6, "y": 28 },
+      "targets": [
+        {
+          "rawSql": "SELECT COUNT(*) AS value FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'FILE_UPLOADED'",
+          "format": "table",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "decimals": 0,
+          "color": { "mode": "fixed", "fixedColor": "blue" }
+        }
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
+      }
+    },
+    {
+      "id": 15,
+      "title": "Blocks Transcribed This Week",
+      "type": "stat",
+      "datasource": { "type": "postgres", "uid": "postgres" },
+      "gridPos": { "h": 4, "w": 6, "x": 12, "y": 28 },
+      "targets": [
+        {
+          "rawSql": "SELECT COUNT(*) AS value FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'TEXT_SAVED'",
+          "format": "table",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "decimals": 0,
+          "color": { "mode": "fixed", "fixedColor": "blue" }
+        }
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
+      }
+    },
+    {
+      "id": 16,
+      "title": "Blocks Reviewed This Week",
+      "type": "stat",
+      "datasource": { "type": "postgres", "uid": "postgres" },
+      "gridPos": { "h": 4, "w": 6, "x": 18, "y": 28 },
+      "targets": [
+        {
+          "rawSql": "SELECT COUNT(*) AS value FROM audit_log WHERE happened_at >= NOW() - INTERVAL '7 days' AND kind = 'BLOCK_REVIEWED'",
+          "format": "table",
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "decimals": 0,
+          "color": { "mode": "fixed", "fixedColor": "blue" }
+        }
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
+      }
+    },
+    {
+      "collapsed": false,
+      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 32 },
+      "id": 103,
+      "title": "OCR Health",
+      "type": "row",
+      "panels": []
+    },
+    {
+      "id": 17,
+      "title": "OCR Jobs",
+      "type": "stat",
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 33 },
+      "targets": [
+        {
+          "expr": "sum(increase(ocr_jobs_total[$__range]))",
+          "instant": true,
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "short",
+          "decimals": 0,
+          "color": { "mode": "fixed", "fixedColor": "blue" }
+        }
+      },
+      "options": {
+        "colorMode": "value",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
+      }
+    },
+    {
+      "id": 18,
+      "title": "OCR Page Error Rate",
+      "type": "stat",
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "gridPos": { "h": 4, "w": 6, "x": 6, "y": 33 },
+      "targets": [
+        {
+          "expr": "sum(increase(ocr_skipped_pages_total[$__range])) / clamp_min(sum(increase(ocr_pages_total[$__range])) + sum(increase(ocr_skipped_pages_total[$__range])), 1)",
+          "instant": true,
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percentunit",
+          "decimals": 1,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "yellow", "value": 0.01 },
+              { "color": "red", "value": 0.05 }
+            ]
+          },
+          "color": { "mode": "thresholds" }
+        }
+      },
+      "options": {
+        "colorMode": "background",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
+      }
+    },
+    {
+      "id": 19,
+      "title": "Illegible Word Rate",
+      "type": "stat",
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "gridPos": { "h": 4, "w": 6, "x": 12, "y": 33 },
+      "targets": [
+        {
+          "expr": "sum(increase(ocr_illegible_words_total[$__range])) / clamp_min(sum(increase(ocr_words_total[$__range])), 1)",
+          "instant": true,
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percentunit",
+          "decimals": 1,
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "yellow", "value": 0.1 },
+              { "color": "red", "value": 0.25 }
+            ]
+          },
+          "color": { "mode": "thresholds" }
+        }
+      },
+      "options": {
+        "colorMode": "background",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false }
+      }
+    },
+    {
+      "id": 20,
+      "title": "OCR Service Status",
+      "type": "stat",
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "gridPos": { "h": 4, "w": 6, "x": 18, "y": 33 },
+      "targets": [
+        {
+          "expr": "ocr_models_ready",
+          "instant": true,
+          "refId": "A"
+        }
+      ],
+      "fieldConfig": {
+        "defaults": {
+          "mappings": [
+            { "type": "value", "options": { "0": { "text": "NOT READY", "color": "red" } } },
+            { "type": "value", "options": { "1": { "text": "READY", "color": "green" } } }
+          ],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "red", "value": null },
+              { "color": "green", "value": 1 }
+            ]
+          },
+          "color": { "mode": "thresholds" }
+        }
+      },
+      "options": {
+        "colorMode": "background",
+        "graphMode": "none",
+        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
+        "textMode": "value"
+      }
+    }
+  ],
+  "refresh": "",
+  "schemaVersion": 39,
+  "tags": ["po-overview", "familienarchiv"],
+  "templating": { "list": [] },
+  "time": { "from": "now-7d", "to": "now" },
+  "timepicker": {},
+  "timezone": "browser",
+  "title": "PO Overview",
+  "uid": "po-overview",
+  "version": 1,
+  "weekStart": ""
+}
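Review note: the two OCR ratio panels above divide by `clamp_min(..., 1)` so that an empty week renders as 0 % rather than a division by zero, and the result is then coloured by absolute threshold steps. The sketch below mirrors that arithmetic in plain Python, assuming the steps are listed in ascending order as in the dashboard JSON. `guarded_ratio` and `threshold_color` are hypothetical helper names used only for illustration; they are not part of Grafana or of this repository.

```python
from typing import Optional

# Threshold steps copied from the "OCR Page Error Rate" panel (ascending order).
STEPS: list[tuple[Optional[float], str]] = [
    (None, "green"),   # base step, applies to every value
    (0.01, "yellow"),
    (0.05, "red"),
]


def guarded_ratio(numerator: float, denominator: float) -> float:
    """Mirror `numerator / clamp_min(denominator, 1)` from the PromQL panels."""
    return numerator / max(denominator, 1.0)


def threshold_color(value: float, steps=STEPS) -> str:
    """Return the colour of the last step whose lower bound is <= value."""
    color = steps[0][1]
    for lower_bound, step_color in steps:
        if lower_bound is None or value >= lower_bound:
            color = step_color
    return color
```

With no OCR activity both counters increase by zero, so the guarded ratio is 0 and the panel stays green.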
--- a/infra/observability/grafana/provisioning/datasources/datasources.yml
+++ b/infra/observability/grafana/provisioning/datasources/datasources.yml
@@ -36,3 +36,19 @@ datasources:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
+
+  # Read-only PostgreSQL datasource for the PO Overview dashboard (issue #651).
+  # Uses the grafana_reader role provisioned by Flyway V68. Traffic stays inside
+  # archiv-net, so sslmode=disable is the deliberate, accepted setting.
+  - name: PostgreSQL
+    type: postgres
+    uid: postgres
+    url: archive-db:5432
+    user: grafana_reader
+    editable: false
+    secureJsonData:
+      password: ${GRAFANA_DB_PASSWORD}
+    jsonData:
+      database: ${POSTGRES_DB}
+      sslmode: disable
+      postgresVersion: 1600
--- a/infra/observability/obs.env
+++ b/infra/observability/obs.env
@@ -16,6 +16,11 @@ GLITCHTIP_DOMAIN=https://glitchtip.archiv.raddatz.cloud

 POSTGRES_USER=archiv

+# Note: GRAFANA_DB_PASSWORD is a secret, so it is not set here. CI writes it
+# into obs-secrets.env (see .env.example for the local-dev declaration).
+# It is consumed by both archive-backend (Flyway V68 placeholder) and
+# obs-grafana (PostgreSQL datasource).
+
 # PostgreSQL hostname for GlitchTip db-init and workers.
 # The actual value depends on the Compose project name — it is not a fixed string.
 # CI sets POSTGRES_HOST in obs-secrets.env per environment:
--- a/infra/observability/prometheus/prometheus.yml
+++ b/infra/observability/prometheus/prometheus.yml
@@ -20,7 +20,4 @@ scrape_configs:
  - job_name: ocr-service
    metrics_path: /metrics
    static_configs:
-      # TODO: remove or add prometheus-client to ocr-service.
-      # The Python OCR service does not currently expose Prometheus metrics.
-      # This target will show as DOWN until prometheus-client is added to ocr-service.
      - targets: ['ocr:8000']
--- a/ocr-service/main.py
+++ b/ocr-service/main.py
@@ -2,6 +2,7 @@

 import asyncio
 import glob
+import inspect
 import io
 import json
 import logging
@@ -10,9 +11,11 @@ import re
 import shutil
 import subprocess
 import tempfile
+import time
 import zipfile
 from contextlib import asynccontextmanager
 from datetime import datetime, timezone
+from typing import Awaitable, Callable
 from urllib.parse import urlparse

 import httpx
@@ -20,8 +23,11 @@ import pypdfium2 as pdfium
 from fastapi import FastAPI, Form, Header, HTTPException, UploadFile
 from fastapi.responses import StreamingResponse
 from PIL import Image
+from prometheus_client import REGISTRY
+from prometheus_fastapi_instrumentator import Instrumentator

 from confidence import apply_confidence_markers, get_threshold
+from metrics import OcrMetrics, build_metrics
 from spell_check import correct_text, load_spell_checker
 from engines import kraken as kraken_engine
 from engines import surya as surya_engine
@@ -37,6 +43,12 @@ logger = logging.getLogger(__name__)

 _models_ready = False

+# One-shot import-time binding to the default REGISTRY. Tests that need a
+# clean counter state must monkeypatch `main.metrics` with a container built
+# from a fresh CollectorRegistry — rebinding through the registry directly
+# will not retarget the references stored in the OcrMetrics dataclass.
+metrics: OcrMetrics = build_metrics(REGISTRY)
+
 ALLOWED_PDF_HOSTS = set(
    h.strip() for h in os.getenv("ALLOWED_PDF_HOSTS", "minio,localhost,127.0.0.1").split(",")
 )
@@ -44,6 +56,42 @@ ALLOWED_PDF_HOSTS = set(
 _SPELL_CHECK_SCRIPT_TYPES = {"HANDWRITING_KURRENT", "HANDWRITING_LATIN"}


+async def _record_training(
+    runner: Callable[[], Awaitable[dict] | dict],
+    kind: str,
+) -> dict:
+    """Run a training callable and record outcome + accuracy metrics.
+
+    Wraps the per-endpoint try/except + outcome counter + accuracy gauge
+    block that used to be repeated at /train, /train-sender, and /segtrain.
+    The runner returns a dict, directly or via an awaitable; if `accuracy`
+    is missing or None, the gauge for that kind is left unchanged.
+    """
+    try:
+        result = runner()
+        if inspect.isawaitable(result):
+            result = await result
+    except Exception:
+        metrics.ocr_training_runs_total.labels(kind=kind, outcome="error").inc()
+        raise
+    metrics.ocr_training_runs_total.labels(kind=kind, outcome="success").inc()
+    if result.get("accuracy") is not None:
+        metrics.ocr_model_accuracy.labels(kind=kind).set(result["accuracy"])
+    return result
+
+
+def _observe_block_words(words: list[dict], threshold: float) -> None:
+    """Record per-block word counts and below-threshold word counts.
+
+    Precondition: `words` is non-empty. The caller checks this, so the helper
+    stays branch-free and each call site is a single line.
+    """
+    metrics.ocr_words_total.inc(len(words))
+    metrics.ocr_illegible_words_total.inc(
+        sum(1 for w in words if w["confidence"] < threshold)
+    )
+
+
 def _validate_url(url: str) -> None:
    """Validate that the PDF URL points to an allowed host (SSRF protection)."""
    parsed = urlparse(url)
@@ -63,6 +111,7 @@ async def lifespan(app: FastAPI):
    kraken_engine.load_models()
    load_spell_checker()
    _models_ready = True
+    metrics.ocr_models_ready.set(1)
    logger.info("Startup complete — ready to accept requests")

    yield
@@ -72,6 +121,28 @@ async def lifespan(app: FastAPI):

 app = FastAPI(title="Familienarchiv OCR Service", lifespan=lifespan)

+# /metrics is unauthenticated — relies on Docker-internal-network exposure
+# only (CWE-200 risk if `ports:` ever maps 8000 to host). See
+# docs/OBSERVABILITY.md §Internal-only endpoints for the Caddy block snippet.
+Instrumentator(excluded_handlers=["/health", "/metrics"]).instrument(app).expose(app)
+
+
+class MetricsPathFilter(logging.Filter):
+    """Drop uvicorn.access entries for /metrics and /health to keep logs focused."""
+
+    _SUPPRESSED_PATHS = {"/metrics", "/health"}
+
+    def filter(self, record: logging.LogRecord) -> bool:
+        # uvicorn.access formats as: '%s - "%s %s HTTP/%s" %d'
+        if record.args and len(record.args) >= 3:
+            path = record.args[2]
+            if isinstance(path, str) and path in self._SUPPRESSED_PATHS:
+                return False
+        return True
+
+
+logging.getLogger("uvicorn.access").addFilter(MetricsPathFilter())
+

 @app.get("/health")
 def health():
@@ -99,7 +170,9 @@ async def run_ocr(request: OcrRequest):
        del img

    script_type = request.scriptType.upper()
+    engine_name = "kraken" if script_type == "HANDWRITING_KURRENT" else "surya"

+    extract_started = time.monotonic()
    if script_type == "HANDWRITING_KURRENT":
        if not kraken_engine.is_available():
            raise HTTPException(
@@ -111,11 +184,18 @@ async def run_ocr(request: OcrRequest):
    else:
        # TYPEWRITER, HANDWRITING_LATIN, UNKNOWN — all use Surya
        blocks = await asyncio.to_thread(surya_engine.extract_blocks, images, request.language)
+    metrics.ocr_processing_seconds.labels(engine=engine_name).observe(
+        time.monotonic() - extract_started
+    )
+
+    metrics.ocr_jobs_total.labels(engine=engine_name, script_type=script_type).inc()

    threshold = get_threshold(script_type)
    for block in blocks:
-        if block.get("words"):
-            block["text"] = apply_confidence_markers(block["words"], threshold)
+        words = block.get("words") or []
+        if words:
+            _observe_block_words(words, threshold)
+            block["text"] = apply_confidence_markers(words, threshold)
        block.pop("words", None)
        if script_type in _SPELL_CHECK_SCRIPT_TYPES:
            block["text"] = correct_text(block["text"])
@@ -146,6 +226,9 @@ async def run_ocr_stream(request: OcrRequest):
        )

    engine = kraken_engine if use_kraken else surya_engine
+    engine_name = "kraken" if use_kraken else "surya"
+
+    metrics.ocr_jobs_total.labels(engine=engine_name, script_type=script_type).inc()

    if request.regions:
        # Guided mode: recognize only the user-drawn annotation regions
@@ -176,12 +259,15 @@ async def run_ocr_stream(request: OcrRequest):
                    image = await asyncio.to_thread(preprocess_page, image)
                    blocks = []
                    sender_path = request.senderModelPath if use_kraken else None
+                    engine_seconds = 0.0
                    for region in page_regions:
+                        region_started = time.monotonic()
                        text = await asyncio.to_thread(
                            engine.extract_region_text, image,
                            region.x, region.y, region.width, region.height,
                            sender_path,
                        )
+                        engine_seconds += time.monotonic() - region_started
                        if script_type in _SPELL_CHECK_SCRIPT_TYPES:
                            text = correct_text(text)
                        blocks.append({
@@ -195,7 +281,11 @@ async def run_ocr_stream(request: OcrRequest):
                            "annotationId": region.annotationId,
                        })

+                    metrics.ocr_processing_seconds.labels(engine=engine_name).observe(
+                        engine_seconds
+                    )
                    total_blocks += len(blocks)
+                    metrics.ocr_pages_total.labels(engine=engine_name).inc()
                    yield json.dumps({
                        "type": "page",
                        "pageNumber": page_idx,
@@ -205,6 +295,7 @@ async def run_ocr_stream(request: OcrRequest):
                except Exception:
                    logger.exception("Guided OCR failed on page %d", page_idx)
                    skipped_pages += 1
+                    metrics.ocr_skipped_pages_total.inc()
                    yield json.dumps({
                        "type": "error",
                        "pageNumber": page_idx,
@@ -238,18 +329,25 @@ async def run_ocr_stream(request: OcrRequest):
                yield json.dumps({"type": "preprocessing", "pageNumber": page_idx}) + "\n"
                image = await asyncio.to_thread(preprocess_page, image)
                sender_path = request.senderModelPath if use_kraken else None
+                page_started = time.monotonic()
                blocks = await asyncio.to_thread(
                    engine.extract_page_blocks, image, page_idx, request.language, sender_path
                )
+                metrics.ocr_processing_seconds.labels(engine=engine_name).observe(
+                    time.monotonic() - page_started
+                )

                for block in blocks:
-                    if block.get("words"):
-                        block["text"] = apply_confidence_markers(block["words"], threshold)
+                    words = block.get("words") or []
+                    if words:
+                        _observe_block_words(words, threshold)
+                        block["text"] = apply_confidence_markers(words, threshold)
                    block.pop("words", None)
                    if script_type in _SPELL_CHECK_SCRIPT_TYPES:
                        block["text"] = correct_text(block["text"])

                total_blocks += len(blocks)
+                metrics.ocr_pages_total.labels(engine=engine_name).inc()
                yield json.dumps({
                    "type": "page",
                    "pageNumber": page_idx,
@@ -259,6 +357,7 @@ async def run_ocr_stream(request: OcrRequest):
            except Exception:
                logger.exception("OCR failed on page %d", page_idx)
                skipped_pages += 1
+                metrics.ocr_skipped_pages_total.inc()
                yield json.dumps({
                    "type": "error",
                    "pageNumber": page_idx,
@@ -438,8 +537,7 @@ async def train_model(

            return {"loss": None, "accuracy": accuracy, "cer": cer, "epochs": epochs}

-    result = await asyncio.to_thread(_run_training)
-    return result
+    return await _record_training(lambda: asyncio.to_thread(_run_training), kind="recognition")


 @app.post("/train-sender")
@@ -518,8 +616,9 @@ async def train_sender_model(

            return {"loss": None, "accuracy": accuracy, "cer": cer, "epochs": epochs}

-    result = await asyncio.to_thread(_run_sender_training)
-    return result
+    return await _record_training(
+        lambda: asyncio.to_thread(_run_sender_training), kind="recognition"
+    )


 @app.post("/segtrain")
@@ -628,8 +727,7 @@ async def segtrain_model(

            return {"loss": None, "accuracy": accuracy, "cer": cer, "epochs": epochs}

-    result = await asyncio.to_thread(_run_segtrain)
-    return result
+    return await _record_training(lambda: asyncio.to_thread(_run_segtrain), kind="segmentation")


 async def _download_and_convert_pdf(url: str) -> list[Image.Image]:
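Review note: `_record_training` accepts a runner that may return either a dict or an awaitable, and counts the outcome either way. The sketch below reproduces that control flow with the standard library only. The `outcomes` dict is a hypothetical stand-in for `metrics.ocr_training_runs_total`, and `record_outcome` and the runner functions are illustrative names, not code from this commit.

```python
import asyncio
import inspect
from collections import defaultdict
from typing import Awaitable, Callable, Union

# Hypothetical stand-in for the labelled Prometheus counter: (kind, outcome) -> count.
outcomes: dict[tuple[str, str], int] = defaultdict(int)


async def record_outcome(
    runner: Callable[[], Union[Awaitable[dict], dict]],
    kind: str,
) -> dict:
    """Run `runner`, await its result if needed, and count success or error."""
    try:
        result = runner()
        if inspect.isawaitable(result):
            result = await result
    except Exception:
        outcomes[(kind, "error")] += 1
        raise
    outcomes[(kind, "success")] += 1
    return result


def sync_runner() -> dict:
    return {"accuracy": 0.9}


def failing_runner() -> dict:
    raise RuntimeError("training failed")


async def demo() -> None:
    # Plain synchronous callable.
    await record_outcome(sync_runner, kind="recognition")
    # Callable returning an awaitable, as the endpoints do with asyncio.to_thread.
    await record_outcome(lambda: asyncio.to_thread(sync_runner), kind="segmentation")
    # A failing runner is counted as an error and the exception still propagates.
    try:
        await record_outcome(failing_runner, kind="recognition")
    except RuntimeError:
        pass


asyncio.run(demo())
```

Re-raising after the error increment keeps the endpoint's existing failure behaviour unchanged while still recording the failed run.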
--- a/ocr-service/metrics.py
+++ b/ocr-service/metrics.py
@@ -0,0 +1,92 @@
+"""Prometheus metric definitions for the OCR service.
+
+`build_metrics(registry)` returns a fresh `OcrMetrics` instance bound to the
+given `CollectorRegistry`. Production code calls it once at module load with
+the default `REGISTRY`; tests pass a per-test `CollectorRegistry()` to keep
+counter values isolated between cases (decision #3 on issue #652).
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram
+
+
+@dataclass(frozen=True)
+class OcrMetrics:
+    """Container for every custom OCR metric.
+
+    Each field holds a fixed reference to a `prometheus_client` counter,
+    gauge, or histogram. Updating the metric (`.inc()`, `.observe()`,
+    `.set()`) is fine; rebinding a field is not, because the dataclass is
+    frozen. Call `build_metrics` to get a new container instead.
+    """
+
+    ocr_jobs_total: Counter
+    ocr_pages_total: Counter
+    ocr_skipped_pages_total: Counter
+    ocr_words_total: Counter
+    ocr_illegible_words_total: Counter
+    ocr_processing_seconds: Histogram
+    ocr_training_runs_total: Counter
+    ocr_model_accuracy: Gauge
+    ocr_models_ready: Gauge
+
+
+def build_metrics(registry: CollectorRegistry) -> OcrMetrics:
+    """Create one OcrMetrics instance bound to `registry`."""
+    return OcrMetrics(
+        ocr_jobs_total=Counter(
+            "ocr_jobs_total",
+            "Number of OCR jobs processed, labelled by engine and script type.",
+            ["engine", "script_type"],
+            registry=registry,
+        ),
+        ocr_pages_total=Counter(
+            "ocr_pages_total",
+            "Number of pages successfully OCR'd, labelled by engine.",
+            ["engine"],
+            registry=registry,
+        ),
+        ocr_skipped_pages_total=Counter(
+            "ocr_skipped_pages_total",
+            "Number of pages skipped because the OCR engine raised.",
+            registry=registry,
+        ),
+        ocr_words_total=Counter(
+            "ocr_words_total",
+            "Number of words recognized across all OCR blocks.",
+            registry=registry,
+        ),
+        ocr_illegible_words_total=Counter(
+            "ocr_illegible_words_total",
+            "Number of words below the confidence threshold "
+            "(replaced with [unleserlich]).",
+            registry=registry,
+        ),
+        ocr_processing_seconds=Histogram(
+            "ocr_processing_seconds",
+            "OCR processing time per page (streaming) or per document (non-streaming).",
+            ["engine"],
+            registry=registry,
+        ),
+        ocr_training_runs_total=Counter(
+            "ocr_training_runs_total",
+            "Number of training runs, labelled by kind (recognition|segmentation) "
+            "and outcome (success|error).",
+            ["kind", "outcome"],
+            registry=registry,
+        ),
+        ocr_model_accuracy=Gauge(
+            "ocr_model_accuracy",
+            "Latest model accuracy reported by a successful training run.",
+            ["kind"],
+            registry=registry,
+        ),
+        ocr_models_ready=Gauge(
+            "ocr_models_ready",
+            "1 once the lifespan startup has finished loading models, 0 before.",
+            registry=registry,
+        ),
+    )
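Review note: the `OcrMetrics` docstring distinguishes between updating a metric and rebinding a field on the frozen dataclass. A minimal sketch of that distinction follows, using a hypothetical `FakeCounter` in place of a real `prometheus_client` counter so that it runs with the standard library alone; `Container` is likewise an illustrative name.

```python
from dataclasses import FrozenInstanceError, dataclass


class FakeCounter:
    """Hypothetical stand-in for a prometheus_client Counter."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        self.value += amount


@dataclass(frozen=True)
class Container:
    jobs: FakeCounter


container = Container(jobs=FakeCounter())

# Updating the object behind the field is allowed on a frozen dataclass.
container.jobs.inc()

# Rebinding the field itself raises FrozenInstanceError.
try:
    container.jobs = FakeCounter()
    rebound = True
except FrozenInstanceError:
    rebound = False
```

This is why the tests swap the whole `main.metrics` container via `monkeypatch` rather than replacing individual fields.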
--- a/ocr-service/requirements.txt
+++ b/ocr-service/requirements.txt
@@ -10,3 +10,5 @@ pyvips>=2.2.0
 httpx==0.28.1
 pyspellchecker==0.9.0
 opencv-python-headless==4.11.0.86
+prometheus-fastapi-instrumentator==7.0.0
+prometheus-client==0.25.0
--- a/ocr-service/test_metrics.py
+++ b/ocr-service/test_metrics.py
@@ -0,0 +1,638 @@
+"""Tests for Prometheus metrics exposed by the OCR service.
+
+Each test that asserts on a counter/gauge value uses a fresh CollectorRegistry
+(see decision #3 on issue #652) to keep the metrics isolated between tests.
+"""
+
+import contextlib
+import io
+import zipfile
+from unittest.mock import AsyncMock, patch
+
+import pytest
+from httpx import ASGITransport, AsyncClient
+from PIL import Image
+from prometheus_client import CollectorRegistry
+
+from main import app
+from metrics import build_metrics
+
+
+@contextlib.asynccontextmanager
+async def ocr_client(*, raise_app_exceptions: bool = True):
+    """Yield an AsyncClient with model-loaders patched and _models_ready forced on.
+
+    The shared setup for almost every metrics test: stub the heavy lifecycle
+    hooks (kraken_engine.load_models, load_spell_checker), flip the readiness
+    flag so request handlers do not 503, and restore it afterwards.
+    """
+    with patch("main.kraken_engine.load_models"), \
+         patch("main.load_spell_checker"):
+        transport = ASGITransport(app=app, raise_app_exceptions=raise_app_exceptions)
+        async with AsyncClient(transport=transport, base_url="http://test") as client:
+            import main as main_module
+            main_module._models_ready = True
+            try:
+                yield client
+            finally:
+                main_module._models_ready = False
+
+
+def _minimal_zip() -> bytes:
+    """Return a ZIP containing one fake .xml so endpoint validation passes."""
+    buf = io.BytesIO()
+    with zipfile.ZipFile(buf, "w") as zf:
+        zf.writestr("page_01.xml", "<PcGts/>")
+    return buf.getvalue()
+
+
+def _fake_training_result(accuracy: float = 0.91) -> dict:
+    return {"loss": None, "accuracy": accuracy, "cer": round(1 - accuracy, 4), "epochs": 5}
+
+
+@pytest.fixture
+def fresh_metrics(monkeypatch):
+    """Replace the module-level `main.metrics` with one bound to a fresh registry."""
+    registry = CollectorRegistry()
+    test_metrics = build_metrics(registry)
+    monkeypatch.setattr("main.metrics", test_metrics)
+    return test_metrics
+
+
+@pytest.mark.asyncio
+async def test_metrics_endpoint_returns_200():
+    """`GET /metrics` returns 200 with Prometheus exposition content.
+
+    Uses the global REGISTRY by design — does NOT take the `fresh_metrics` fixture.
+    The `/metrics` endpoint is wired by `prometheus-fastapi-instrumentator`, which
+    binds to the default REGISTRY at app-construction time; swapping `main.metrics`
+    via the fixture would not redirect what `/metrics` exposes. This test only
+    asserts response shape (status code + content-type substring), not numeric
+    counter values, so cross-test state leakage cannot affect it.
+    """
+    with patch("main.kraken_engine.load_models"), \
+         patch("main.load_spell_checker"):
+        async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
+            response = await client.get("/metrics")
+
+    assert response.status_code == 200
+    assert "text/plain" in response.headers.get("content-type", "")
+
+
+@pytest.mark.asyncio
+async def test_metrics_includes_http_request_metrics_after_ocr_call():
+    """After a request to /ocr, `/metrics` exposes auto-instrumented http_* metrics.
+
+    Uses the global REGISTRY by design — does NOT take the `fresh_metrics` fixture.
+    The `http_requests_total` / `http_request_duration_seconds` metrics live on
+    the instrumentator's default REGISTRY (not on `main.metrics`), so a fresh
+    CollectorRegistry would never see them. This test only asserts response shape
+    (substring presence in the exposition body), not numeric counter values, so
+    cross-test state leakage cannot affect it.
+    """
+    mock_images = [Image.new("RGB", (100, 100))]
+    mock_blocks = [{"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0,
+                    "polygon": None, "text": "hi", "words": []}]
+
+    with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \
+         patch("main.preprocess_page", side_effect=lambda img: img), \
+         patch("main.surya_engine.extract_blocks", return_value=mock_blocks):
+        async with ocr_client() as client:
+            ocr_response = await client.post("/ocr", json={
+                "pdfUrl": "http://minio/doc.pdf",
+                "scriptType": "TYPEWRITER",
+                "language": "de",
+            })
+            assert ocr_response.status_code == 200, ocr_response.text
+
+            metrics_response = await client.get("/metrics")
+
+    body = metrics_response.text
+    assert "http_requests_total" in body
+    assert "http_request_duration_seconds" in body
+
+
+def test_build_metrics_registers_all_custom_metrics_on_given_registry():
+    """`build_metrics` returns an OcrMetrics bound to the supplied registry."""
+    registry = CollectorRegistry()
+    metrics = build_metrics(registry)
+
+    metric_names = {m.name for m in registry.collect()}
+    expected = {
+        "ocr_jobs",
+        "ocr_pages",
+        "ocr_skipped_pages",
+        "ocr_words",
+        "ocr_illegible_words",
+        "ocr_processing_seconds",
+        "ocr_training_runs",
+        "ocr_model_accuracy",
+        "ocr_models_ready",
+    }
+    assert expected <= metric_names, f"missing: {expected - metric_names}"
+
+    # A second registry yields a separate container — no shared state.
+    other_metrics = build_metrics(CollectorRegistry())
+    assert metrics is not other_metrics
+
+
+async def _drive_ocr(client: AsyncClient, *, script_type: str) -> None:
+    """Helper — fires /ocr with a single mocked page and asserts a 200."""
+    response = await client.post("/ocr", json={
+        "pdfUrl": "http://minio/doc.pdf",
+        "scriptType": script_type,
+        "language": "de",
+    })
+    assert response.status_code == 200, response.text
+
+
+@pytest.mark.asyncio
+async def test_ocr_jobs_total_incremented_with_kraken_engine_label_for_kurrent(fresh_metrics):
+    """A /ocr call with HANDWRITING_KURRENT increments engine=kraken."""
+    mock_images = [Image.new("RGB", (100, 100))]
+    mock_blocks = [{"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0,
+                    "polygon": None, "text": "hi", "words": []}]
+
+    with patch("main.correct_text", side_effect=lambda t: t), \
+         patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \
+         patch("main.preprocess_page", side_effect=lambda img: img), \
+         patch("main.kraken_engine.is_available", return_value=True), \
+         patch("main.kraken_engine.extract_blocks", return_value=mock_blocks):
+        async with ocr_client() as client:
+            await _drive_ocr(client, script_type="HANDWRITING_KURRENT")
+
+    value = fresh_metrics.ocr_jobs_total.labels(
+        engine="kraken", script_type="HANDWRITING_KURRENT"
+    )._value.get()
+    assert value == 1.0
+
+
+@pytest.mark.asyncio
+async def test_ocr_jobs_total_incremented_with_surya_engine_label_for_typewriter(fresh_metrics):
+    """A /ocr call with TYPEWRITER increments engine=surya."""
+    mock_images = [Image.new("RGB", (100, 100))]
+    mock_blocks = [{"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0,
+                    "polygon": None, "text": "hi", "words": []}]
+
+    with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \
+         patch("main.preprocess_page", side_effect=lambda img: img), \
+         patch("main.surya_engine.extract_blocks", return_value=mock_blocks):
+        async with ocr_client() as client:
+            await _drive_ocr(client, script_type="TYPEWRITER")
+
+    value = fresh_metrics.ocr_jobs_total.labels(
+        engine="surya", script_type="TYPEWRITER"
+    )._value.get()
+    assert value == 1.0
+
+
+@pytest.mark.asyncio
+async def test_ocr_pages_total_incremented_once_per_page_in_stream(fresh_metrics):
+    """The /ocr/stream generator increments ocr_pages_total per successful page."""
+    mock_images = [Image.new("RGB", (100, 100)) for _ in range(3)]
+    mock_blocks = [{"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0,
+                    "polygon": None, "text": "hi", "words": []}]
+
+    with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \
+         patch("main.preprocess_page", side_effect=lambda img: img), \
+         patch("main.surya_engine.extract_page_blocks", return_value=mock_blocks):
+        async with ocr_client() as client:
+            async with client.stream("POST", "/ocr/stream", json={
+                "pdfUrl": "http://minio/doc.pdf",
+                "scriptType": "TYPEWRITER",
+                "language": "de",
+            }) as response:
+                assert response.status_code == 200
+                # Drain the stream so all per-page increments fire.
+                async for _ in response.aiter_lines():
+                    pass
+
+    value = fresh_metrics.ocr_pages_total.labels(engine="surya")._value.get()
+    assert value == 3.0
+
+
+@pytest.mark.asyncio
+async def test_ocr_skipped_pages_total_incremented_when_engine_raises_for_a_page(fresh_metrics):
+    """When the engine raises on a page, ocr_skipped_pages_total bumps and the stream finishes."""
+    mock_images = [Image.new("RGB", (100, 100)) for _ in range(2)]
+    good_blocks = [{"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0,
+                    "polygon": None, "text": "ok", "words": []}]
+
+    call_count = {"n": 0}
+
+    def extract_side_effect(*args, **kwargs):
+        call_count["n"] += 1
+        if call_count["n"] == 1:
+            raise RuntimeError("synthetic engine failure")
+        return good_blocks
+
+    with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \
+         patch("main.preprocess_page", side_effect=lambda img: img), \
+         patch("main.surya_engine.extract_page_blocks", side_effect=extract_side_effect):
+        async with ocr_client() as client:
+            async with client.stream("POST", "/ocr/stream", json={
+                "pdfUrl": "http://minio/doc.pdf",
+                "scriptType": "TYPEWRITER",
+                "language": "de",
+            }) as response:
+                assert response.status_code == 200
+                saw_error = False
+                async for line in response.aiter_lines():
+                    if line and '"type": "error"' in line:
+                        saw_error = True
+                assert saw_error
+
+    assert fresh_metrics.ocr_skipped_pages_total._value.get() == 1.0
+    # The second page still succeeds.
+    assert fresh_metrics.ocr_pages_total.labels(engine="surya")._value.get() == 1.0
+
+
+@pytest.mark.asyncio
+async def test_ocr_words_and_illegible_words_total_sum_across_blocks(fresh_metrics):
+    """Counters reflect totals summed over every block in the request.
+
+    Threshold defaults to THRESHOLD_DEFAULT (0.3) for non-Kurrent scripts. Two
+    blocks: 3 words above + 2 words below threshold across blocks.
+    """
+    mock_images = [Image.new("RGB", (100, 100))]
+    mock_blocks = [
+        {"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0,
+         "polygon": None, "text": "ignored",
+         "words": [{"text": "Lieber", "confidence": 0.9},
+                   {"text": "Freund", "confidence": 0.1}]},
+        {"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0,
+         "polygon": None, "text": "ignored",
+         "words": [{"text": "Gruss", "confidence": 0.8},
+                   {"text": "verschmiert", "confidence": 0.05},
+                   {"text": "Karl", "confidence": 0.95}]},
+    ]
+
+    with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \
+         patch("main.preprocess_page", side_effect=lambda img: img), \
+         patch("main.surya_engine.extract_blocks", return_value=mock_blocks):
+        async with ocr_client() as client:
+            await _drive_ocr(client, script_type="TYPEWRITER")
+
+    assert fresh_metrics.ocr_words_total._value.get() == 5.0
+    assert fresh_metrics.ocr_illegible_words_total._value.get() == 2.0
+
+
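+def _histogram_count_sum(histogram, **labels) -> tuple[float, float]:
+    """Return ``(sum, count)`` for one label set of a prometheus_client Histogram.
+
+    Note the order: despite the helper's name, the sum comes first. The count
+    is derived by adding up the child's per-bucket counters, which
+    prometheus_client stores non-cumulatively.
+    """
+    child = histogram.labels(**labels)
+    return child._sum.get(), sum(b.get() for b in child._buckets)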
+
+
+@pytest.mark.asyncio
+async def test_ocr_processing_seconds_histogram_observed_per_page_in_stream(fresh_metrics):
+    """The streaming generator observes ocr_processing_seconds once per page."""
+    mock_images = [Image.new("RGB", (100, 100)) for _ in range(2)]
+    mock_blocks = [{"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0,
+                    "polygon": None, "text": "ok", "words": []}]
+
+    with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \
+         patch("main.preprocess_page", side_effect=lambda img: img), \
+         patch("main.surya_engine.extract_page_blocks", return_value=mock_blocks):
+        async with ocr_client() as client:
+            async with client.stream("POST", "/ocr/stream", json={
+                "pdfUrl": "http://minio/doc.pdf",
+                "scriptType": "TYPEWRITER",
+                "language": "de",
+            }) as response:
+                assert response.status_code == 200
+                async for _ in response.aiter_lines():
+                    pass
+
+    sum_seconds, count = _histogram_count_sum(
+        fresh_metrics.ocr_processing_seconds, engine="surya"
+    )
+    assert count == 2.0
+    assert sum_seconds >= 0.0
+
+
+@pytest.mark.asyncio
+async def test_ocr_training_runs_total_incremented_with_recognition_success_label(fresh_metrics):
+    """/train success increments ocr_training_runs_total{kind=recognition, outcome=success}."""
+    async def fake_to_thread(func, *args, **kwargs):
+        return _fake_training_result()
+
+    with patch("main.TRAINING_TOKEN", "secret-token"), \
+         patch("main._models_ready", True), \
+         patch("main.asyncio.to_thread", side_effect=fake_to_thread):
+        async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
+            response = await client.post(
+                "/train",
+                files={"file": ("training.zip", _minimal_zip(), "application/zip")},
+                headers={"X-Training-Token": "secret-token"},
+            )
+
+    assert response.status_code == 200
+    assert fresh_metrics.ocr_training_runs_total.labels(
+        kind="recognition", outcome="success"
+    )._value.get() == 1.0
+
+
+@pytest.mark.asyncio
+async def test_ocr_training_runs_total_incremented_with_recognition_error_label(fresh_metrics):
+    """When ketos exits non-zero, the error counter bumps and the exception propagates.
+
+    Uses the narrowest available seam — `subprocess.run` returning a failing
+    CompletedProcess — instead of stubbing the asyncio.to_thread boundary,
+    so the test exercises the real _run_training error path.
+    """
+    from subprocess import CompletedProcess
+
+    failing_proc = CompletedProcess(
+        args=["ketos"], returncode=1, stdout="", stderr="synthetic ketos failure"
+    )
+
+    with patch("main.TRAINING_TOKEN", "secret-token"), \
+         patch("main._models_ready", True), \
+         patch("main.subprocess.run", return_value=failing_proc):
+        transport = ASGITransport(app=app, raise_app_exceptions=False)
+        async with AsyncClient(transport=transport, base_url="http://test") as client:
+            response = await client.post(
+                "/train",
+                files={"file": ("training.zip", _minimal_zip(), "application/zip")},
+                headers={"X-Training-Token": "secret-token"},
+            )
+
+    assert response.status_code == 500
+    assert fresh_metrics.ocr_training_runs_total.labels(
+        kind="recognition", outcome="error"
+    )._value.get() == 1.0
+
+
+@pytest.mark.asyncio
+async def test_ocr_training_runs_total_incremented_with_segmentation_success_label(fresh_metrics):
+    """/segtrain success increments ocr_training_runs_total{kind=segmentation, outcome=success}."""
+    async def fake_to_thread(func, *args, **kwargs):
+        return _fake_training_result(accuracy=0.83)
+
+    with patch("main.TRAINING_TOKEN", "secret-token"), \
+         patch("main._models_ready", True), \
+         patch("main.asyncio.to_thread", side_effect=fake_to_thread):
+        async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
+            response = await client.post(
+                "/segtrain",
+                files={"file": ("training.zip", _minimal_zip(), "application/zip")},
+                headers={"X-Training-Token": "secret-token"},
+            )
+
+    assert response.status_code == 200
+    assert fresh_metrics.ocr_training_runs_total.labels(
+        kind="segmentation", outcome="success"
+    )._value.get() == 1.0
+
+
+@pytest.mark.asyncio
+async def test_ocr_training_runs_total_incremented_with_recognition_success_label_for_train_sender(fresh_metrics):
+    """/train-sender success increments ocr_training_runs_total{kind=recognition, outcome=success}."""
+    async def fake_to_thread(func, *args, **kwargs):
+        return _fake_training_result()
+
+    with patch("main.TRAINING_TOKEN", "secret-token"), \
+         patch("main._models_ready", True), \
+         patch("main.asyncio.to_thread", side_effect=fake_to_thread):
+        async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
+            response = await client.post(
+                "/train-sender",
+                files={"file": ("training.zip", _minimal_zip(), "application/zip")},
+                data={"output_model_path": "/app/models/sender_test.mlmodel"},
+                headers={"X-Training-Token": "secret-token"},
+            )
+
+    assert response.status_code == 200, response.text
+    assert fresh_metrics.ocr_training_runs_total.labels(
+        kind="recognition", outcome="success"
+    )._value.get() == 1.0
+
+
+@pytest.mark.asyncio
+async def test_ocr_model_accuracy_gauge_stays_default_when_training_returns_no_accuracy(fresh_metrics):
+    """When the runner returns accuracy=None, ocr_model_accuracy must remain at its default 0."""
+    async def fake_to_thread(func, *args, **kwargs):
+        return {"loss": None, "accuracy": None, "cer": None, "epochs": 5}
+
+    with patch("main.TRAINING_TOKEN", "secret-token"), \
+         patch("main._models_ready", True), \
+         patch("main.asyncio.to_thread", side_effect=fake_to_thread):
+        async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
+            response = await client.post(
+                "/train",
+                files={"file": ("training.zip", _minimal_zip(), "application/zip")},
+                headers={"X-Training-Token": "secret-token"},
+            )
+
+    assert response.status_code == 200
+    # Gauge was never .set() — accessing the label child still creates it with default 0.0.
+    assert fresh_metrics.ocr_model_accuracy.labels(
+        kind="recognition"
+    )._value.get() == 0.0
+
+
+@pytest.mark.asyncio
+async def test_ocr_model_accuracy_gauge_set_per_kind_after_successful_training(fresh_metrics):
+    """After /train and /segtrain succeed, ocr_model_accuracy{kind=...} reflects the result."""
+    recognition_accuracy = 0.917
+    segmentation_accuracy = 0.834
+
+    async def fake_recognition_to_thread(func, *args, **kwargs):
+        return _fake_training_result(accuracy=recognition_accuracy)
+
+    async def fake_segmentation_to_thread(func, *args, **kwargs):
+        return _fake_training_result(accuracy=segmentation_accuracy)
+
+    with patch("main.TRAINING_TOKEN", "secret-token"), \
+         patch("main._models_ready", True):
+        async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as client:
+            with patch("main.asyncio.to_thread", side_effect=fake_recognition_to_thread):
+                rec_resp = await client.post(
+                    "/train",
+                    files={"file": ("training.zip", _minimal_zip(), "application/zip")},
+                    headers={"X-Training-Token": "secret-token"},
+                )
+            assert rec_resp.status_code == 200
+            with patch("main.asyncio.to_thread", side_effect=fake_segmentation_to_thread):
+                seg_resp = await client.post(
+                    "/segtrain",
+                    files={"file": ("training.zip", _minimal_zip(), "application/zip")},
+                    headers={"X-Training-Token": "secret-token"},
+                )
+            assert seg_resp.status_code == 200
+
+    assert fresh_metrics.ocr_model_accuracy.labels(kind="recognition")._value.get() == pytest.approx(recognition_accuracy)
+    assert fresh_metrics.ocr_model_accuracy.labels(kind="segmentation")._value.get() == pytest.approx(segmentation_accuracy)
+
+
+def test_ocr_models_ready_gauge_defaults_to_zero():
+    """A freshly-built OcrMetrics has ocr_models_ready=0 before lifespan runs."""
+    metrics = build_metrics(CollectorRegistry())
+    assert metrics.ocr_models_ready._value.get() == 0.0
+
+
+@pytest.mark.asyncio
+async def test_ocr_models_ready_gauge_is_one_after_lifespan_startup(fresh_metrics):
+    """The lifespan flips ocr_models_ready to 1 once load_models / load_spell_checker return.
+
+    ASGITransport does not run lifespan by default, so the lifespan context
+    manager is driven directly to exercise the startup code path.
+    """
+    assert fresh_metrics.ocr_models_ready._value.get() == 0.0
+    with patch("main.kraken_engine.load_models"), \
+         patch("main.load_spell_checker"):
+        async with app.router.lifespan_context(app):
+            assert fresh_metrics.ocr_models_ready._value.get() == 1.0
+
+
+@pytest.mark.asyncio
+async def test_ocr_processing_seconds_histogram_observed_per_page_in_guided_stream(fresh_metrics):
+    """The guided streaming generator observes ocr_processing_seconds once per page."""
+    mock_images = [Image.new("RGB", (100, 100)) for _ in range(2)]
+    regions = [
+        {"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 0.5, "height": 0.5, "annotationId": "a1"},
+        {"pageNumber": 2, "x": 0.0, "y": 0.0, "width": 1.0, "height": 1.0, "annotationId": "a2"},
+    ]
+
+    with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \
+         patch("main.preprocess_page", side_effect=lambda img: img), \
+         patch("main.surya_engine.extract_region_text", return_value="text"):
+        async with ocr_client() as client:
+            async with client.stream("POST", "/ocr/stream", json={
+                "pdfUrl": "http://minio/doc.pdf",
+                "scriptType": "TYPEWRITER",
+                "language": "de",
+                "regions": regions,
+            }) as response:
+                assert response.status_code == 200
+                async for _ in response.aiter_lines():
+                    pass
+
+    sum_seconds, count = _histogram_count_sum(
+        fresh_metrics.ocr_processing_seconds, engine="surya"
+    )
+    assert count == 2.0
+    assert sum_seconds >= 0.0
+
+
+@pytest.mark.asyncio
+async def test_ocr_processing_seconds_histogram_excludes_spell_check_time_in_guided_stream(fresh_metrics):
+    """The guided observation must time engine work only, not the spell-check pass.
+
+    This uses a wall-clock bound rather than a structural
+    `patch("main.time.monotonic")`: the patched attribute is the *global*
+    `time.monotonic`, which httpx and asyncio also consume, so they exhaust a
+    deterministic sequence before the request reaches the engine loop. The
+    bound is sized against the failure mode, not the noise floor: spell-check
+    sleeps 0.05s × 2 regions = 0.1s, so a timer that accidentally wrapped
+    `correct_text` would observe >= 0.1s. The 0.09s ceiling catches that bug
+    while leaving ~90ms of slack for slow CI runners (engine work is
+    instantaneous under the mock).
+    """
+    mock_images = [Image.new("RGB", (100, 100))]
+    regions = [
+        {"pageNumber": 1, "x": 0.0, "y": 0.0, "width": 0.5, "height": 0.5, "annotationId": "a1"},
+        {"pageNumber": 1, "x": 0.5, "y": 0.0, "width": 0.5, "height": 0.5, "annotationId": "a2"},
+    ]
+
+    def slow_correct(text):
+        import time as _time
+        _time.sleep(0.05)
+        return text
+
+    with patch("main._download_and_convert_pdf", new_callable=AsyncMock, return_value=mock_images), \
+         patch("main.preprocess_page", side_effect=lambda img: img), \
+         patch("main.kraken_engine.is_available", return_value=True), \
+         patch("main.kraken_engine.extract_region_text", return_value="text"), \
+         patch("main.correct_text", side_effect=slow_correct):
+        async with ocr_client() as client:
+            async with client.stream("POST", "/ocr/stream", json={
+                "pdfUrl": "http://minio/doc.pdf",
+                "scriptType": "HANDWRITING_KURRENT",
+                "language": "de",
+                "regions": regions,
+            }) as response:
+                assert response.status_code == 200
+                async for _ in response.aiter_lines():
+                    pass
+
+    sum_seconds, _ = _histogram_count_sum(
+        fresh_metrics.ocr_processing_seconds, engine="kraken"
+    )
+    assert sum_seconds < 0.09, f"timing must exclude spell-check; got sum={sum_seconds}"
+
+
+@pytest.mark.asyncio
+async def test_ocr_jobs_total_not_incremented_when_pdf_download_fails_in_stream(fresh_metrics):
+    """If `_download_and_convert_pdf` raises, ocr_jobs_total is NOT incremented.
+
+    Mirrors the /ocr endpoint's semantics: the counter only records jobs that
+    actually started OCR work, not failed downloads.
+    """
+    async def fail_download(url):
+        raise RuntimeError("synthetic download failure")
+
+    with patch("main._download_and_convert_pdf", new=fail_download):
+        async with ocr_client(raise_app_exceptions=False) as client:
+            response = await client.post("/ocr/stream", json={
+                "pdfUrl": "http://minio/doc.pdf",
+                "scriptType": "TYPEWRITER",
+                "language": "de",
+            })
+
+    assert response.status_code == 500
+    assert fresh_metrics.ocr_jobs_total.labels(
+        engine="surya", script_type="TYPEWRITER"
+    )._value.get() == 0.0
+
+
+def test_uvicorn_access_log_filter_fails_open_on_short_or_missing_args():
+    """The filter must default-allow records when args is None or shorter than expected.
+
+    Locks in fail-open behavior: if uvicorn ever changes its format we keep
+    forwarding records to the handler rather than silently dropping logs.
+    """
+    import logging as _logging
+    from main import MetricsPathFilter
+
+    filt = MetricsPathFilter()
+    none_record = _logging.LogRecord(
+        name="uvicorn.access", level=_logging.INFO, pathname="", lineno=0,
+        msg="some message", args=None, exc_info=None,
+    )
+    short_record = _logging.LogRecord(
+        name="uvicorn.access", level=_logging.INFO, pathname="", lineno=0,
+        msg="%s %s", args=("a", "b"), exc_info=None,
+    )
+
+    assert filt.filter(none_record) is True
+    assert filt.filter(short_record) is True
+
+
+def test_uvicorn_access_log_filter_skips_metrics_path():
+    """The MetricsPathFilter drops uvicorn.access records for /metrics and /health.
+
+    Records for other paths, such as /ocr, must still pass through.
+    """
+    import logging as _logging
+    from main import MetricsPathFilter
+
+    filt = MetricsPathFilter()
+    metrics_record = _logging.LogRecord(
+        name="uvicorn.access", level=_logging.INFO, pathname="", lineno=0,
+        msg='%s - "%s %s HTTP/%s" %d',
+        args=("127.0.0.1:1234", "GET", "/metrics", "1.1", 200),
+        exc_info=None,
+    )
+    health_record = _logging.LogRecord(
+        name="uvicorn.access", level=_logging.INFO, pathname="", lineno=0,
+        msg='%s - "%s %s HTTP/%s" %d',
+        args=("127.0.0.1:1234", "GET", "/health", "1.1", 200),
+        exc_info=None,
+    )
+    ocr_record = _logging.LogRecord(
+        name="uvicorn.access", level=_logging.INFO, pathname="", lineno=0,
+        msg='%s - "%s %s HTTP/%s" %d',
+        args=("127.0.0.1:1234", "POST", "/ocr", "1.1", 200),
+        exc_info=None,
+    )
+
+    assert filt.filter(metrics_record) is False
+    assert filt.filter(health_record) is False
+    assert filt.filter(ocr_record) is True
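
The two filter tests above pin down a contract: fail open on unexpected `args`, drop records for quiet paths, and pass everything else. The source of `main.MetricsPathFilter` is not shown in this diff, so the following is only a minimal sketch of a filter that would satisfy the same assertions. `QuietPathFilter` and its `quiet_paths` parameter are hypothetical names. The five-element `args` layout `(client, method, path, http_version, status)` is taken from the records the tests build, and the `/metrics` plus `/health` default is inferred from their assertions.

```python
import logging


class QuietPathFilter(logging.Filter):
    """Hypothetical stand-in for main.MetricsPathFilter (real source not shown)."""

    def __init__(self, quiet_paths=("/metrics", "/health")):
        super().__init__()
        # Assumption: these are the paths whose access-log lines should be dropped.
        self.quiet_paths = frozenset(quiet_paths)

    def filter(self, record: logging.LogRecord) -> bool:
        args = record.args
        # Fail open: if args is missing or too short to hold a path at index 2,
        # let the record through rather than silently dropping it.
        if not isinstance(args, tuple) or len(args) < 3:
            return True
        # Ignore any query string so "/metrics?x=1" is treated like "/metrics".
        path = str(args[2]).split("?", 1)[0]
        return path not in self.quiet_paths


def make_access_record(args, msg='%s - "%s %s HTTP/%s" %d') -> logging.LogRecord:
    """Build a record shaped like the ones constructed in the tests."""
    return logging.LogRecord(
        name="uvicorn.access", level=logging.INFO, pathname="", lineno=0,
        msg=msg, args=args, exc_info=None,
    )


if __name__ == "__main__":
    filt = QuietPathFilter()
    for path in ("/metrics", "/health", "/ocr"):
        rec = make_access_record(("127.0.0.1:1234", "GET", path, "1.1", 200))
        print(path, "kept" if filt.filter(rec) else "dropped")
```

Such a filter would typically be attached with `logging.getLogger("uvicorn.access").addFilter(...)`. It only affects what reaches the log handlers, not whether the endpoint is reachable.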
Author	SHA1	Message	Date
Marcel	bcba4dab80	ci(observability): inject GRAFANA_DB_PASSWORD from Gitea secrets Wires the new GRAFANA_DB_PASSWORD secret through the deploy pipeline: - docker-compose.prod.yml: backend env now passes GRAFANA_DB_PASSWORD through so Flyway V68 can resolve the ${grafanaDbPassword} placeholder in production and staging (it already worked in local dev via docker-compose.yml). - release.yml + nightly.yml: declare GRAFANA_DB_PASSWORD as a required Gitea secret, write it into .env.production / .env.staging (consumed by archive-backend), and into /opt/familienarchiv/obs-secrets.env (consumed by obs-grafana's PostgreSQL datasource). Operator action before the next deploy: add a GRAFANA_DB_PASSWORD value to the Gitea repo secrets (openssl rand -hex 32). Refs #651. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 20:21:27 +02:00
Marcel	a4a3e3b105	docs(architecture): show Grafana→PostgreSQL link for PO Overview dashboard Adds the new read-only connection from Grafana to archive-db (via the grafana_reader role) introduced by the PO Overview dashboard. Refs #651. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 20:21:05 +02:00
Marcel	cac00ed711	docs(deployment): document GRAFANA_DB_PASSWORD across env tables Adds GRAFANA_DB_PASSWORD to the observability-stack env-var table, the Gitea secrets table, and the obs-secrets.env reference, so operators see the variable wherever they look for related secrets. Refs #651. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 20:21:05 +02:00
Marcel	637829cebc	feat(observability): add PO Overview Grafana dashboard Provisioned dashboard for the product owner's weekly check-in: system health (Prometheus + Loki), user activity (PostgreSQL audit_log), archive progress (PostgreSQL transcription_blocks + audit_log), and OCR quality (Prometheus ocr-service metrics). Default range 7d, manual refresh, thresholds per the issue spec. Refs #651. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 20:21:05 +02:00
Marcel	4e636b3253	chore(observability): document GRAFANA_DB_PASSWORD in env files .env.example: declare GRAFANA_DB_PASSWORD with an openssl rand -hex 32 hint so a missing value fails loudly (NFR-OPS-02). obs.env: add a comment explaining that the real value comes from CI's obs-secrets.env, matching the pattern used for other secrets in that file. Refs #651. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 20:21:05 +02:00
Marcel	ab2708e63b	feat(observability): provision Grafana PostgreSQL datasource Adds a read-only datasource pointing at archive-db using the grafana_reader role (provisioned by Flyway V68). The password is interpolated from the GRAFANA_DB_PASSWORD env var passed to obs-grafana, and the connection is locked to editable: false so the credential cannot be inspected via the UI. sslmode=disable is intentional: traffic stays inside archiv-net. Refs #651. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 20:21:05 +02:00
Marcel	ed8e9576e4	feat(observability): pass GRAFANA_DB_PASSWORD to archive-backend Flyway runs inside the backend container at startup; V68's ${grafanaDbPassword} placeholder is resolved from this env var. Refs #651. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 20:21:05 +02:00
Marcel	0958df7768	feat(observability): wire obs-grafana to archive-db and inject GRAFANA_DB_PASSWORD obs-grafana now joins archiv-net so it can resolve archive-db:5432 for the PO Overview dashboard's PostgreSQL datasource, and receives GRAFANA_DB_PASSWORD so provisioning can interpolate it into the datasource config. Refs #651. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 20:21:05 +02:00
Marcel	f4ffd8acee	feat(observability): create grafana_reader read-only DB role Add Flyway V68 migration that provisions a read-only PostgreSQL role scoped to audit_log, documents, and transcription_blocks. The role's password is injected via the new ${grafanaDbPassword} Flyway placeholder, which FlywayConfig reads from the GRAFANA_DB_PASSWORD env var. The migration is idempotent: CREATE on first run, ALTER on re-run. Adds a Testcontainers integration test asserting positive grants on the three intended tables and a negative grant on app_users (NFR-SEC-01). Refs #651. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 20:21:05 +02:00
Marcel	0801da8df0	docs(ocr): explain why two metrics tests skip fresh_metrics fixture Sara's cycle-2 S2: clarify the latent (but not actual) cross-test state risk on the two metrics tests that hit the global REGISTRY instead of the per-test fresh_metrics fixture. Migrating them would actually break them — the /metrics endpoint is served by prometheus-fastapi-instrumentator which binds to the default REGISTRY at app-construction time, and the http_requests_total assertion only finds counters on that global registry. Both tests already assert response shape only (status code, content-type substring, body substrings), not numeric values, so the shared-registry caveat is documented for future readers rather than treated as a bug to fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 17:23:32 +02:00
Marcel	e0e1578bdd	test(ocr): widen spell-check exclusion bound to 0.09s with rationale Sara's cycle-2 S1: the wall-clock assertion at < 0.05s could trip on a slow CI runner under load even when the timer correctly excludes spell-check. Sara's preferred structural fix (patch main.time.monotonic with a deterministic sequence) proved awkward — the patched attribute is the global time.monotonic which httpx and asyncio consume, exhausting the sequence before the request reaches the engine loop. Take the documented fallback: widen the bound to 0.09s and explain why. The failure mode the test guards against (spell-check inside the timer) would add 0.1s (2 × 0.05s sleep), so 0.09s catches the bug while leaving ~90ms of headroom for slow CI runners. Verified red→green by temporarily moving correct_text inside the timer block: bound trips at 0.101s; the fixed code reads ~0.001s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 17:22:49 +02:00
Marcel	2df71beb7e	docs: add ADR-023 and glossary entries for OCR metrics ADR-023 captures why prometheus-fastapi-instrumentator was chosen, the build_metrics(registry) factory pattern, and the test rebinding seam. The glossary gains four ops-aligned terms — illegible word, models-ready gauge, recognition vs segmentation accuracy — so the metrics documentation in OBSERVABILITY.md has a vocabulary to lean on. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 17:06:44 +02:00
Marcel	2dbb3c37b4	docs(observability): document ocr metrics, scrape edge, and access-log filter - L2 container diagram now shows the Prometheus -> ocr:8000 scrape edge (plus the previously-undrawn Prometheus -> backend edge for symmetry). - OBSERVABILITY.md gains a full ocr_* metrics table with labels, units, and the canonical example queries from issue #652. - New "Internal-only endpoints" subsection captures the unauthenticated /metrics caveat and provides the Caddy block snippet for the case where the service ever gets a host port. - Explicit note that MetricsPathFilter only quiets uvicorn stdout, and the OCR metrics must never carry PII or document content. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 17:05:27 +02:00
Marcel	67368b4413	docs(ocr): annotate metrics binding + /metrics exposure + pin client Three small drops that pay back later: - Note that main.metrics is import-time bound and tests must monkeypatch `main.metrics`, not the registry. - Flag the /metrics endpoint as unauthenticated and cross-link the Caddy-block snippet in docs/OBSERVABILITY.md. - Pin prometheus-client to the exact 0.25.0 patch version already resolved by prometheus-fastapi-instrumentator 7.0.0, so an upstream bump cannot silently slip in. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 17:04:28 +02:00
Marcel	ddf6cf4cbc	test(ocr): collapse shared client setup into ocr_client helper Each metrics test was repeating the same five-line block — patch kraken_engine.load_models, patch load_spell_checker, instantiate the AsyncClient, force _models_ready True, restore it. Lift the lot into a single async context manager so each test body shrinks to its real arrange / act / assert intent. Tests that drive the lifespan directly (models_ready gauge) or stub asyncio.to_thread for /train (which already patches _models_ready) stay unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 17:03:29 +02:00
Marcel	df952861c4	refactor(ocr): extract _record_training for shared metric bookkeeping The /train, /train-sender, and /segtrain endpoints each duplicated the same eight-line try/except + counter + gauge block around the asyncio.to_thread call. Lift it into _record_training(runner, kind), which accepts a sync- or async-returning callable for flexibility. Each endpoint now ends with a single return line. Behaviour preserved — status codes, error propagation, and metric labels stay identical. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:58:40 +02:00
Marcel	22a5ee816a	refactor(ocr): extract _observe_block_words for word counter sites The two block-iteration loops (/ocr and /ocr/stream's standard generator) both ran the same word-total and illegible-word increments. Lift them into a single helper so each call site becomes one line and the counter intent reads cleanly. Pure refactor — no behavior change, tests stay green. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:57:18 +02:00
Marcel	0179e93a4b	test(ocr): narrow training error test to subprocess.run seam The asyncio.to_thread patch stubbed out the entire _run_training call, hiding the real error path. Replacing it with a failing CompletedProcess from subprocess.run exercises the actual ketos-failed branch and keeps the test's intent — error counter bumps, 500 surfaces — intact. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:55:14 +02:00
Marcel	0fc0cbcffd	test(ocr): lock in MetricsPathFilter fail-open behavior If uvicorn's access log format ever changes (args=None, or shorter than 3 elements), the filter must keep forwarding records rather than silently dropping them. Two extra LogRecords cover both edge cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:54:24 +02:00
Marcel	549cb15845	test(ocr): cover /train-sender counter and accuracy=None gauge default Two regression tests: - /train-sender hitting the success path bumps the recognition counter (previously only /train and /segtrain were covered). - A successful run whose result.accuracy is None must not call set() on ocr_model_accuracy — the gauge stays at its default 0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:53:48 +02:00
Marcel	74ddf16b01	feat(ocr): time only engine work in guided stream histogram Previously the guided generator's page_started timer wrapped the entire region loop including the synchronous correct_text() call, inflating ocr_processing_seconds with spell-check latency. Sum the per-region engine.extract_region_text durations instead so the histogram matches the unguided stream's "engine only" semantic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:53:04 +02:00
Marcel	ebaedb1af0	test(ocr): assert ocr_jobs_total stays zero when stream download fails Locks in the post-download placement of the counter increment so a regression that moves it back above _download_and_convert_pdf would fail. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:51:23 +02:00
Marcel	e75ac8ec45	ops(observability): drop TODO from ocr-service scrape job in prometheus.yml The TODO was a placeholder for this work — the OCR service now exposes /metrics so the target will flip from DOWN to UP on next image rebuild. Refs #652 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:16:51 +02:00
Marcel	525f091b3a	feat(ocr): suppress uvicorn access logs for /metrics and /health Adds a logging.Filter on uvicorn.access that drops records whose request path is /metrics or /health. Each is hit on a tight schedule (Prometheus scrape interval and Docker healthcheck), so unfiltered they dominate the access log without carrying any information about real traffic. Refs #652 (Nora's recommendation) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:16:14 +02:00
Marcel	d6abf990c7	feat(ocr): flip ocr_models_ready to 1 once the lifespan startup finishes Mirrors the existing _models_ready bool so Prometheus has a time-series liveness/readiness signal for future alerting rules (e.g. ocr_models_ready < 1 for 2m). Refs #652 (AC7) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:15:11 +02:00
Marcel	77d59c5d83	test(ocr): assert ocr_model_accuracy gauge is set per kind on success Hits /train then /segtrain through the same test, each with a distinct mocked accuracy, and asserts the labelled gauges reflect the two values. Locks down the kind-label separation between recognition and segmentation accuracy (decision #2). Refs #652 (AC6) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:13:05 +02:00
Marcel	6c2b9af10b	feat(ocr): record training runs in ocr_training_runs_total per kind and outcome Wraps the await asyncio.to_thread(_run_*) calls in /train, /train-sender, and /segtrain with try/except. Recognition training (/train, /train-sender) shares kind="recognition"; /segtrain uses kind="segmentation". The ocr_model_accuracy gauge is set per kind on success. Refs #652 (AC6, decision #2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:12:26 +02:00
Marcel	2e3744d9ef	feat(ocr): observe ocr_processing_seconds around engine.to_thread calls Wraps every asyncio.to_thread(engine.extract_*) call with time.monotonic() deltas in /ocr (per document) and in both /ocr/stream generators (per page). Streaming buckets are the useful operational signal; the non-streaming observation is a bonus. Refs #652 (AC5) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:09:25 +02:00
Marcel	131ed336bc	feat(ocr): count words and illegible words at the OCR call sites Walks block["words"] before apply_confidence_markers strips the list, then increments ocr_words_total by len(words) and ocr_illegible_words_total by the count below threshold. Same pattern in both /ocr and /ocr/stream so the ratio illegible/words is a faithful quality signal across endpoints. Refs #652 (AC4) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:07:59 +02:00
Marcel	3fa3460dbf	feat(ocr): increment ocr_skipped_pages_total on per-page engine failure Bumps the counter in both /ocr/stream except blocks (standard and guided generators) so the existing skipped_pages local variable now also flows into Prometheus. Refs #652 (AC3b) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:06:50 +02:00
Marcel	79edb94558	feat(ocr): increment ocr_pages_total per successful page in stream Bumps the counter inside both the standard and guided /ocr/stream generators after a page yields its blocks, before the per-page json line is emitted. Also moves the ocr_jobs_total increment for /ocr/stream right after engine selection so the counter still fires when a page later errors out. Refs #652 (AC3a) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:05:36 +02:00
Marcel	52d8dc2b20	test(ocr): assert ocr_jobs_total label is engine=surya for typewriter Locks down AC2 for the non-Kurrent path. The same code branch in /ocr that sets engine_name from script_type now has explicit coverage for both HANDWRITING_KURRENT → kraken and TYPEWRITER → surya. Refs #652 (AC2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:04:20 +02:00
Marcel	696b71da5a	feat(ocr): increment ocr_jobs_total with engine and script_type labels Pick engine="kraken" for HANDWRITING_KURRENT, engine="surya" otherwise, then increment after the blocks have been extracted. Refs #652 (AC2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:03:37 +02:00
Marcel	f3e3545d06	feat(ocr): add metrics.py factory with test-scoped CollectorRegistry support Encapsulates every custom OCR metric in an OcrMetrics frozen dataclass and exposes a `build_metrics(registry)` factory. Production main.py binds against the default REGISTRY; tests construct a fresh CollectorRegistry per case and monkeypatch main.metrics, so counter values stay isolated between tests (decision #3 on issue #652, Option A). Refs #652 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:02:20 +02:00
Marcel	4bb6685edb	test(ocr): assert http_* metrics appear after an /ocr request Locks down AC1: prometheus-fastapi-instrumentator must keep auto-exposing http_requests_total and http_request_duration_seconds for application traffic, not just register the /metrics endpoint. Refs #652 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 16:00:33 +02:00
Marcel	18c93d4eaa	feat(ocr): expose /metrics endpoint via prometheus-fastapi-instrumentator Mount the instrumentator immediately after FastAPI app creation, excluding /health and /metrics from request metrics to keep http_requests_total focused on real application traffic. Refs #652 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-21 15:59:37 +02:00