# Observability Guide

> **Ops reference (starting the stack, env vars, CI wiring) → [DEPLOYMENT.md §4](./DEPLOYMENT.md#4-logs--observability).**
> This file is for developers: what signal lives where, how to reach it, and what to look for.

## Where to look for what

| I want to… | Go to |
|---|---|
| See the last N log lines from the backend | `docker compose logs --tail=100 backend` |
| Search logs by keyword across time | Grafana → Explore → Loki |
| Understand why an HTTP request failed | Grafana → Explore → Loki → filter by `traceId` → follow link to Tempo |
| See a full distributed trace (DB queries, HTTP calls) | Grafana → Explore → Tempo → search by service or trace ID |
| Check JVM heap / GC / thread count | Grafana → Dashboards → Spring Boot Observability |
| Check HTTP error rate or p95 latency | Grafana → Dashboards → Spring Boot Observability |
| Check host CPU / memory / disk | Grafana → Dashboards → Node Exporter Full |
| See grouped application errors with stack traces | GlitchTip |
| Check if the backend is healthy | `curl http://localhost:8081/actuator/health` (on the server) |
| Check what Prometheus is scraping | `curl http://localhost:9090/api/v1/targets` (on the server) |

## Access

| Tool | External URL | What it's for |
|---|---|---|
| Grafana | `https://grafana.archiv.raddatz.cloud` | Logs, metrics, and traces: the primary observability UI |
| GlitchTip | `https://glitchtip.archiv.raddatz.cloud` | Grouped errors with stack traces and release tracking |

Loki, Tempo, and Prometheus have no external URL. They are internal services, accessible only through Grafana (or via an SSH tunnel, see "Direct API access" below).

## Logs (Loki)

Logs reach Loki via Promtail, which reads all Docker container logs from the Docker socket and ships them with labels derived from Docker Compose metadata.

### Labels available in every log line

| Label | What it contains | Example |
|---|---|---|
| `job` | Compose service name | `backend`, `frontend`, `db` |
| `compose_service` | Same as `job` | `backend` |
| `compose_project` | Compose project name | `archiv-staging`, `archiv-production` |
| `container_name` | Docker container name | `archiv-staging-backend-1` |
| `filename` | Docker log source | `/var/lib/docker/containers/…` |

**Use `job` in LogQL queries.** It is stable across dev, staging, and production, whereas `container_name` changes between environments.

### Common LogQL queries in Grafana Explore

```logql
# All backend logs
{job="backend"}

# Backend ERROR and WARN lines only (regex line filter; `or` cannot join two log selectors)
{job="backend"} |~ "ERROR|WARN"

# All logs for a specific request (paste a traceId from a log line)
{job="backend"} |= "3fa85f64-5717-4562-b3fc-2c963f66afa6"

# Log lines containing a specific exception class
{job="backend"} |~ "DomainException|NullPointerException"

# Frontend logs
{job="frontend"}

# Database (slow query log, if enabled)
{job="db"}
```

### Log → Trace correlation

Spring Boot writes the active `traceId` into every log line when a request is being processed:

```
2026-05-16 ... INFO [Familienarchiv,3fa85f64...,1b2c3d4e] o.r.f.document.DocumentService : ...
```

In Grafana Explore → Loki, log lines with a `traceId` field show a **Tempo** link. Clicking it opens the full trace in Explore → Tempo without copying and pasting IDs.

This linking is configured in the Loki datasource provisioning via the `traceId` derived field regex. No manual setup is required.

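When working with exported log lines outside Grafana, the same correlation can be done by hand. The following is a minimal Python sketch, assuming the bracketed block always has the shape `[application,traceId,spanId]` as in the sample line above. The regex and the `extract_correlation` helper are illustrative and are not the derived-field regex from the Loki datasource provisioning; the IDs in the sample are made up.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape of the correlation block: [application,traceId,spanId].
# Lines logged outside a request have empty IDs and are treated as "no match".
CORRELATION = re.compile(
    r"\[(?P<app>[^,\]]*),(?P<trace_id>[^,\]]+),(?P<span_id>[^,\]]+)\]"
)


class Correlation(NamedTuple):
    application: str
    trace_id: str
    span_id: str


def extract_correlation(line: str) -> Optional[Correlation]:
    """Return the application, trace ID and span ID of a log line, if present."""
    match = CORRELATION.search(line)
    if match is None:
        return None
    return Correlation(match["app"], match["trace_id"], match["span_id"])


# Hypothetical log line with shortened IDs, for illustration only.
sample = (
    "2026-05-16 12:00:00 INFO [Familienarchiv,3fa85f64,1b2c3d4e] "
    "o.r.f.document.DocumentService : saved"
)
print(extract_correlation(sample))
```

The extracted `trace_id` can then be pasted into Grafana → Explore → Tempo, or used as a `|=` line filter in LogQL.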
## Traces (Tempo)

The backend sends traces to Tempo via OTLP HTTP (port 4318). Every inbound HTTP request and every JPA query produces a span. Spans are linked into traces by the propagated `traceId` header.

### Finding a trace in Grafana

**Option A (from a log line):**
1. Grafana → Explore → select *Loki* datasource
2. Query `{job="backend"}` and find the failing request
3. Click the **Tempo** link in the log line (appears when `traceId` is present)

**Option B (by service):**
1. Grafana → Explore → select *Tempo* datasource
2. Query type: **Search**
3. Service name: `familienarchiv-backend`
4. Filter by HTTP status, duration, or operation name as needed

**Option C (by trace ID):**
1. Grafana → Explore → select *Tempo* datasource
2. Query type: **TraceQL** or **Trace ID**
3. Paste the trace ID

### What each span type tells you

| Span name pattern | What it covers |
|---|---|
| `GET /api/documents`, `POST /api/documents` | Full HTTP request lifecycle (root span) |
| `SELECT archiv.*` | A single JPA/JDBC query inside that request |
| `HikariPool.getConnection` | Connection pool wait time |

A slow `SELECT` span inside an otherwise fast HTTP span usually points to a missing index or an inefficient query. A slow `HikariPool.getConnection` span indicates connection pool exhaustion.

### Sampling rate

- **Dev**: 100% of requests are traced (`management.tracing.sampling.probability: 1.0` in `application.yaml`)
- **Staging / Production**: 10% (`MANAGEMENT_TRACING_SAMPLING_PROBABILITY=0.1` in `docker-compose.prod.yml`)

To find a trace for a specific request in staging/production, either increase the sampling rate temporarily or trigger the request multiple times.

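How many times "multiple" is can be estimated with a little arithmetic. If each request is sampled independently with probability p (an assumption; it holds approximately when every attempt starts a new trace), the chance that at least one of n attempts is traced is 1 - (1 - p)^n. The sketch below solves that for n; `attempts_needed` is an illustrative helper, not part of the project.

```python
import math


def attempts_needed(sampling_probability: float, confidence: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - p)^n >= confidence.

    Assumes each request is sampled independently with probability p.
    """
    if not 0.0 < sampling_probability <= 1.0:
        raise ValueError("sampling_probability must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if sampling_probability == 1.0:
        return 1  # every request is traced
    return math.ceil(
        math.log(1.0 - confidence) / math.log(1.0 - sampling_probability)
    )


# With the 10% staging/production rate and a 95% target:
print(attempts_needed(0.1))  # 29
```

At a 10% sampling rate, roughly 29 attempts are needed to be 95% sure of capturing one trace, which is why temporarily raising the sampling rate is often the quicker route.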
## Metrics (Prometheus → Grafana)

Prometheus scrapes two application targets every 15 s (the host and container exporters are listed under "Signal summary"):

```
Target:  backend:8081/actuator/prometheus
Labels:  job="spring-boot", application="Familienarchiv"

Target:  ocr:8000/metrics
Labels:  job="ocr-service"
```

All Spring Boot metrics carry the `application="Familienarchiv"` tag, which is how the Grafana Spring Boot Observability dashboard (ID 17175) filters to this service.

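To confirm that both targets are actually being scraped, the JSON returned by `curl http://localhost:9090/api/v1/targets` can be filtered for anything that is not `up`. The following is a small sketch of that check: the `unhealthy_targets` helper and the sample payload (including its error message) are illustrative, and only the fields used here (`activeTargets`, `labels`, `scrapeUrl`, `health`, `lastError`) are taken from the Prometheus targets API.

```python
import json
from typing import Any


def unhealthy_targets(payload: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Return (job, scrapeUrl, lastError) for every active target that is not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", "?"),
            target.get("scrapeUrl", "?"),
            target.get("lastError", ""),
        )
        for target in targets
        if target.get("health") != "up"
    ]


# Trimmed, made-up response in the shape of GET /api/v1/targets.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "spring-boot"},
       "scrapeUrl": "http://backend:8081/actuator/prometheus",
       "health": "up", "lastError": ""},
      {"labels": {"job": "ocr-service"},
       "scrapeUrl": "http://ocr:8000/metrics",
       "health": "down", "lastError": "connection refused"}
    ]
  }
}
""")

for job, url, error in unhealthy_targets(sample):
    print(f"{job}: {url} is not up ({error})")
```

An empty result means every active target was scraped successfully on its last attempt.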
### Useful Prometheus queries (run on the server or via Grafana Explore → Prometheus)

```promql
# HTTP error rate (5xx) as a fraction of all requests
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m]))

# p95 response time
histogram_quantile(0.95, sum by (le) (
  rate(http_server_requests_seconds_bucket[5m])
))

# JVM heap used
jvm_memory_used_bytes{area="heap", application="Familienarchiv"}

# Active DB connections
hikaricp_connections_active
```

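When running these queries on the server rather than in Grafana, they go through the Prometheus HTTP API (`/api/v1/query`). The sketch below shows one way to do that with the Python standard library only. It assumes Prometheus is reachable at `http://localhost:9090` as in the `curl` example earlier; `query_url`, `first_value`, and `run` are illustrative helper names.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Optional

PROMETHEUS = "http://localhost:9090"  # assumption: executed on the server

ERROR_RATIO = (
    'sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))'
    " / sum(rate(http_server_requests_seconds_count[5m]))"
)


def query_url(expr: str, base: str = PROMETHEUS) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def first_value(payload: dict[str, Any]) -> Optional[float]:
    """Extract the first sample of an instant-vector response, or None if empty."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    result = payload["data"]["result"]
    # Each sample is [unix_timestamp, "value as string"].
    return float(result[0]["value"][1]) if result else None


def run(expr: str) -> Optional[float]:
    """Execute the query against Prometheus (performs a network call)."""
    with urllib.request.urlopen(query_url(expr), timeout=10) as response:
        return first_value(json.load(response))


# Example (on the server): print(run(ERROR_RATIO))
```

Note that a ratio query yields `NaN` when the denominator is zero, for example when the backend has served no requests in the last five minutes.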
### OCR-service custom metrics

The OCR service exposes its metrics at `ocr:8000/metrics` via `prometheus-fastapi-instrumentator`. The `http_*` metrics describe the FastAPI request layer; the `ocr_*` series are domain-specific. **Never label these with PII or document content**: labels are visible to anyone with Grafana access, and unbounded label values risk a cardinality explosion.

| Metric | Type | Labels | Unit | What it tracks |
|---|---|---|---|---|
| `ocr_jobs_total` | Counter | `engine` (`surya`/`kraken`), `script_type` | jobs | OCR jobs that started after a successful PDF download |
| `ocr_pages_total` | Counter | `engine` | pages | Successfully OCR'd pages in the streaming generator |
| `ocr_skipped_pages_total` | Counter | none | pages | Pages skipped because the engine raised on them |
| `ocr_words_total` | Counter | none | words | Recognized words summed across every block |
| `ocr_illegible_words_total` | Counter | none | words | Words below the confidence threshold (rendered as `[unleserlich]`) |
| `ocr_processing_seconds` | Histogram | `engine` | seconds | Per-page (stream) or per-document (`/ocr`) engine time, excluding preprocessing |
| `ocr_training_runs_total` | Counter | `kind` (`recognition`/`segmentation`), `outcome` (`success`/`error`) | runs | Completed training runs |
| `ocr_model_accuracy` | Gauge | `kind` | ratio (0–1) | Latest accuracy reported by a successful training run |
| `ocr_models_ready` | Gauge | none | 0\|1 | 1 once the lifespan startup has finished loading models |

Canonical example queries (the same ones referenced in issue #652):

```promql
# OCR throughput by engine
sum by (engine) (rate(ocr_pages_total[5m]))

# Share of words rendered as [unleserlich]
sum(rate(ocr_illegible_words_total[5m]))
  / sum(rate(ocr_words_total[5m]))

# p95 page processing time per engine
histogram_quantile(0.95, sum by (engine, le) (
  rate(ocr_processing_seconds_bucket[5m])
))

# Training error rate
sum(rate(ocr_training_runs_total{outcome="error"}[1h]))
  / sum(rate(ocr_training_runs_total[1h]))

# Latest recognition vs segmentation accuracy
ocr_model_accuracy
```

### Internal-only endpoints

`/metrics` is exposed by the OCR service over plain HTTP without authentication. The container is reachable only on the internal Docker network; Caddy never proxies to it directly. If the service is ever exposed (e.g. a `ports:` mapping is added), block the endpoint at the reverse proxy:

```caddy
ocr.example.com {
    @internal_only path /metrics /health
    respond @internal_only 404
    reverse_proxy ocr:8000
}
```

The `MetricsPathFilter` in `ocr-service/main.py` suppresses uvicorn's **stdout** access log lines for `/metrics` and `/health` so the container console stays focused on real OCR traffic. Access lines from any other source still reach Promtail/Loki. Treat the filter as console noise control, not as an audit-suppression mechanism.

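For readers unfamiliar with this kind of log filter, the following is a hypothetical reimplementation using only the standard `logging` module. It is not the actual code from `ocr-service/main.py`: the class name `QuietPathFilter` and the regex are illustrative, and the sketch assumes uvicorn's default access log format (`client - "METHOD path HTTP/version" status`) on the `uvicorn.access` logger.

```python
import logging
import re

# Matches the request line inside uvicorn's default access log message,
# e.g. '10.0.0.5:1234 - "GET /metrics HTTP/1.1" 200'.
REQUEST_LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/')


class QuietPathFilter(logging.Filter):
    """Drop access log records for noisy internal endpoints (illustrative)."""

    def __init__(self, quiet_paths=("/metrics", "/health")):
        super().__init__()
        self.quiet_paths = frozenset(quiet_paths)

    def filter(self, record: logging.LogRecord) -> bool:
        match = REQUEST_LINE.search(record.getMessage())
        if match is None:
            return True  # not an access line: keep it
        path = match["path"].split("?", 1)[0]  # ignore any query string
        return path not in self.quiet_paths


# Attach to uvicorn's access logger; other loggers are unaffected.
logging.getLogger("uvicorn.access").addFilter(QuietPathFilter())
```

Because the filter is attached to a single logger, it only changes what that logger emits. It does not restrict who can call `/metrics`, which is why the reverse-proxy block above is still needed if the service is ever exposed.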
## Errors (GlitchTip)

GlitchTip receives errors from both the backend (via Sentry Java SDK) and the frontend (via Sentry JavaScript SDK). It groups events by fingerprint, tracks first/last seen times, and links to the release that introduced the error.

GlitchTip complements Loki: use GlitchTip when you need **grouped, de-duplicated errors with stack traces and release attribution**; use Loki when you need **raw log lines with full context** or want to search across all log levels.

## Direct API access (debugging only)

Loki and Tempo bind no host ports. To reach them directly from your laptop, use an SSH tunnel through the server:

```bash
# Look up the obs-loki container IP on the server first. The network name
# contains hyphens, so the Go template needs `index`:
#   docker inspect obs-loki --format '{{(index .NetworkSettings.Networks "archiv-obs-net").IPAddress}}'

# Loki API on localhost:3100 (then query via curl or logcli)
ssh -L 3100:172.20.0.x:3100 root@raddatz.cloud   # replace 172.20.0.x with that IP

# Tempo API on localhost:3200 (then query via curl or tempo-cli)
ssh -L 3200:172.20.0.x:3200 root@raddatz.cloud   # replace 172.20.0.x with the Tempo container IP
```

In practice, Grafana Explore covers all common debugging workflows without needing direct API access.

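Once the Loki tunnel is up, queries go to Loki's HTTP API on `localhost:3100`. As a rough sketch, the helper below builds a `query_range` URL for the last few minutes of logs; it assumes the tunnel from the block above is open, and `loki_query_range_url` is an illustrative name. Loki expects `start` and `end` as Unix timestamps in nanoseconds.

```python
import time
import urllib.parse
from typing import Optional

LOKI = "http://localhost:3100"  # assumption: local end of the SSH tunnel

NANOS_PER_MINUTE = 60 * 1_000_000_000


def loki_query_range_url(
    logql: str,
    minutes: int = 15,
    limit: int = 100,
    now_ns: Optional[int] = None,
    base: str = LOKI,
) -> str:
    """Build a /loki/api/v1/query_range URL covering the last `minutes` minutes."""
    end = time.time_ns() if now_ns is None else now_ns
    start = end - minutes * NANOS_PER_MINUTE
    params = {
        "query": logql,
        "start": start,
        "end": end,
        "limit": limit,
        "direction": "backward",  # newest lines first
    }
    return f"{base}/loki/api/v1/query_range?" + urllib.parse.urlencode(params)


# Print a URL that can be passed to curl while the tunnel is open.
print(loki_query_range_url('{job="backend"} |~ "ERROR|WARN"'))
```

The printed URL can be fetched with `curl`; `urlencode` takes care of escaping the braces, quotes, and pipes in the LogQL expression.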
## Signal summary

| Signal | Source | Transport | Storage | UI |
|---|---|---|---|---|
| Application logs | Spring Boot stdout → Docker log driver | Promtail → Loki push API | Loki | Grafana Explore → Loki |
| Distributed traces | Spring Boot OTel agent | OTLP HTTP → Tempo:4318 | Tempo | Grafana Explore → Tempo |
| JVM + HTTP metrics | Spring Actuator `/actuator/prometheus` | Prometheus pull (15 s) | Prometheus | Grafana dashboards |
| Host metrics | node-exporter | Prometheus pull | Prometheus | Grafana → Node Exporter Full |
| Container metrics | cAdvisor | Prometheus pull | Prometheus | Grafana (via Prometheus datasource) |
| Application errors | Sentry SDK | HTTP POST → GlitchTip ingest | GlitchTip DB | GlitchTip UI |