From c0d034c85d2a7b2f7bfa681493a487011b0cfedf Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 13:47:27 +0200
Subject: [PATCH 01/15] =?UTF-8?q?docs(adr):=20add=20ADR-028=20=E2=80=94=20?=
 =?UTF-8?q?Ollama=20Docker=20Compose=20service=20for=20NL=20search=20(#737?=
 =?UTF-8?q?)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
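Reviewer note (not part of the commit): the blank-check condition in Decisions §3 and the API key guard in §6 of the ADR below both reduce to an "absent or blank" test. The following is a minimal plain-Java sketch of that logic, without Spring. The class and method names (`OllamaToggle`, `isEnabled`, `authHeaders`) are hypothetical and chosen for illustration only; they are not taken from the codebase.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not the project's RestClientOllamaClient.
final class OllamaToggle {

    // Mirrors the SpEL expression !'${app.ollama.base-url:}'.isBlank():
    // an absent property is treated like the empty string.
    static boolean isEnabled(String baseUrl) {
        String resolved = (baseUrl == null) ? "" : baseUrl;
        return !resolved.isBlank();
    }

    // Builds request headers and omits Authorization when the key is absent or blank,
    // so that "Authorization: Bearer " with an empty token is never sent.
    static Map<String, String> authHeaders(String apiKey) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (apiKey != null && !apiKey.isBlank()) {
            headers.put("Authorization", "Bearer " + apiKey);
        }
        return headers;
    }

    public static void main(String[] args) {
        System.out.println(isEnabled(null));                  // property absent
        System.out.println(isEnabled("   "));                 // property blank
        System.out.println(isEnabled("http://ollama:11434")); // property set
        System.out.println(authHeaders(""));
        System.out.println(authHeaders("secret"));
    }
}
```

The null check in `authHeaders` is an assumption on my part: a `@ConfigurationProperties` record component is `null` when its property is absent, so a bare `apiKey.isBlank()` could throw in that case.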
 docs/adr/028-ollama-docker-compose-service.md | 227 ++++++++++++++++++
 1 file changed, 227 insertions(+)
 create mode 100644 docs/adr/028-ollama-docker-compose-service.md

diff --git a/docs/adr/028-ollama-docker-compose-service.md b/docs/adr/028-ollama-docker-compose-service.md
new file mode 100644
index 00000000..d1e7f41e
--- /dev/null
+++ b/docs/adr/028-ollama-docker-compose-service.md
@@ -0,0 +1,227 @@
+# ADR-028: Ollama Docker Compose service for NL search
+
+**Date:** 2026-06-06
+**Status:** Accepted
+**Deciders:** Marcel Raddatz
+**Relates to:** #737 (infrastructure), #735 (NL search epic)
+
+---
+
+## Context
+
+Issue #735 introduces natural-language document search, requiring a local LLM to generate embeddings and/or run inference at query time. The family archive stores personal family history — data privacy is non-negotiable, so cloud-based inference APIs are excluded. The production target is a Hetzner CX42 (16 GB RAM, 8 vCPUs, CPU-only, ~32 EUR/month).
+
+Alternatives considered:
+
+| Option | Outcome and rationale |
+|---|---|
+| **llama.cpp** | Rejected: no HTTP API out of the box; requires custom wrapper; higher ops burden |
+| **vLLM** | Rejected: GPU-first; significant overhead on CPU-only hardware; overkill for this scale |
+| **Cloud APIs** (OpenAI, Gemini, etc.) | Rejected: vendor lock-in; per-token cost at scale; data leaves the server, which is unacceptable for a private family archive |
+| **Ollama** | Chosen: self-contained Docker image; built-in HTTP REST API; actively maintained; CPU-compatible; zero egress |
+
+**Decision:** run Ollama as a Docker Compose service alongside the existing stack.
+
+---
+
+## Decisions
+
+### 1. Hardware minimums and CPU-only constraint
+
+All inference runs on CPU. The target is the Hetzner CX42 (16 GB RAM, 8 vCPUs).
+
+| Tier | RAM | NL search |
+|---|---|---|
+| CX42 | 16 GB | Supported — full stack including Ollama |
+| CX32 | 8 GB | Disabled — set `APP_OLLAMA_BASE_URL=` (empty) to skip Ollama entirely |
+| CX22 | 4 GB | Unsupported for NL search |
+
+### 2. Memory budget on CX42
+
+| Component | `mem_limit` | Typical active RSS |
+|---|---|---|
+| OCR service | 12g (hard ceiling) | ~6 GB |
+| Ollama | 8g | ~8 GB |
+| **Total** | | **~14 GB active** |
+
+`memswap_limit` on the Ollama service is set to `8g` (matching `mem_limit`), which gives the container no swap allowance, so model weights are never paged out under OCR memory pressure. Swapped-out weights do not crash the container but silently degrade inference latency. This mirrors the pattern already applied to the OCR service.
+
+**Operational constraint:** do NOT run `docker-compose.observability.yml` continuously alongside both OCR and Ollama on a CX42. The observability stack adds ~2 GB, which leaves no headroom.
+
+### 3. Graceful-degradation contract
+
+`app.ollama.base-url` absent OR blank → Ollama bean NOT registered → NL search returns HTTP 503 with `ErrorCode: NL_SEARCH_UNAVAILABLE`.
+
+Every unavailability scenario surfaces to the caller as this same 503 response: base-url unset, service unreachable, health check failed, and request timeout.
+
+#### Why not `@ConditionalOnProperty`
+
+`@ConditionalOnProperty` registers the bean when the property is present but blank (`APP_OLLAMA_BASE_URL=`). This produces a `RestClient` with an empty base URL that fails at runtime with an opaque error rather than a clean 503.
+
+#### Correct condition expression
+
+```java
+@ConditionalOnExpression("!'${app.ollama.base-url:}'.isBlank()")
+```
+
+When the property is absent, the placeholder resolves to `''`; `.isBlank()` returns `true`; negation makes the condition `false`; the bean is not registered. Same result for an explicit empty string (`APP_OLLAMA_BASE_URL=`).
+
+### 4. Backend configuration pattern
+
+Use a `@ConfigurationProperties` record, not separate `@Value` injections:
+
+```java
+@ConfigurationProperties("app.ollama")
+record OllamaProperties(String baseUrl, String apiKey) {}
+```
+
+`OllamaProperties` is registered unconditionally — it is a plain value holder with no side effects.
+
+`@ConditionalOnExpression` belongs **only** on `RestClientOllamaClient` (the bean that creates a live network client).
+
+**Deliberate divergence from the OCR pattern:** the OCR service uses `@Value`-with-default because OCR is always-on and `http://ocr-service:8000` is a safe default. Ollama is truly optional — a missing URL means "feature disabled", not "use this default server". There is no safe default Ollama URL.
+
+### 5. Optional<OllamaClient> injection
+
+The NL search service uses constructor injection with `Optional<OllamaClient>`:
+
+```java
+private final Optional<OllamaClient> ollamaClient;
+```
+
+When empty (bean not registered), the service method returns 503 immediately:
+
+```java
+var client = ollamaClient.orElseThrow(
+    () -> DomainException.internal(ErrorCode.NL_SEARCH_UNAVAILABLE, "Ollama not configured"));
+```
+
+Prefer this over `@Autowired(required = false)` with a null check — the null-check pattern is noisy when the service already uses `@RequiredArgsConstructor`.
+
+### 6. Empty API key guard
+
+`RestClientOllamaClient` omits the `Authorization` header entirely when `apiKey` is blank:
+
+```java
+if (apiKey != null && !apiKey.isBlank()) {
+    request.header("Authorization", "Bearer " + apiKey);
+}
+```
+
+Sending `Authorization: Bearer ` (empty token) has undefined or potentially broken behavior depending on the Ollama version. This mirrors the `trainingToken` guard in `RestClientOcrClient.java:107`.
+
+### 7. OLLAMA_API_KEY empty-string behavior
+
+**TBD:** Empirical verification pending on Ollama 0.6.5.
+
+Unknown: whether `OLLAMA_API_KEY=` (explicit empty string) is treated as "no auth" (unauthenticated requests accepted) or "invalid key" (all requests rejected). Both the empty-string and fully-unset cases must be tested.
+
+If empty-string rejects requests, the `.env.example` comment "Leave empty to run unauthenticated" must be corrected and this ADR updated.
+
+**Action item:** run empirical test (`OLLAMA_API_KEY=` vs `# OLLAMA_API_KEY` in env) and record result before merging PR.
+
+### 8. read_only: true feasibility
+
+**TBD:** Investigation pending on Ollama 0.6.5.
+
+Test command:
+```bash
+docker run --rm --read-only --entrypoint sh \
+  -v ollama_models:/root/.ollama \
+  --tmpfs /tmp \
+  ollama/ollama:0.6.5 \
+  -c "ollama serve & sleep 3 && ollama pull qwen2.5:7b-instruct-q4_K_M && ollama list"
+```
+
+All three operations (serve, pull, list) must pass to confirm no hidden write paths. Ollama may write to `/root/.config/ollama`, `/var/run`, or `/tmp/ollama*`.
+
+- If test succeeds: add `read_only: true` to the `ollama` service; document the tmpfs size needed.
+- If test fails: document which paths require writes and why `read_only` cannot be applied.
+
+**Action item:** run investigation before merging PR.
+
+### 9. Peak RSS of init container during pull
+
+**TBD:** Investigation pending.
+
+The `ollama-model-init` container currently has `mem_limit: 2g`. If peak RSS during `qwen2.5:7b-instruct-q4_K_M` pull exceeds 2 GB, bump to 4 GB.
+
+**Action item:** capture `docker stats` output during pull and record peak RSS here before merging PR.
+
+### 10. Init container pull mechanism
+
+The `ollama-model-init` container uses a curl-based readiness loop with captured PID:
+
+```sh
+ollama serve & SERVE_PID=$!
+until curl -sf http://localhost:11434/api/tags; do sleep 1; done
+ollama pull qwen2.5:7b-instruct-q4_K_M
+kill $SERVE_PID
+```
+
+`kill %1` (job-control syntax) is unreliable in non-interactive `sh -c` contexts. Capturing the PID via `SERVE_PID=$!` is reliable.
+
+The same endpoint (`/api/tags`) is used for both the init container readiness loop and the main service `healthcheck`.
+
+### 11. start_period: 60s rationale
+
+The model is pre-pulled by `ollama-model-init` before the main service starts (via `condition: service_completed_successfully`). At main service startup, Ollama only loads model weights from the named volume and binds port 11434.
+
+60 seconds is appropriate for this cold-start profile. 300 seconds was considered — that would be appropriate if the service pulled the model itself — but overstates actual startup time when the model is already present on the volume.
+
+### 12. Security threat model
+
+**Primary control:** `archiv-net` network isolation. Ollama has no externally exposed port (`expose:` only, not `ports:`). The Caddyfile must not route any path to the Ollama service.
+
+**Defense-in-depth:** `OLLAMA_API_KEY` guards against lateral movement from a compromised backend container.
+
+Both `ollama` and `ollama-model-init` receive the ADR-019 hardening baseline:
+
+```yaml
+cap_drop: [ALL]
+security_opt: [no-new-privileges:true]
+```
+
+### 13. CI exclusion strategy
+
+Docker Compose profiles are not used — they would add developer friction (requiring `--profile ...` for all local dev commands).
+
+CI uses explicit service selection in `docker-compose.ci.yml`:
+```bash
+docker compose -f docker-compose.ci.yml up -d db minio create-buckets
+```
+
+Ollama is simply not listed and is never started in CI. A YAML comment on the `ollama` service block documents this:
+
+```yaml
+# Not started in CI — CI uses explicit service selection
+# (docker-compose.ci.yml: db minio create-buckets)
+```
+
+### 14. ollama_models volume operational note
+
+The `ollama_models` named volume holds model weights only — fully reproducible by re-pull. No backup is needed.
+
+If the volume fills after a model upgrade:
+```bash
+docker volume rm ollama_models && docker compose up -d
+```
+The init container re-pulls the model on next startup.
+
+---
+
+## Consequences
+
+### Positive
+
+- NL search runs entirely on-premises; no data leaves the server and no per-token cloud cost.
+- Graceful degradation is a first-class concern: smaller or budget-constrained instances can run the app without Ollama with a single env var change.
+- The init container pattern keeps model pull out of the critical startup path for the main service, giving accurate healthcheck timings.
+- `@ConditionalOnExpression` with a blank-check is more correct than `@ConditionalOnProperty` for optional features with no safe default URL.
+
+### Risks and operational implications
+
+- **Memory pressure:** OCR + Ollama together consume ~14 GB on a 16 GB host. Running the observability stack simultaneously risks OOM kills. Monitor with `docker stats`.
+- **CPU inference latency:** `qwen2.5:7b-instruct-q4_K_M` is chosen for CPU viability, but inference on 8 vCPUs will be noticeably slower than GPU-accelerated alternatives. This is acceptable for the family archive use case (low concurrency, not real-time).
+- **Three TBD items** (OLLAMA_API_KEY empty-string behavior, `read_only` feasibility, init container peak RSS) must be resolved before the PR is merged. See Decisions §7, §8, §9.
+- Model upgrades require a `docker volume rm` to free old weights before pulling the replacement. Document this in runbook/DEPLOYMENT.md.
-- 
2.49.1


From 1f379a161d35e81d77cbf84ead7158e71b7e4d90 Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 13:53:33 +0200
Subject: [PATCH 02/15] fix(observability): fix OCR target name + add Ollama
 scrape job (#737)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- prometheus.yml: ocr:8000 → ocr-service:8000 (Docker service name is
  ocr-service, not ocr — current scrape target has never resolved)
- Add Ollama scrape job on ollama:11434 /metrics

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 infra/observability/prometheus/prometheus.yml | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/infra/observability/prometheus/prometheus.yml b/infra/observability/prometheus/prometheus.yml
index 4838bc1c..53121566 100644
--- a/infra/observability/prometheus/prometheus.yml
+++ b/infra/observability/prometheus/prometheus.yml
@@ -20,4 +20,10 @@ scrape_configs:
   - job_name: ocr-service
     metrics_path: /metrics
     static_configs:
-      - targets: ['ocr:8000']
+      - targets: ['ocr-service:8000']
+
+  - job_name: ollama
+    metrics_path: /metrics
+    static_configs:
+      # Uses the Docker service name for reliable DNS resolution.
+      - targets: ['ollama:11434']
-- 
2.49.1


From 25252fc709b920fefbf8202c5328166ea83cc233 Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 13:55:07 +0200
Subject: [PATCH 03/15] feat(observability): add Grafana Ollama inference
 latency dashboard (#737)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 .../provisioning/dashboards/ollama.json       | 218 ++++++++++++++++++
 1 file changed, 218 insertions(+)
 create mode 100644 infra/observability/grafana/provisioning/dashboards/ollama.json

diff --git a/infra/observability/grafana/provisioning/dashboards/ollama.json b/infra/observability/grafana/provisioning/dashboards/ollama.json
new file mode 100644
index 00000000..47536e2d
--- /dev/null
+++ b/infra/observability/grafana/provisioning/dashboards/ollama.json
@@ -0,0 +1,218 @@
+{
+  "id": null,
+  "uid": "ollama-dashboard",
+  "title": "Ollama",
+  "description": "Ollama inference latency and request rate",
+  "version": 1,
+  "schemaVersion": 39,
+  "tags": ["ollama", "inference"],
+  "timezone": "browser",
+  "editable": true,
+  "fiscalYearStartMonth": 0,
+  "graphTooltip": 1,
+  "links": [],
+  "liveNow": false,
+  "refresh": "30s",
+  "time": {
+    "from": "now-1h",
+    "to": "now"
+  },
+  "timepicker": {},
+  "weekStart": "",
+  "annotations": {
+    "list": [
+      {
+        "builtIn": 1,
+        "datasource": { "type": "datasource", "uid": "grafana" },
+        "enable": true,
+        "hide": true,
+        "iconColor": "rgba(0, 211, 255, 1)",
+        "name": "Annotations & Alerts",
+        "type": "dashboard"
+      }
+    ]
+  },
+  "panels": [
+    {
+      "id": 1,
+      "type": "timeseries",
+      "title": "Inference Latency p50",
+      "description": "50th percentile of Ollama request duration over a 5-minute window",
+      "gridPos": { "h": 8, "w": 8, "x": 0, "y": 0 },
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "palette-classic" },
+          "custom": {
+            "axisBorderShow": false,
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": { "legend": false, "tooltip": false, "viz": false },
+            "insertNulls": false,
+            "lineInterpolation": "linear",
+            "lineWidth": 2,
+            "pointSize": 5,
+            "scaleDistribution": { "type": "linear" },
+            "showPoints": "auto",
+            "spanNulls": false,
+            "stacking": { "group": "A", "mode": "none" },
+            "thresholdsStyle": { "mode": "off" }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "red", "value": 80 }
+            ]
+          },
+          "unit": "s"
+        },
+        "overrides": []
+      },
+      "options": {
+        "legend": { "calcs": ["mean", "max"], "displayMode": "list", "placement": "bottom", "showLegend": true },
+        "tooltip": { "mode": "single", "sort": "none" }
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "prometheus" },
+          "editorMode": "code",
+          "expr": "histogram_quantile(0.5, rate(ollama_request_duration_seconds_bucket[5m]))",
+          "instant": false,
+          "legendFormat": "p50",
+          "range": true,
+          "refId": "A"
+        }
+      ]
+    },
+    {
+      "id": 2,
+      "type": "timeseries",
+      "title": "Inference Latency p95",
+      "description": "95th percentile of Ollama request duration over a 5-minute window",
+      "gridPos": { "h": 8, "w": 8, "x": 8, "y": 0 },
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "palette-classic" },
+          "custom": {
+            "axisBorderShow": false,
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": { "legend": false, "tooltip": false, "viz": false },
+            "insertNulls": false,
+            "lineInterpolation": "linear",
+            "lineWidth": 2,
+            "pointSize": 5,
+            "scaleDistribution": { "type": "linear" },
+            "showPoints": "auto",
+            "spanNulls": false,
+            "stacking": { "group": "A", "mode": "none" },
+            "thresholdsStyle": { "mode": "off" }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "red", "value": 80 }
+            ]
+          },
+          "unit": "s"
+        },
+        "overrides": []
+      },
+      "options": {
+        "legend": { "calcs": ["mean", "max"], "displayMode": "list", "placement": "bottom", "showLegend": true },
+        "tooltip": { "mode": "single", "sort": "none" }
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "prometheus" },
+          "editorMode": "code",
+          "expr": "histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m]))",
+          "instant": false,
+          "legendFormat": "p95",
+          "range": true,
+          "refId": "A"
+        }
+      ]
+    },
+    {
+      "id": 3,
+      "type": "timeseries",
+      "title": "Request Rate",
+      "description": "Ollama requests per second over a 5-minute window",
+      "gridPos": { "h": 8, "w": 8, "x": 16, "y": 0 },
+      "datasource": { "type": "prometheus", "uid": "prometheus" },
+      "fieldConfig": {
+        "defaults": {
+          "color": { "mode": "palette-classic" },
+          "custom": {
+            "axisBorderShow": false,
+            "axisCenteredZero": false,
+            "axisColorMode": "text",
+            "axisLabel": "",
+            "axisPlacement": "auto",
+            "barAlignment": 0,
+            "drawStyle": "line",
+            "fillOpacity": 10,
+            "gradientMode": "none",
+            "hideFrom": { "legend": false, "tooltip": false, "viz": false },
+            "insertNulls": false,
+            "lineInterpolation": "linear",
+            "lineWidth": 2,
+            "pointSize": 5,
+            "scaleDistribution": { "type": "linear" },
+            "showPoints": "auto",
+            "spanNulls": false,
+            "stacking": { "group": "A", "mode": "none" },
+            "thresholdsStyle": { "mode": "off" }
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null },
+              { "color": "red", "value": 80 }
+            ]
+          },
+          "unit": "reqps"
+        },
+        "overrides": []
+      },
+      "options": {
+        "legend": { "calcs": ["mean", "max"], "displayMode": "list", "placement": "bottom", "showLegend": true },
+        "tooltip": { "mode": "single", "sort": "none" }
+      },
+      "targets": [
+        {
+          "datasource": { "type": "prometheus", "uid": "prometheus" },
+          "editorMode": "code",
+          "expr": "rate(ollama_requests_total[5m])",
+          "instant": false,
+          "legendFormat": "req/s",
+          "range": true,
+          "refId": "A"
+        }
+      ]
+    }
+  ],
+  "preload": false,
+  "templating": {
+    "list": []
+  }
+}
-- 
2.49.1


From 64120a30b5211170b87d260d059934cb3b360dd2 Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 13:56:16 +0200
Subject: [PATCH 04/15] docs(arch): add Ollama container to C4 level-2
 container diagram (#737)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 docs/architecture/c4/l2-containers.puml | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/docs/architecture/c4/l2-containers.puml b/docs/architecture/c4/l2-containers.puml
index 5bfd6799..d72a8a5d 100644
--- a/docs/architecture/c4/l2-containers.puml
+++ b/docs/architecture/c4/l2-containers.puml
@@ -12,13 +12,15 @@ System_Boundary(archiv, "Familienarchiv (Docker Compose)") {
     Container(frontend, "Web Frontend", "SvelteKit / Node adapter / port 3000", "Server-side rendered UI. Handles auth session cookies, document search and viewer, transcription editor, annotation layer, family tree (Stammbaum), stories (Geschichten), activity feed (Chronik), enrichment workflow, and admin panel.")
     Container(backend, "API Backend", "Spring Boot 4 / Java 21 / Jetty / port 8080", "REST API. Implements document management, search, user auth, file upload/download, transcription, OCR orchestration, and SSE notifications. Trusts X-Forwarded-* headers from Caddy.")
     Container(ocr, "OCR Service", "Python FastAPI / port 8000", "Handwritten text recognition (HTR) and OCR microservice. Single-node by design — see ADR-001. Reachable only on the internal Docker network; no external port exposed.")
+    Container(ollama, "Ollama LLM Service", "ollama/ollama:0.6.5 / port 11434 (internal only)", "Local LLM inference server for NL search. Runs qwen2.5:7b-instruct-q4_K_M on CPU. Reachable only on the internal Docker network; no external port exposed. Disabled when APP_OLLAMA_BASE_URL is unset or blank.")
+    ' Named volume: ollama_models — model weights, fully reproducible, no backup needed
     ContainerDb(db, "Relational Database", "PostgreSQL 16", "Stores document metadata, persons, users, permission groups, tags, transcription blocks, audit log, and Spring Session data.")
     ContainerDb(storage, "Object Storage", "MinIO (S3-compatible)", "Stores the actual document files (PDFs, scans). Backend uses a bucket-scoped service account (archiv-app), not MinIO root.")
     Container(mc, "Bucket / Service-Account Init", "MinIO Client (mc)", "One-shot container on startup. Idempotent: creates the archive bucket, the archiv-app service account, and attaches the readwrite policy.")
 }
 
 System_Boundary(observability, "Observability Stack (/opt/familienarchiv/docker-compose.observability.yml)") {
-    Container(prometheus, "Prometheus", "prom/prometheus:v3.4.0", "Scrapes metrics from backend management port 8081 (/actuator/prometheus), node-exporter, and cAdvisor. Retention: 30 days.")
+    Container(prometheus, "Prometheus", "prom/prometheus:v3.4.0", "Scrapes metrics from backend (8081 /actuator/prometheus), OCR service (8000 /metrics), Ollama (11434 /metrics), node-exporter, and cAdvisor. Retention: 30 days.")
     Container(node_exporter, "Node Exporter", "prom/node-exporter:v1.9.0", "Host-level CPU, memory, disk, and network metrics.")
     Container(cadvisor, "cAdvisor", "gcr.io/cadvisor/cadvisor:v0.52.1", "Per-container resource metrics.")
     Container(loki, "Loki", "grafana/loki:3.4.2", "Stores log streams from all containers.")
@@ -45,6 +47,8 @@ Rel(promtail, loki, "Pushes log streams", "HTTP/Loki push API")
 Rel(backend, tempo, "Sends distributed traces via OTLP", "HTTP / OTLP / port 4318 (archiv-net)")
 Rel(prometheus, backend, "Scrapes JVM + HTTP metrics", "HTTP 8081 /actuator/prometheus")
 Rel(prometheus, ocr, "Scrapes OCR + http_* metrics", "HTTP 8000 /metrics")
+Rel(backend, ollama, "NL search inference requests", "HTTP / REST / JSON")
+Rel(prometheus, ollama, "Scrapes LLM request metrics", "HTTP 11434 /metrics")
 Rel(grafana, prometheus, "Queries metrics", "HTTP 9090")
 Rel(grafana, loki, "Queries logs", "HTTP 3100")
 Rel(grafana, tempo, "Queries traces", "HTTP 3200")
-- 
2.49.1


From df10a420696ebbf732648ca17a81f4eb2037cd81 Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 13:59:46 +0200
Subject: [PATCH 05/15] docs(deploy): document Ollama hardware requirements,
 env vars, and ops notes (#737)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
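Reviewer note (not part of the commit): the first-start pull times quoted in the DEPLOYMENT.md change below follow from simple bandwidth arithmetic. The sketch below reproduces the lower bounds under idealised assumptions: decimal gigabytes, a fully saturated link, and no protocol overhead or registry throttling. The class name `PullTimeEstimate` is hypothetical.

```java
// Hypothetical helper for sanity-checking the documented pull times.
final class PullTimeEstimate {

    // Minutes needed to transfer sizeGb gigabytes (10^9 bytes each)
    // over a link of linkMbps megabits per second.
    static double minutes(double sizeGb, double linkMbps) {
        double megabits = sizeGb * 8_000.0; // 1 GB = 8000 megabits
        double seconds = megabits / linkMbps;
        return seconds / 60.0;
    }

    public static void main(String[] args) {
        // Model size of ~4.7 GB is taken from the deployment note.
        System.out.printf("10 Mbps:  %.1f min%n", minutes(4.7, 10));
        System.out.printf("100 Mbps: %.1f min%n", minutes(4.7, 100));
    }
}
```

The results (about 63 minutes at 10 Mbps and about 6 minutes at 100 Mbps) match the lower end of the documented ranges; the upper ends presumably allow for real-world overhead.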
 docs/DEPLOYMENT.md | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md
index 3fddf929..452ddab4 100644
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -50,15 +50,17 @@ graph TD
 
 The OCR service requires significant RAM for model loading. The dev compose sets `mem_limit: 12g`.
 
-| Production target | RAM | Recommended OCR limit | Notes |
-|---|---|---|---|
-| Current server (Hetzner Serverbörse, i7-6700) | 64 GB | 12 GB | Default `mem_limit: 12g` works comfortably |
-| ≥ 16 GB RAM | 16+ GB | 12 GB | Default works |
-| 8 GB RAM | 8 GB | 6 GB | Set `OCR_MEM_LIMIT=6g`; accept reduced batch sizes |
-| 4 GB RAM | 4 GB | — | Disable OCR service (`profiles: [ocr]`); run OCR on demand only |
+| Production target | RAM | Recommended OCR limit | NL Search | Notes |
+|---|---|---|---|---|
+| Current server (Hetzner Serverbörse, i7-6700) | 64 GB | 12 GB | Supported | Default `mem_limit: 12g` works comfortably; plenty of headroom for Ollama |
+| ≥ 16 GB RAM | 16+ GB | 12 GB | Supported | Default works |
+| 8 GB RAM | 8 GB | 6 GB | Disabled — set `APP_OLLAMA_BASE_URL=` (empty) | Set `OCR_MEM_LIMIT=6g`; accept reduced batch sizes |
+| 4 GB RAM | 4 GB | — | Unsupported | Disable OCR service (`profiles: [ocr]`); run OCR on demand only |
 
 On servers with less than 16 GB RAM the default `mem_limit: 12g` cannot be honoured — set the `OCR_MEM_LIMIT` env var (in `.env.production` / `.env.staging`, or as a Gitea secret consumed by the workflow). The prod compose interpolates this var with a 12g default.
 
+> **Memory budget:** OCR (~6 GB active) + Ollama (~8 GB) = ~14 GB. On a 16 GB server (e.g. Hetzner CX42) this leaves no headroom for the observability stack (~2 GB), so do not run `docker-compose.observability.yml` continuously alongside both OCR and Ollama.
+
 ### Dev vs production differences
 
 | Concern | Dev (`docker-compose.yml`) | Prod (`docker-compose.prod.yml`) |
@@ -145,6 +147,16 @@ All vars are set in `.env` at the repo root (copy from `.env.example`). The back
 | `XDG_CACHE_HOME` | XDG cache base dir — redirects Matplotlib and other XDG-aware libraries away from the read-only `HOME` (`/home/ocr`) to the writable cache volume | `/app/cache` | — | — |
 | `TORCH_HOME` | PyTorch model cache — redirects `~/.cache/torch` to the writable models volume | `/app/models/torch` | — | — |
 
+### Ollama (NL search) service
+
+| Variable | Purpose | Default | Required? | Sensitive? |
+|---|---|---|---|---|
+| `APP_OLLAMA_BASE_URL` | Base URL for the Ollama service. Leave empty to disable NL search. | `http://ollama:11434` | — | — |
+| `APP_OLLAMA_API_KEY` | API key passed as `Authorization: Bearer` to Ollama. Leave empty for unauthenticated access. Note: whether Ollama 0.6.5 enforces `OLLAMA_API_KEY` is not yet verified (see ADR-028 §7). | — | — | YES |
+| `OLLAMA_CPU_LIMIT` | Docker CPU quota for the Ollama container. On CX42 (8 vCPUs) can be raised to `7.5`. | `4.0` | — | — |
+| `OLLAMA_MEM_LIMIT` | Memory limit for the Ollama container. Requires CX42 (16 GB RAM). | `8g` | — | — |
+| `OLLAMA_API_KEY` | API key set on the Ollama service itself. Same value as `APP_OLLAMA_API_KEY`. Leave empty for unauthenticated. | — | — | YES |
+
 ### Observability stack (`docker-compose.observability.yml`)
 
 | Variable | Purpose | Default | Required? | Sensitive? |
@@ -265,6 +277,8 @@ git.raddatz.cloud      A   <server IP>
 
 ### 3.4 First deploy
 
+> **First start — Ollama model pull:** On first `docker compose up -d`, the `ollama-model-init` container pulls `qwen2.5:7b-instruct-q4_K_M` (~4.7 GB). At 10 Mbps this takes approximately 60–90 minutes; at 100 Mbps approximately 6–10 minutes. The pull is a one-time operation — subsequent restarts skip it (model already on the `ollama_models` volume). Monitor progress with `docker logs -f $(docker ps -q --filter name=ollama-model-init)`.
+
 ```bash
 # 1. Trigger nightly.yml manually (Repo → Actions → nightly → "Run workflow")
 #    Expected: docker compose up -d --wait succeeds for archiv-staging, then
@@ -560,6 +574,14 @@ bash scripts/download-kraken-models.sh
 
 > Downloads the Kurrent/Sütterlin HTR models. Run once after a fresh clone or when models are updated.
 
+### Manage the `ollama_models` volume
+
+> **`ollama_models` volume:** holds model weights only — fully reproducible by re-pull, no backup needed. If the volume fills after a model upgrade:
+> ```bash
+> docker volume rm ollama_models && docker compose up -d
+> ```
+> The init container re-pulls the model on next startup.
+
 ### Trigger a canonical import
 
 The importer no longer parses the raw spreadsheet. It consumes the **canonical artifacts**
-- 
2.49.1


From 9637ebbca23bfaff576e748103ed11297ff8ae75 Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 14:01:41 +0200
Subject: [PATCH 06/15] feat(infra): add Ollama Docker Compose services for NL
 search (#737)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- ollama-model-init: one-shot init container that pulls qwen2.5:7b-instruct-q4_K_M
  into the ollama_models volume on first start
- ollama: main inference service on archiv-net (expose: only, no public port)
- ollama_models named volume for persistent model storage
- APP_OLLAMA_BASE_URL + APP_OLLAMA_API_KEY added to backend env
- Both services: cap_drop ALL, no-new-privileges, read_only+tmpfs (ADR-019 + ADR-028)
- start_period: 60s — model pre-pulled by init container

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 docker-compose.yml | 62 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/docker-compose.yml b/docker-compose.yml
index 74f1bd3e..a87cb84e 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -141,6 +141,65 @@ services:
     security_opt:
       - no-new-privileges:true
 
+  # --- Ollama: Model init (one-shot pull) ---
+  # Pulls qwen2.5:7b-instruct-q4_K_M (~4.7 GB) into the ollama_models volume on first start.
+  # On subsequent starts (model already in volume), exits quickly without re-downloading.
+  # Not started in CI — CI uses explicit service selection
+  # (docker-compose.ci.yml: db minio create-buckets)
+  ollama-model-init:
+    image: ollama/ollama:0.30.6
+    restart: "no"
+    networks:
+      - archiv-net
+    volumes:
+      - ollama_models:/root/.ollama
+    mem_limit: 2g
+    read_only: true
+    tmpfs:
+      - /tmp:size=512m
+    cap_drop:
+      - ALL
+    security_opt:
+      - no-new-privileges:true
+    entrypoint: ["sh", "-c"]
+    command: ["ollama serve & SERVE_PID=$$! && until curl -sf http://localhost:11434/api/tags; do sleep 1; done && ollama pull qwen2.5:7b-instruct-q4_K_M && kill $$SERVE_PID"]
+
+  # --- Ollama: LLM inference server ---
+  # Serves the pre-pulled model for NL search inference.
+  # Not started in CI — CI uses explicit service selection
+  # (docker-compose.ci.yml: db minio create-buckets)
+  ollama:
+    image: ollama/ollama:0.30.6
+    container_name: archive-ollama
+    restart: unless-stopped
+    expose:
+      - "11434"
+    networks:
+      - archiv-net
+    volumes:
+      - ollama_models:/root/.ollama
+    environment:
+      OLLAMA_API_KEY: "${OLLAMA_API_KEY}"
+    cpus: "${OLLAMA_CPU_LIMIT:-4.0}"
+    mem_limit: "${OLLAMA_MEM_LIMIT:-8g}"
+    memswap_limit: "${OLLAMA_MEM_LIMIT:-8g}"
+    read_only: true
+    tmpfs:
+      - /tmp:size=512m
+    cap_drop:
+      - ALL
+    security_opt:
+      - no-new-privileges:true
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
+      interval: 30s
+      timeout: 10s
+      retries: 5
+      start_period: 60s  # model weights are pre-pulled onto the volume by ollama-model-init; service only needs to bind port
+    depends_on:
+      ollama-model-init:
+        condition: service_completed_successfully
+
   # --- Backend: Spring Boot ---
   backend:
     build:
@@ -184,6 +243,8 @@ services:
       SPRING_MAIL_PROPERTIES_MAIL_SMTP_STARTTLS_ENABLE: ${MAIL_STARTTLS_ENABLE:-false}
       APP_OCR_BASE_URL: http://ocr-service:8000
       APP_OCR_TRAINING_TOKEN: "${OCR_TRAINING_TOKEN:-}"
+      APP_OLLAMA_BASE_URL: http://ollama:11434
+      APP_OLLAMA_API_KEY: "${OLLAMA_API_KEY}"
       SENTRY_DSN: ${SENTRY_DSN:-}
       SENTRY_TRACES_SAMPLE_RATE: ${SENTRY_TRACES_SAMPLE_RATE:-1.0}
       # Observability: send traces to Tempo inside archiv-net (OTLP gRPC port 4317)
@@ -247,3 +308,4 @@ volumes:
   frontend_node_modules:
   ocr_models:
   ocr_cache:
+  ollama_models:
-- 
2.49.1


From e8f3004c4f249b341acd66209d006eca2dc65a9f Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 14:02:32 +0200
Subject: [PATCH 07/15] feat(infra): add Ollama env vars to .env.example (#737)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 .env.example | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/.env.example b/.env.example
index 08d9154a..4755b847 100644
--- a/.env.example
+++ b/.env.example
@@ -72,6 +72,25 @@ VITE_SENTRY_DSN=
 # Sentry/GlitchTip auth token for source map upload at build time (optional)
 SENTRY_AUTH_TOKEN=
 
+# NL search — Ollama LLM inference
+# Leave APP_OLLAMA_BASE_URL empty to disable NL search (safe default for CX32 / CI).
+# Set to http://ollama:11434 to enable. Requires CX42 (16 GB RAM) to run alongside OCR.
+APP_OLLAMA_BASE_URL=http://ollama:11434
+
+# CPU limit: 4.0 is safe on both CX32 (4 vCPUs) and CX42 (8 vCPUs).
+# Raise to 7.5 on CX42 for full throughput.
+OLLAMA_CPU_LIMIT=4.0
+
+# Memory limit: requires CX42 (16 GB) to run alongside OCR.
+# Reduce or set APP_OLLAMA_BASE_URL= on smaller hosts.
+OLLAMA_MEM_LIMIT=8g
+
+# Ollama API key — set on the Ollama service to restrict inference API access on archiv-net.
+# Generate with: openssl rand -hex 32
+# NOTE: Empirically verified that OLLAMA_API_KEY is NOT enforced in Ollama 0.6.5 (ADR-028 §7).
+# Retained for forward compatibility. Leave empty for unauthenticated access.
+OLLAMA_API_KEY=
+
 # Production SMTP — uncomment and fill in to send real emails instead of catching them
 # APP_BASE_URL=https://your-domain.example.com
 # MAIL_HOST=smtp.example.com
-- 
2.49.1


From 93e90424ab9c15fe5a33d0a9085bac9fadc30956 Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 14:04:21 +0200
Subject: [PATCH 08/15] docs(adr): update ADR-028 with 0.30.6 verified findings
 for API key + read_only (#737)

- OLLAMA_API_KEY: non-enforcement confirmed on both 0.6.5 and 0.30.6
- read_only: true: confirmed working on both 0.6.5 and 0.30.6
- Peak RSS during pull: ~108 MiB (well under 2g limit)
- All TBD placeholders resolved

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 docs/adr/028-ollama-docker-compose-service.md | 46 ++++++++++++-------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/docs/adr/028-ollama-docker-compose-service.md b/docs/adr/028-ollama-docker-compose-service.md
index d1e7f41e..e65e8186 100644
--- a/docs/adr/028-ollama-docker-compose-service.md
+++ b/docs/adr/028-ollama-docker-compose-service.md
@@ -110,43 +110,55 @@ if (!apiKey.isBlank()) {
 
 Sending `Authorization: Bearer ` (empty token) has undefined or potentially broken behavior depending on the Ollama version. This mirrors the `trainingToken` guard in `RestClientOcrClient.java:107`.
 
-### 7. OLLAMA_API_KEY empty-string behavior
+### 7. OLLAMA_API_KEY behavior in Ollama 0.6.5
 
-**TBD:** Empirical verification pending on Ollama 0.6.5.
+**Empirically verified (2026-06-06) on both `0.6.5` and `0.30.6`:** `OLLAMA_API_KEY` does **not** enforce request authentication in either version.
 
-Unknown: whether `OLLAMA_API_KEY=` (explicit empty string) is treated as "no auth" (unauthenticated requests accepted) or "invalid key" (all requests rejected). Both the empty-string and fully-unset cases must be tested.
+Test matrix run against `/api/tags`:
 
-If empty-string rejects requests, the `.env.example` comment "Leave empty to run unauthenticated" must be corrected and this ADR updated.
+| Configuration | No auth header | `Authorization: Bearer ` (empty) | `Authorization: Bearer wrongkey` | `Authorization: Bearer correctkey` |
+|---|---|---|---|---|
+| `OLLAMA_API_KEY=` (empty) | 200 | 200 | — | — |
+| `OLLAMA_API_KEY` unset | 200 | — | — | — |
+| `OLLAMA_API_KEY=testkey99` | 200 | 200 | 200 | 200 |
 
-**Action item:** run empirical test (`OLLAMA_API_KEY=` vs `# OLLAMA_API_KEY` in env) and record result before merging PR.
+**Finding:** The `OLLAMA_API_KEY` environment variable is not listed in Ollama's startup config dump and does not gate any HTTP request in either tested version. All configurations — empty string, fully unset, and a real key — accept all requests without authentication.
+
+**Practical implication:** `OLLAMA_API_KEY` provides no defense-in-depth in the tested versions. `archiv-net` network isolation is the only effective security control. The env var is retained in the Compose definition and `.env.example` for forward compatibility if Ollama enables enforcement in a future version, but operators must not rely on it for access control.
+
+**Backend guard still valid:** the `RestClientOllamaClient` code-level guard (omit `Authorization` header when `apiKey.isBlank()`) remains correct behavior regardless — it prevents a malformed `Authorization: Bearer ` header from being sent.
 
 ### 8. read_only: true feasibility
 
-**TBD:** Investigation pending on Ollama 0.6.5.
+**Empirically verified (2026-06-06) on both `0.6.5` and `0.30.6`:** `read_only: true` works with Ollama. All three operations — `ollama serve`, `ollama pull qwen2.5:7b-instruct-q4_K_M`, and `ollama list` — succeeded with exit code 0 in both versions.
 
-Test command:
+Test run:
 ```bash
 docker run --rm --read-only \
   -v ollama_models:/root/.ollama \
   --tmpfs /tmp \
-  ollama/ollama:0.6.5 \
-  sh -c "ollama serve & sleep 3 && ollama pull qwen2.5:7b-instruct-q4_K_M && ollama list"
+  --entrypoint sh ollama/ollama:0.30.6 \
+  -c "ollama serve & sleep 5 && ollama pull qwen2.5:7b-instruct-q4_K_M && ollama list"
 ```
 
-All three operations (serve, pull, list) must pass to confirm no hidden write paths. Ollama may write to `/root/.config/ollama`, `/var/run`, or `/tmp/ollama*`.
+**Note:** the entrypoint must be overridden to `sh` for the test command — the container's default entrypoint is `/bin/ollama` and does not accept `sh` as a subcommand. This is a Docker invocation detail; the Compose service definition uses the image's default entrypoint and `command:` override for the init container, which works correctly.
 
-- If test succeeds: add `read_only: true` to the `ollama` service; document the tmpfs size needed.
-- If test fails: document which paths require writes and why `read_only` cannot be applied.
-
-**Action item:** run investigation before merging PR.
+**Result:** `read_only: true` and `tmpfs: - /tmp:size=512m` are applied to both `ollama` and `ollama-model-init`. The `ollama_models` volume handles all persistent writes; no other paths require write access during normal operation.
 
 ### 9. Peak RSS of init container during pull
 
-**TBD:** Investigation pending.
+**Empirically verified (2026-06-06):** Peak RSS during `qwen2.5:7b-instruct-q4_K_M` pull was **~108 MiB**.
 
-The `ollama-model-init` container currently has `mem_limit: 2g`. If peak RSS during `qwen2.5:7b-instruct-q4_K_M` pull exceeds 2 GB, bump to 4 GB.
+`docker stats` samples during the pull (15-second intervals):
 
-**Action item:** capture `docker stats` output during pull and record peak RSS here before merging PR.
+| Sample | MEM |
+|---|---|
+| 1 | 54.89 MiB |
+| 2 | 66.3 MiB |
+| 5 | 97.25 MiB |
+| 9 | **107.8 MiB** (peak) |
+
+`mem_limit: 2g` is adequate — the model weights stream directly to the named volume; RSS is dominated by the Ollama server process alone (~100 MB), not the model data. No bump to 4 GB needed.
 
 ### 10. Init container pull mechanism
 
-- 
2.49.1


From 5a939d922261dff3861e669e664c8b46bc5ba0a0 Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 14:07:45 +0200
Subject: [PATCH 09/15] fix(infra): escape \$\$SERVE_PID in compose command to
 prevent interpolation (#737)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Docker Compose interpolates $VAR in command strings — use $$ to pass a
literal $ to the shell so SERVE_PID=$! and kill $SERVE_PID work correctly.
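
As a rough illustration, once Compose has collapsed each $$ to a single $,
the shell runs a pattern like the sketch below. `sleep 30` is only a stand-in
for `ollama serve`, and the readiness loop and pull are elided; this is not
the actual service command.

```shell
# Sketch only: `sleep 30` stands in for `ollama serve` (an assumption for illustration).
demo_background_pid() {
  sleep 30 &
  SERVE_PID=$!                           # $! expands to the PID of the last background job
  # ... readiness loop and `ollama pull` would run here ...
  kill "$SERVE_PID"                      # stop the background process by PID
  wait "$SERVE_PID" 2>/dev/null || true  # reap it; ignore the non-zero status caused by the signal
  if kill -0 "$SERVE_PID" 2>/dev/null; then
    echo "still running"
  else
    echo "stopped"
  fi
}
demo_background_pid
```

If Compose had interpolated $! and $SERVE_PID itself, the shell would never
see these expansions, which is why the escape is needed.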

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 docker-compose.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docker-compose.yml b/docker-compose.yml
index a87cb84e..7a509cfe 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -162,7 +162,7 @@ services:
     security_opt:
       - no-new-privileges:true
     command: >
-      sh -c "ollama serve & SERVE_PID=$! && until curl -sf http://localhost:11434/api/tags; do sleep 1; done && ollama pull qwen2.5:7b-instruct-q4_K_M && kill $SERVE_PID"
+      sh -c "ollama serve & SERVE_PID=$$! && until curl -sf http://localhost:11434/api/tags; do sleep 1; done && ollama pull qwen2.5:7b-instruct-q4_K_M && kill $$SERVE_PID"
 
   # --- Ollama: LLM inference server ---
   # Serves the pre-pulled model for NL search inference.
-- 
2.49.1


From 3536ed884c4661a9d2f44a57fc1e8ebce6285d10 Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 14:37:03 +0200
Subject: [PATCH 10/15] =?UTF-8?q?docs(adr):=20fix=20ADR-028=20=C2=A712=20f?=
 =?UTF-8?q?alse=20API-key=20claim,=20stale=20TBD,=20and=20=C2=A77=20title?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

§12 stated OLLAMA_API_KEY guards against lateral movement — contradicts
§7's empirical finding that it is not enforced. Replaced with an accurate
note referencing §7. Stale pre-merge placeholder in Consequences ("Three
TBD items must be resolved") removed; all three are resolved. §7 section
title updated from "0.6.5" to "0.6.5 and 0.30.6" to match the body text.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 docs/adr/028-ollama-docker-compose-service.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/adr/028-ollama-docker-compose-service.md b/docs/adr/028-ollama-docker-compose-service.md
index e65e8186..24a2d1bd 100644
--- a/docs/adr/028-ollama-docker-compose-service.md
+++ b/docs/adr/028-ollama-docker-compose-service.md
@@ -110,7 +110,7 @@ if (!apiKey.isBlank()) {
 
 Sending `Authorization: Bearer ` (empty token) has undefined or potentially broken behavior depending on the Ollama version. This mirrors the `trainingToken` guard in `RestClientOcrClient.java:107`.
 
-### 7. OLLAMA_API_KEY behavior in Ollama 0.6.5
+### 7. OLLAMA_API_KEY behavior in Ollama 0.6.5 and 0.30.6
 
 **Empirically verified (2026-06-06) on both `0.6.5` and `0.30.6`:** `OLLAMA_API_KEY` does **not** enforce request authentication in either version.
 
@@ -185,7 +185,7 @@ The model is pre-pulled by `ollama-model-init` before the main service starts (v
 
 **Primary control:** `archiv-net` network isolation. Ollama has no externally exposed port (`expose:` only, not `ports:`). The Caddyfile must not route any path to the Ollama service.
 
-**Defense-in-depth:** `OLLAMA_API_KEY` guards against lateral movement from a compromised backend container.
+**Note on `OLLAMA_API_KEY`:** Per §7, `OLLAMA_API_KEY` is not enforced in Ollama 0.6.5 or 0.30.6 and provides no authentication barrier against a compromised backend container. `archiv-net` network isolation is the sole effective security control. The env var is retained for forward compatibility only — do not rely on it for access control.
 
 Both `ollama` and `ollama-model-init` receive the ADR-019 hardening baseline:
 
@@ -235,5 +235,5 @@ The init container re-pulls the model on next startup.
 
 - **Memory pressure:** OCR + Ollama together consume ~14 GB on a 16 GB host. Running the observability stack simultaneously risks OOM kills. Monitor with `docker stats`.
 - **CPU inference latency:** `qwen2.5:7b-instruct-q4_K_M` is chosen for CPU viability, but inference on 8 vCPUs will be noticeably slower than GPU-accelerated alternatives. This is acceptable for the family archive use case (low concurrency, not real-time).
-- **Three TBD items** (OLLAMA_API_KEY empty-string behavior, `read_only` feasibility, init container peak RSS) must be resolved before the PR is merged. See Decisions §7, §8, §9.
+- All three empirical TBD items from the original issue spec were resolved — see §7 (OLLAMA_API_KEY not enforced), §8 (`read_only: true` works), §9 (peak RSS ~108 MiB).
 - Model upgrades require a `docker volume rm` to free old weights before pulling the replacement. Document this in runbook/DEPLOYMENT.md.
-- 
2.49.1


From cbba95c3f893dbb602b5bdcb8892d9daedac8048 Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 14:38:49 +0200
Subject: [PATCH 11/15] =?UTF-8?q?docs(c4):=20fix=20Ollama=20container=20ve?=
 =?UTF-8?q?rsion=200.6.5=20=E2=86=92=200.30.6=20in=20l2-containers.puml?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Diagram must match the pinned image version in docker-compose.yml.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 docs/architecture/c4/l2-containers.puml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/architecture/c4/l2-containers.puml b/docs/architecture/c4/l2-containers.puml
index d72a8a5d..a0315926 100644
--- a/docs/architecture/c4/l2-containers.puml
+++ b/docs/architecture/c4/l2-containers.puml
@@ -12,7 +12,7 @@ System_Boundary(archiv, "Familienarchiv (Docker Compose)") {
     Container(frontend, "Web Frontend", "SvelteKit / Node adapter / port 3000", "Server-side rendered UI. Handles auth session cookies, document search and viewer, transcription editor, annotation layer, family tree (Stammbaum), stories (Geschichten), activity feed (Chronik), enrichment workflow, and admin panel.")
     Container(backend, "API Backend", "Spring Boot 4 / Java 21 / Jetty / port 8080", "REST API. Implements document management, search, user auth, file upload/download, transcription, OCR orchestration, and SSE notifications. Trusts X-Forwarded-* headers from Caddy.")
     Container(ocr, "OCR Service", "Python FastAPI / port 8000", "Handwritten text recognition (HTR) and OCR microservice. Single-node by design — see ADR-001. Reachable only on the internal Docker network; no external port exposed.")
-    Container(ollama, "Ollama LLM Service", "ollama/ollama:0.6.5 / port 11434 (internal only)", "Local LLM inference server for NL search. Runs qwen2.5:7b-instruct-q4_K_M on CPU. Reachable only on the internal Docker network; no external port exposed. Disabled when APP_OLLAMA_BASE_URL is unset or blank.")
+    Container(ollama, "Ollama LLM Service", "ollama/ollama:0.30.6 / port 11434 (internal only)", "Local LLM inference server for NL search. Runs qwen2.5:7b-instruct-q4_K_M on CPU. Reachable only on the internal Docker network; no external port exposed. Disabled when APP_OLLAMA_BASE_URL is unset or blank.")
     ' Named volume: ollama_models — model weights, fully reproducible, no backup needed
     ContainerDb(db, "Relational Database", "PostgreSQL 16", "Stores document metadata, persons, users, permission groups, tags, transcription blocks, audit log, and Spring Session data.")
     ContainerDb(storage, "Object Storage", "MinIO (S3-compatible)", "Stores the actual document files (PDFs, scans). Backend uses a bucket-scoped service account (archiv-app), not MinIO root.")
-- 
2.49.1


From 662a8f3e80c9afaa121a043ef5a7ce384c6a3c12 Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 14:39:21 +0200
Subject: [PATCH 12/15] fix(infra): interpolate APP_OLLAMA_BASE_URL so .env
 empty-value disables Ollama
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hardcoded literal overrides any .env setting — setting APP_OLLAMA_BASE_URL=
in .env had no effect on the backend container. Now uses the same pattern
as APP_OCR_TRAINING_TOKEN with a safe default.
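
Compose applies the same default rules as POSIX shell parameter expansion, so
the difference between `:-` and `-` can be checked in a plain shell. With `:-`
the default also replaces an empty value; with `-` it applies only when the
variable is unset, which is the form that lets an empty .env entry pass
through as empty. A minimal sketch; DEMO_URL is a hypothetical variable name
used only here:

```shell
# DEMO_URL is a hypothetical name for illustration; it is not used by the stack.
with_colon_dash() { echo "${DEMO_URL:-http://ollama:11434}"; }  # default if unset OR empty
with_dash()       { echo "${DEMO_URL-http://ollama:11434}"; }   # default only if unset

DEMO_URL=""       # set but empty, as `DEMO_URL=` in .env would be
echo "empty, ':-' -> '$(with_colon_dash)'"
echo "empty, '-'  -> '$(with_dash)'"

unset DEMO_URL
echo "unset, ':-' -> '$(with_colon_dash)'"
echo "unset, '-'  -> '$(with_dash)'"
```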

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 docker-compose.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docker-compose.yml b/docker-compose.yml
index 7a509cfe..78ac969a 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -243,7 +243,7 @@ services:
       SPRING_MAIL_PROPERTIES_MAIL_SMTP_STARTTLS_ENABLE: ${MAIL_STARTTLS_ENABLE:-false}
       APP_OCR_BASE_URL: http://ocr-service:8000
       APP_OCR_TRAINING_TOKEN: "${OCR_TRAINING_TOKEN:-}"
-      APP_OLLAMA_BASE_URL: http://ollama:11434
+      APP_OLLAMA_BASE_URL: "${APP_OLLAMA_BASE_URL-http://ollama:11434}"
       APP_OLLAMA_API_KEY: "${OLLAMA_API_KEY}"
       SENTRY_DSN: ${SENTRY_DSN:-}
       SENTRY_TRACES_SAMPLE_RATE: ${SENTRY_TRACES_SAMPLE_RATE:-1.0}
-- 
2.49.1


From 52fca38f0fc87bfd984d45eeec2538876841aa5a Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 14:39:49 +0200
Subject: [PATCH 13/15] =?UTF-8?q?docs(env):=20correct=20OLLAMA=5FAPI=5FKEY?=
 =?UTF-8?q?=20comment=20=E2=80=94=20tested=20on=200.6.5=20and=200.30.6?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Both versions were tested and neither enforces the key. Comment updated to
say "0.6.5 or 0.30.6" and surface archiv-net as the sole effective control.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 .env.example | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.env.example b/.env.example
index 4755b847..1971e5cb 100644
--- a/.env.example
+++ b/.env.example
@@ -87,8 +87,8 @@ OLLAMA_MEM_LIMIT=8g
 
 # Ollama API key — set on the Ollama service to restrict inference API access on archiv-net.
 # Generate with: openssl rand -hex 32
-# NOTE: Empirically verified that OLLAMA_API_KEY is NOT enforced in Ollama 0.6.5 (ADR-028 §7).
-# Retained for forward compatibility. Leave empty for unauthenticated access.
+# NOTE: Empirically verified that OLLAMA_API_KEY is NOT enforced in Ollama 0.6.5 or 0.30.6 (ADR-028 §7).
+# archiv-net network isolation is the only effective access control. Retained for forward compatibility.
 OLLAMA_API_KEY=
 
 # Production SMTP — uncomment and fill in to send real emails instead of catching them
-- 
2.49.1


From 3d5dcd1f183543553d6a5b50234aa24205429a3d Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 14:40:33 +0200
Subject: [PATCH 14/15] docs(deployment): fix OLLAMA_API_KEY version ref and
 add --wait warning
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Updated OLLAMA_API_KEY env vars table from 0.6.5 to 0.6.5 or 0.30.6 to
match both tested versions. Added an explicit warning in §3.4 that
docker compose up -d --wait blocks for 60–90 min on first deploy when the
model pull has not yet completed.
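
The duration scales with bandwidth. As a back-of-the-envelope check, a sketch
assuming a 4.7 GB download (decimal gigabytes) and an ideal, fully used link;
real pulls are slower, which is consistent with the 60–90 min range quoted for
10 Mbps. `pull_minutes` is a hypothetical helper:

```shell
# pull_minutes SIZE_GB LINK_MBPS -> estimated minutes, ignoring protocol overhead.
pull_minutes() {
  awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.0f\n", gb * 8000 / mbps / 60 }'
}

pull_minutes 4.7 10    # lower bound at 10 Mbps
pull_minutes 4.7 100   # lower bound at 100 Mbps
```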

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 docs/DEPLOYMENT.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md
index 452ddab4..ca2e25f8 100644
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -152,7 +152,7 @@ All vars are set in `.env` at the repo root (copy from `.env.example`). The back
 | Variable | Purpose | Default | Required? | Sensitive? |
 |---|---|---|---|---|
 | `APP_OLLAMA_BASE_URL` | Base URL for the Ollama service. Leave empty to disable NL search. | `http://ollama:11434` | — | — |
-| `APP_OLLAMA_API_KEY` | API key passed as `Authorization: Bearer` to Ollama. Leave empty for unauthenticated access. Note: `OLLAMA_API_KEY` is not enforced in Ollama 0.6.5 (see ADR-028). | — | — | YES |
+| `APP_OLLAMA_API_KEY` | API key passed as `Authorization: Bearer` to Ollama. Leave empty for unauthenticated access. Note: `OLLAMA_API_KEY` is not enforced in Ollama 0.6.5 or 0.30.6 (see ADR-028). | — | — | YES |
 | `OLLAMA_CPU_LIMIT` | Docker CPU quota for the Ollama container. On CX42 (8 vCPUs) can be raised to `7.5`. | `4.0` | — | — |
 | `OLLAMA_MEM_LIMIT` | Memory limit for the Ollama container. Requires CX42 (16 GB RAM). | `8g` | — | — |
 | `OLLAMA_API_KEY` | API key set on the Ollama service itself. Same value as `APP_OLLAMA_API_KEY`. Leave empty for unauthenticated. | — | — | YES |
@@ -278,6 +278,8 @@ git.raddatz.cloud      A   <server IP>
 ### 3.4 First deploy
 
 > **First start — Ollama model pull:** On first `docker compose up -d`, the `ollama-model-init` container pulls `qwen2.5:7b-instruct-q4_K_M` (~4.7 GB). At 10 Mbps this takes approximately 60–90 minutes; at 100 Mbps approximately 6–10 minutes. The pull is a one-time operation — subsequent restarts skip it (model already on the `ollama_models` volume). Monitor progress with `docker logs -f $(docker ps -q --filter name=ollama-model-init)`.
+>
+> **Do not use `--wait` on first deploy** — `docker compose up -d --wait` waits for all services to reach their health/completion target, including `ollama-model-init`. On first pull this blocks for 60–90 minutes and will time out any CI/deploy script that uses `--wait`.
 
 ```bash
 # 1. Trigger nightly.yml manually (Repo → Actions → nightly → "Run workflow")
-- 
2.49.1


From 7679596c70b5f2d6285888fef2fa13bb2c874752 Mon Sep 17 00:00:00 2001
From: Marcel <marcel@familienarchiv>
Date: Sat, 6 Jun 2026 14:54:58 +0200
Subject: [PATCH 15/15] docs(ollama): add model upgrade runbook + post-deploy
 smoke test to DEPLOYMENT.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Addresses Elicit's and Sara's review concerns on PR #749:
- Expand §6 ollama_models section into a full model upgrade runbook (step-by-step
  docker volume rm + recreate, including production volume name prefix)
- Add re-deploy idempotency note to §3.4 (init container exits quickly when model
  already present on the volume)
- Add NL search smoke test to §3.4 (curl command distinguishing 200 from 503
  NL_SEARCH_UNAVAILABLE)
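
For a deploy script, the distinction can be sketched as a small status check.
`nl_search_state` is a hypothetical helper; the endpoint and port are the ones
from the curl command added to §3.4:

```shell
# Hypothetical helper: map the HTTP status of the NL search endpoint to a state.
nl_search_state() {
  case "$1" in
    200) echo "active" ;;
    503) echo "unavailable" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Usage (requires the backend to be running):
#   status=$(curl -s -o /dev/null -w '%{http_code}' 'http://localhost:8080/api/nl-search?q=brief+von+grossmutter')
#   nl_search_state "$status"
nl_search_state 200
nl_search_state 503
```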

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 docs/DEPLOYMENT.md | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md
index ca2e25f8..e56ca77a 100644
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -280,6 +280,15 @@ git.raddatz.cloud      A   <server IP>
 > **First start — Ollama model pull:** On first `docker compose up -d`, the `ollama-model-init` container pulls `qwen2.5:7b-instruct-q4_K_M` (~4.7 GB). At 10 Mbps this takes approximately 60–90 minutes; at 100 Mbps approximately 6–10 minutes. The pull is a one-time operation — subsequent restarts skip it (model already on the `ollama_models` volume). Monitor progress with `docker logs -f $(docker ps -q --filter name=ollama-model-init)`.
 >
 > **Do not use `--wait` on first deploy** — `docker compose up -d --wait` waits for all services to reach their health/completion target, including `ollama-model-init`. On first pull this blocks for 60–90 minutes and will time out any CI/deploy script that uses `--wait`.
+>
+> **Re-deploy idempotency:** on subsequent `docker compose up -d` runs (including `--force-recreate`), `ollama-model-init` re-executes but exits in seconds — Ollama's CLI skips the download when the model digest already matches what is on the volume.
+>
+> **Verify NL search is active** after enabling Ollama (`APP_OLLAMA_BASE_URL=http://ollama:11434`):
+> ```bash
+> curl -si 'http://localhost:8080/api/nl-search?q=brief+von+grossmutter'
+> # Returns 200 with results → NL search is active
+> # Returns 503 NL_SEARCH_UNAVAILABLE → Ollama is not reachable or APP_OLLAMA_BASE_URL is unset
+> ```
 
 ```bash
 # 1. Trigger nightly.yml manually (Repo → Actions → nightly → "Run workflow")
@@ -576,13 +585,23 @@ bash scripts/download-kraken-models.sh
 
 > Downloads the Kurrent/Sütterlin HTR models. Run once after a fresh clone or when models are updated.
 
-### Manage the `ollama_models` volume
+### Upgrade the Ollama model
 
-> **`ollama_models` volume:** holds model weights only — fully reproducible by re-pull, no backup needed. If the volume fills after a model upgrade:
-> ```bash
-> docker volume rm ollama_models && docker compose up -d
-> ```
-> The init container re-pulls the model on next startup.
+To switch to a newer model version (e.g. a future release of `qwen2.5`):
+
+1. Update the model name in the `ollama-model-init` `command:` in `docker-compose.yml`.
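+2. Stop and remove the Ollama containers, then remove the model volume to free the old weights (Docker refuses to delete a volume that any container, even a stopped one, still references):
+   ```bash
+   docker compose rm -sf ollama ollama-model-init && docker volume rm familienarchiv_ollama_models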
+   ```
+   (In production the volume name is prefixed with the compose project: `archiv-production_ollama_models`.)
+3. Restart the stack:
+   ```bash
+   docker compose up -d
+   ```
+   The `ollama-model-init` container pulls the new model weights on first start (~4–8 GB download depending on the model). The `ollama` inference server will not start until the pull completes (`condition: service_completed_successfully`).
+
+> **`ollama_models` volume:** holds model weights only — fully reproducible by re-pull, no backup needed.
 
 ### Trigger a canonical import
 
-- 
2.49.1