As a user I want to generate a summary and suggest tags from the transcription so I don't have to do both by hand #310

Open
opened 2026-04-23 11:11:05 +02:00 by marcel · 0 comments

Goal

Give the user two buttons on a document that already has a transcription: "Zusammenfassung generieren" and "Tags vorschlagen". Both are user-triggered — nothing runs automatically, so there's no background job load and no surprise costs on the NAS. Everything self-hosted — no external LLM or embedding APIs, for privacy.

Architecture

Two new containers in docker-compose.yml:

  • ollama — stock Ollama image (ollama/ollama). Pulls gemma3:4b (quantized, ~3 GB) on first start. Exposes 11434 only on the internal Docker network. Model selection via env var so we can swap later (qwen2.5:7b, llama3.1:8b, etc.). Volume-mount ollama_models so the model survives container recreates.
  • lang-service — new FastAPI app under lang-service/ (mirrors the layout of ocr-service/). Owns:
    • multilingual-e5-small sentence-embedding model (~470 MB, CPU, loaded at startup).
    • HTTP client to Ollama.
    • Tag-name embedding cache (in-memory, rebuilt on startup and when the backend signals a change — see caching section).

Keeping lang-service separate from ocr-service because (a) the OCR container is already close to its 12 GB memory limit; (b) the two concerns have independent lifecycles (model upgrades, scaling, restarts).
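A minimal compose sketch of the two new services (volume mount path and env-var names are assumptions; adjust to the project's conventions):

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_models:/root/.ollama   # model survives container recreates
    # no published ports: 11434 stays on the internal Docker network

  lang-service:
    build: ./lang-service
    environment:
      OLLAMA_URL: http://ollama:11434
      OLLAMA_MODEL: gemma3:4b         # swappable via env var (qwen2.5:7b, llama3.1:8b, …)
    depends_on:
      - ollama

volumes:
  ollama_models:
```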

Frontend (Svelte)
   ↓
Backend (Spring Boot)
   ↓
lang-service (FastAPI) ───────► ollama (gemma3:4b)   [summary only]

Endpoints — lang-service

  • POST /summarize
    • Body: { "transcription": "…", "scriptType": "TYPEWRITER" | "HANDWRITING_LATIN" | "HANDWRITING_KURRENT" }
    • Response: { "summary": "…" } — single blocking call (timeout ~120 s). User watches a spinner; no streaming.
    • Prompt template (German, short): "Fasse den folgenden Brief in 1–3 Sätzen zusammen. Nenne Absender, Empfänger, Anlass und Kernaussage, falls erkennbar. Text:\n\n{transcription}"
  • POST /suggest-tags
    • Body: { "transcription": "…", "tags": [{ "id": "uuid", "name": "Reise" }, …] } — backend passes the full tag taxonomy so lang-service stays stateless about business data.
    • Response: { "suggestions": [{ "tagId": "uuid", "score": 0.71 }, …] } — top 10, sorted by cosine similarity, score ≥ configurable threshold (default 0.35).
    • Sub-second once embeddings are cached.
  • GET /health — 200 once both the embedding model and Ollama are reachable.

Endpoints — backend

  • POST /api/documents/{id}/generate-summary
    • Permission: WRITE_ALL (same as editing summary manually).
    • Requires transcription_blocks to exist for the doc; 409 otherwise.
    • Builds the transcription string by joining all blocks in sortOrder.
    • Calls lang-service /summarize. Does not mutate Document.summary — returns the generated text so the user can edit before saving via the normal edit form. This keeps the action reversible and matches how manual summaries work.
  • POST /api/documents/{id}/suggest-tags
    • Permission: READ_ALL (suggestion is non-mutating; applying tags still needs WRITE_ALL).
    • Returns [{ tagId, tagName, color, score }] — the backend hydrates the tag names/colors from lang-service's id + score response.

No changes to the Document entity or generated API types beyond two new endpoints.

Frontend

Both buttons live on the document detail page (/documents/[id]), disabled with an explanatory tooltip when no transcription exists.

  • Summary button → calls the backend, shows a spinner ("Zusammenfassung wird generiert…"), then drops the result into the summary field in the metadata edit layout. User can edit and save via the normal form action. Works on both create-summary-from-scratch and re-generate-over-existing (confirmation modal in the latter case).
  • Tags button → opens a sheet / side panel with the top-10 ranked suggestions. Each row shows tag chip + similarity score (e.g., "72 %"). Checkbox to add; "Alle übernehmen" to bulk-add. Applies via the existing tag-update endpoint. Already-assigned tags are shown greyed out and excluded from the count.

Tag-embedding cache

  • On lang-service startup, fetch the tag list via a new GET /api/tags/all backend endpoint (or have the backend POST a warm-up call).
  • Cache { tagId: embedding_vector } in memory.
  • Invalidation: when the backend creates/renames/deletes a tag in TagService, it fires a POST /internal/refresh-tags to lang-service after the DB write. Fire-and-forget from the backend's perspective; if it fails, lang-service falls back to lazy refresh on the next /suggest-tags call that sees an unknown tagId.
  • No DB column for embeddings. Keeps the schema simple; in-memory cache is fine for a taxonomy of <1k tags.
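The cache plus cosine ranking can be sketched as below. The embedding function is injected (a stand-in for multilingual-e5-small here), which is exactly what lets the top-N selection be unit-tested with a mock embedding function as the testing section suggests; class and field names are illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class TagEmbeddingCache:
    def __init__(self, embed_fn, threshold: float = 0.35, top_n: int = 10):
        self.embed_fn = embed_fn      # injected: real model in prod, mock in tests
        self.threshold = threshold
        self.top_n = top_n
        self._cache: dict[str, list[float]] = {}  # tagId -> embedding vector

    def refresh(self, tags: list[dict]) -> None:
        """Full rebuild, e.g. on startup or on POST /internal/refresh-tags."""
        self._cache = {t["id"]: self.embed_fn(t["name"]) for t in tags}

    def suggest(self, transcription: str, tags: list[dict]) -> list[dict]:
        doc_vec = self.embed_fn(transcription)
        suggestions = []
        for t in tags:
            vec = self._cache.get(t["id"])
            if vec is None:  # lazy refresh when the refresh hook was missed
                vec = self._cache[t["id"]] = self.embed_fn(t["name"])
            score = cosine(doc_vec, vec)
            if score >= self.threshold:
                suggestions.append({"tagId": t["id"], "score": round(score, 2)})
        suggestions.sort(key=lambda s: s["score"], reverse=True)
        return suggestions[: self.top_n]
```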

Failure modes

  • lang-service or ollama down → backend returns 503 with a clear ErrorCode (LANG_SERVICE_UNAVAILABLE); frontend surfaces "KI-Dienst nicht erreichbar, bitte später erneut versuchen".
  • Summary timeout (>120 s) → 504, same handling.
  • Transcription too long for the model's context window → truncate in the middle down to N tokens (keeping the letter's opening and closing — the most informative parts) and flag it in the response as { summary, truncated: true }. Frontend shows a small warning.
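The middle-truncation idea above can be sketched in a few lines (token splitting itself is left to whatever tokenizer the service uses):

```python
def truncate_middle(tokens: list, max_tokens: int) -> tuple[list, bool]:
    """Keep the head and tail of the token sequence, dropping the middle,
    so the letter's opening and closing survive. Returns (tokens, truncated)."""
    if len(tokens) <= max_tokens:
        return tokens, False
    head = max_tokens // 2
    tail = max_tokens - head
    return tokens[:head] + tokens[-tail:], True
```

The `truncated` flag feeds straight into the `{ summary, truncated: true }` response field.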

Testing

  • lang-service unit tests (pure logic):
    • prompt builder
    • cosine top-N selection with a mock embedding function
    • Ollama response parsing
  • lang-service integration test (mocked Ollama via httpx.MockTransport).
  • Backend slice tests (@WebMvcTest) for the two new endpoints with a mocked LangServiceClient.
  • Frontend component tests for both buttons (disabled states, error states).
  • No end-to-end test against real gemma3:4b — too slow for CI. Manual smoke test documented in the PR description.

Resource budget

Rough RAM on the NAS with the new containers running idle:

  • ollama (model loaded): ~4 GB
  • lang-service (embedding model): ~0.7 GB
  • Current stack (ocr-service + backend + DB + MinIO + frontend + mail): ~14 GB peak
  • New total: ~19 GB peak during a summary run (Ollama loads the model on first request; idle is lower until then).

Needs confirmation that the NAS has the headroom.

Implementation order

  1. Add ollama service to docker-compose.yml, pull gemma3:4b, verify curl works from inside the network.
  2. Scaffold lang-service/ (FastAPI, Dockerfile, /health, embedding model loading).
  3. Implement /suggest-tags first (simpler, no LLM dependency).
  4. Implement /summarize.
  5. Backend LangServiceClient + two REST endpoints + error codes.
  6. Frontend: "Tags vorschlagen" button and sheet.
  7. Frontend: "Zusammenfassung generieren" button.
  8. Tag-refresh hook from TagService.
  9. DEPLOYMENT.md update: add model-download step, new env vars.

Each step is an independent commit. No migrations needed.

Out of scope

  • Fine-tuning any model on family letters (valuable later; requires a small annotated dataset).
  • Multi-doc summaries (Briefwechsel-wide threads).
  • Language detection / English summaries — German only for now.
  • Confidence calibration on tag suggestions beyond a raw similarity threshold.
  • GPU acceleration — if CPU summary latency is too painful, revisit.
marcel added the feature label 2026-04-23 11:11:20 +02:00
Reference: marcel/familienarchiv#310