feat(search): NL search — auto-summaries and entity extraction per document #744

Closed
opened 2026-06-06 12:17:11 +02:00 by marcel · 0 comments
Owner

Part of epic #735. Post-v1. Depends on translation infrastructure (#741) — same Ollama pipeline.

Goal

Every letter gets a one-sentence gist plus structured extraction of places, dates, events, and people mentioned. This structured output feeds the map view, person-linking, and reading journeys.

Scope

  • Batch job: process all existing transcriptions once; incremental processing for new uploads
  • Per-document output stored in new columns or a document_enrichments table:
    • summary: one-sentence gist in modern German
    • mentioned_places: extracted place names
    • mentioned_events: extracted events/occasions
    • mentioned_persons_raw: names of people mentioned in the letter body (not sender/receiver)
  • Admin UI to trigger/re-trigger enrichment on a document or in bulk
  • Enrichment data surfaces in document detail view and feeds search ranking
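
A minimal sketch of the per-document output, assuming the enrichment step returns a JSON object keyed by the field names listed above. The `DocumentEnrichment` class, the `parse_enrichment` helper, and the JSON reply shape are illustrative assumptions, not part of the existing codebase; the storage question below stays open either way.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DocumentEnrichment:
    """Hypothetical enrichment record; field names follow the Scope list."""
    document_id: int
    summary: str = ""
    mentioned_places: list[str] = field(default_factory=list)
    mentioned_events: list[str] = field(default_factory=list)
    mentioned_persons_raw: list[str] = field(default_factory=list)


def _clean_list(value) -> list[str]:
    """Keep non-empty strings only, stripped and de-duplicated in original order."""
    if not isinstance(value, list):
        return []
    seen = set()
    result = []
    for item in value:
        if not isinstance(item, str):
            continue
        name = item.strip()
        if name and name not in seen:
            seen.add(name)
            result.append(name)
    return result


def parse_enrichment(document_id: int, raw_json: str) -> DocumentEnrichment:
    """Parse a model reply (assumed to be a JSON object); tolerate bad or missing keys."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    summary = data.get("summary")
    return DocumentEnrichment(
        document_id=document_id,
        summary=summary.strip() if isinstance(summary, str) else "",
        mentioned_places=_clean_list(data.get("mentioned_places")),
        mentioned_events=_clean_list(data.get("mentioned_events")),
        mentioned_persons_raw=_clean_list(data.get("mentioned_persons_raw")),
    )
```

Parsing defensively matters here because a local model may return malformed or partial JSON; in this sketch an unusable reply yields an empty record rather than failing the whole batch.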

Open questions before implementation

  • Storage: new document_enrichments table or extend Document entity?
  • Batch strategy: process all at once (background job) or lazily on first view?
  • Cost/time estimate: how long does one enrichment take on the CPU server? (~1–2 s × 5000 letters ≈ 1.4–2.8 hours, run sequentially)
  • Who can trigger batch enrichment? ADMIN only?
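
The cost question can be sanity-checked with a short calculation. The sketch below assumes strictly sequential processing and uses the rough figures from the list above (1–2 s per letter, 5000 letters); `estimate_batch_hours` and `pending_ids` are hypothetical helper names, and real per-letter timings on the CPU server still need to be measured.

```python
def estimate_batch_hours(letter_count: int, seconds_per_letter: float) -> float:
    """Wall-clock hours for a sequential batch, with no parallelism or retries."""
    return letter_count * seconds_per_letter / 3600


def pending_ids(all_ids: list[int], enriched_ids: list[int]) -> list[int]:
    """Documents still needing enrichment, so a re-run only handles new uploads."""
    done = set(enriched_ids)
    return [doc_id for doc_id in all_ids if doc_id not in done]


if __name__ == "__main__":
    low = estimate_batch_hours(5000, 1.0)
    high = estimate_batch_hours(5000, 2.0)
    print(f"Estimated batch time: {low:.1f} to {high:.1f} hours")
```

At these figures a one-off background job fits in an afternoon, while lazy enrichment on first view would add the per-letter latency to each page load.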
marcel added this to the Archive Intelligence — NL Search milestone 2026-06-06 12:17:11 +02:00
marcel added the P3-later, feature labels 2026-06-06 12:19:07 +02:00

Reference: marcel/familienarchiv#744