familienarchiv/docs/adr/004-pdfbox-thumbnails.md
Marcel f137aa79a2 docs(adr): document layering exception and in-memory backfill state
Addresses @mkeller (Markus) — fixes(adr): "the ADR doesn't mention
in-memory BackfillStatus" and "treat this as a layering exception,
acknowledge it explicitly". Two new paragraphs under Operational caveats.

Refs #307

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 22:58:36 +02:00


ADR-004: In-Process PDFBox Thumbnails (not ocr-service)

Status

Accepted

Context

The archive lists documents as text-only rows everywhere (home search, person detail, conversation timeline, Chronik). For a fundamentally visual archive — letters, scans, handwritten pages — this is a real discoverability problem. Issue #307 introduces a small JPEG thumbnail for every document.

A viable alternative to rendering in Spring Boot is delegating to the existing ocr-service (Python), which already has PyMuPDF/PIL available and is the project's designated place for PDF pixel work. The choice is not obvious, because both options are technically workable.

Decision

Render thumbnails in-process in Spring Boot using Apache PDFBox 3.0.4 (already a dependency for training-data export). A dedicated thumbnailExecutor pool isolates the work from the shared task pool used by OCR.

  • PDF first page rendered via PDFRenderer.renderImageWithDPI(0, 100, ImageType.RGB), scaled to 240 px width (bilinear) and encoded as JPEG quality 85.
  • Non-PDF image types (JPEG, PNG, TIFF) decoded via javax.imageio — TIFF requires the twelvemonkeys-imageio-tiff plugin on the classpath.
  • Upload paths fire-and-forget via ThumbnailAsyncRunner.dispatchAfterCommit(docId); a ThumbnailBackfillService covers anything the async task missed or that pre-dates this feature.
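The scale-and-encode step from the first bullet can be sketched with the JDK alone. This is a minimal sketch, not the project's ThumbnailService: the class name ThumbnailSketch and its method names are hypothetical, and it assumes the page has already been rendered to a BufferedImage (in the PDF path, by PDFRenderer.renderImageWithDPI(0, 100, ImageType.RGB)).

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

// Hypothetical helper: names are illustrative and not taken from the codebase.
final class ThumbnailSketch {

    // Scales to a fixed width with bilinear interpolation, preserving the aspect ratio.
    static BufferedImage scaleToWidth(BufferedImage source, int targetWidth) {
        int targetHeight = Math.max(1,
                (int) Math.round((double) source.getHeight() * targetWidth / source.getWidth()));
        BufferedImage scaled = new BufferedImage(targetWidth, targetHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(source, 0, 0, targetWidth, targetHeight, null);
        } finally {
            g.dispose();
        }
        return scaled;
    }

    // Encodes as JPEG with an explicit quality in the range 0.0 to 1.0 (0.85f for "quality 85").
    static byte[] encodeJpeg(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = ImageIO.createImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }
}
```

Calling scaleToWidth(page, 240) followed by encodeJpeg(thumb, 0.85f) yields the bytes that would be stored under the thumbnails/ prefix. Images with an alpha channel would need flattening onto an opaque background first; the sketch sidesteps that by always drawing into a TYPE_INT_RGB target.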

Alternatives Considered

| Alternative | Why rejected |
|---|---|
| Delegate to ocr-service (PyMuPDF) | Adds a network hop and a failure mode to every document upload. ocr-service is not guaranteed healthy at upload time (model-loading start period is 60 s). PDFBox is already a backend dependency, so delegating is a net complexity increase. |
| Render on the frontend with pdfjs-dist at display time | Would work for PDFs but not for scans / images; list pages would need to render dozens of PDFs on first paint; no server-side caching. |
| Thumbor / imaginary / a dedicated thumbnail service | Overkill for a single-operator household tool; new container to operate and secure. |

Consequences

Easier:

  • Zero new infrastructure. thumbnails/ is a prefix in the existing MinIO bucket — production migration to Hetzner Object Storage works identically.
  • Backfill is a plain sequential loop; no inter-service retry semantics.
  • Integration test runs against real MinIO without needing ocr-service to be healthy.

Harder:

  • PDFBox is a parser attack surface. Mitigated by a 30-second watchdog timeout in ThumbnailAsyncRunner and by the fire-and-forget contract (failures never break upload).
  • Memory ceiling: the thumbnailExecutor is capped at 2 threads on the CX32 (8 GB). A busy backfill alongside OCR can approach the 3 GB heap — acceptable but not comfortable. Streaming via FileService.downloadFileStream keeps this bounded for PDFs up to 50 MB.
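The two mitigations above (a bounded pool and a watchdog timeout that never propagates failure) can be illustrated with plain java.util.concurrent. This is a sketch under assumptions, not the actual ThumbnailAsyncRunner: the class WatchdogSketch and its method are hypothetical, and for simplicity the caller blocks while waiting, which the real fire-and-forget path presumably does not.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration of "bounded pool + watchdog"; not the project's class.
final class WatchdogSketch {

    // Two worker threads, mirroring the cap described for thumbnailExecutor.
    private final ExecutorService pool = Executors.newFixedThreadPool(2);

    // Returns true if the task finished in time, false on timeout or failure. Never throws.
    boolean runWithTimeout(Callable<?> renderTask, long timeout, TimeUnit unit) {
        Future<?> future = pool.submit(renderTask);
        try {
            future.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            // Requests interruption; code that ignores interrupts may keep running.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The parser threw (for example on a malformed file); swallow so upload is unaffected.
            return false;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();
            return false;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

With the values from this ADR the call would be runWithTimeout(task, 30, TimeUnit.SECONDS). Note that Future.cancel(true) only interrupts the worker thread, so a render stuck in a tight CPU loop can still occupy one of the two threads until it returns on its own.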

Operational caveats (intentional)

Backfill state is in-memory and single-node. ThumbnailBackfillService.currentStatus is a volatile reference updated on the thumbnail executor thread. Restarting the backend mid-run loses progress and the next runBackfillAsync() starts over. This mirrors MassImportService.ImportStatus and is acceptable because the household archive runs as a single Spring Boot process, backfill is a rare one-shot admin action, and re-running the backfill is idempotent (findByFilePathIsNotNullAndThumbnailKeyIsNull() naturally skips completed documents).
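The shape of that in-memory state and the idempotent re-run can be sketched as follows. This is a minimal illustration, not the real ThumbnailBackfillService: the class BackfillSketch, its Status fields, and the map standing in for the documents table are all assumptions made for the example; the pending-document filter plays the role of findByFilePathIsNotNullAndThumbnailKeyIsNull().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for the backfill service; a Map replaces the repository.
final class BackfillSketch {

    // Immutable snapshot, replaced wholesale so readers never see a half-updated state.
    static final class Status {
        final boolean running;
        final int processed;
        final int total;

        Status(boolean running, int processed, int total) {
            this.running = running;
            this.processed = processed;
            this.total = total;
        }
    }

    // Lives only in this JVM: a restart resets it, as the caveat above describes.
    private volatile Status currentStatus = new Status(false, 0, 0);

    // docId -> thumbnail key, where null means "no thumbnail yet".
    private final Map<String, String> thumbnailKeyByDocId;

    BackfillSketch(Map<String, String> thumbnailKeyByDocId) {
        this.thumbnailKeyByDocId = thumbnailKeyByDocId;
    }

    Status currentStatus() {
        return currentStatus;
    }

    // Plain sequential loop; returns how many documents were pending at the start.
    int runBackfill(Predicate<String> renderer) {
        List<String> pending = new ArrayList<>();
        for (Map.Entry<String, String> entry : thumbnailKeyByDocId.entrySet()) {
            if (entry.getValue() == null) {
                pending.add(entry.getKey());
            }
        }
        int processed = 0;
        currentStatus = new Status(true, processed, pending.size());
        for (String docId : pending) {
            if (renderer.test(docId)) {
                thumbnailKeyByDocId.put(docId, "thumbnails/" + docId + ".jpg");
            }
            processed++;
            currentStatus = new Status(true, processed, pending.size());
        }
        currentStatus = new Status(false, processed, pending.size());
        return pending.size();
    }
}
```

Because completed documents carry a non-null key, a second run after a restart only sees whatever the interrupted run did not finish, which is why losing the in-memory counter costs nothing beyond the progress display.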

ThumbnailService and ThumbnailBackfillService inject DocumentRepository directly. This is a deliberate exception to the project's "services never reach into another domain's repository" rule. Treating thumbnails as a cross-cutting aspect of Document rather than a sub-domain avoids a circular dependency (DocumentService → ThumbnailAsyncRunner → DocumentService would close the loop). If thumbnail state grows beyond two columns into its own domain model, extract a proper ThumbnailRepository at that point, not before.

Future Direction

  • If a second image-processing job (OCR region crops, sharing previews) arrives, revisit moving all image work to ocr-service so the two share a single PyMuPDF instance.
  • If thumbnails ever need to be generated at multiple sizes, switch the key pattern from thumbnails/{docId}.jpg to thumbnails/{docId}/{width}.jpg — the endpoint and cache-bust URL are already structured to accommodate that.
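The key-pattern change in the last bullet is small enough to show directly. The helper below is hypothetical (the class ThumbnailKeys and the migrate method do not come from the codebase); only the two key patterns are taken from this ADR.

```java
// Hypothetical key helper; only the two patterns themselves come from the ADR.
final class ThumbnailKeys {

    private static final String PREFIX = "thumbnails/";
    private static final String SUFFIX = ".jpg";

    // Current pattern: thumbnails/{docId}.jpg
    static String single(String docId) {
        return PREFIX + docId + SUFFIX;
    }

    // Multi-size pattern: thumbnails/{docId}/{width}.jpg
    static String sized(String docId, int width) {
        return PREFIX + docId + "/" + width + SUFFIX;
    }

    // Maps an existing single-size key to the multi-size pattern at the given width.
    static String migrate(String oldKey, int width) {
        if (!oldKey.startsWith(PREFIX) || !oldKey.endsWith(SUFFIX)) {
            throw new IllegalArgumentException("not a thumbnail key: " + oldKey);
        }
        String docId = oldKey.substring(PREFIX.length(), oldKey.length() - SUFFIX.length());
        if (docId.isEmpty() || docId.contains("/")) {
            throw new IllegalArgumentException("not a single-size key: " + oldKey);
        }
        return sized(docId, width);
    }
}
```

If the switch ever happens, existing objects could be copied with migrate(key, 240), assuming the current 240 px width is kept as one of the sizes.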