familienarchiv/docs/adr/004-pdfbox-thumbnails.md
Marcel ca93cde06e
docs(infra): correct server specs — Hetzner Serverbörse i7-6700 64 GB, not CX32
Replace all references to the CX32 VPS (8 GB RAM, Hetzner Cloud) with the
actual production server: a Hetzner Serverbörse dedicated server with an
Intel Core i7-6700 (4C/8T, 3.4 GHz) and 64 GB RAM.

Affected files:
- .claude/personas/devops.md — monthly cost line + upgrade example
- docs/infrastructure/production-compose.md — sizing section + cost table
- docs/DEPLOYMENT.md — OCR memory table + OCR_MEM_LIMIT env var description
- docs/adr/004-pdfbox-thumbnails.md — thumbnailExecutor memory ceiling note
- docs/adr/021-tmpdir-persistent-volume-staging.md — OOMKill rationale in alternatives

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 14:51:07 +02:00


ADR-004: In-Process PDFBox Thumbnails (not ocr-service)

Status

Accepted

Context

The archive lists documents as text-only rows everywhere (home search, person detail, conversation timeline, Chronik). For a fundamentally visual archive — letters, scans, handwritten pages — this is a real discoverability problem. Issue #307 introduces a small JPEG thumbnail for every document.

A viable alternative to rendering in Spring Boot is delegating to the existing ocr-service (Python), which already has PyMuPDF/PIL available and is the project's designated place for PDF pixel work. The choice is not clear-cut: either location would work, so the trade-offs are recorded below.

Decision

Render thumbnails in-process in Spring Boot using Apache PDFBox 3.0.4 (already a dependency for training-data export). A dedicated thumbnailExecutor pool isolates the work from the shared task pool used by OCR.

  • PDF first page rendered via PDFRenderer.renderImageWithDPI(0, 100, ImageType.RGB), scaled to 240 px width (bilinear) and encoded as JPEG quality 85.
  • Non-PDF image types (JPEG, PNG, TIFF) decoded via javax.imageio — TIFF requires the twelvemonkeys-imageio-tiff plugin on the classpath.
  • Upload paths dispatch thumbnail generation fire-and-forget via ThumbnailAsyncRunner.dispatchAfterCommit(docId); a ThumbnailBackfillService covers anything the async task missed or that predates this feature.
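
The scaling and encoding steps from the first bullet can be sketched with JDK classes only. This is a minimal sketch, not the project's ThumbnailService: the class name ThumbnailScaler and both method names are hypothetical, and it assumes the source image has already been produced, for example by the PDFRenderer.renderImageWithDPI call above or by ImageIO.read.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

/** Hypothetical helper; names are illustrative and not taken from the codebase. */
final class ThumbnailScaler {
    static final int TARGET_WIDTH = 240;      // px, as stated in the ADR
    static final float JPEG_QUALITY = 0.85f;  // "quality 85" on ImageIO's 0..1 scale

    /** Scales to TARGET_WIDTH with bilinear interpolation, preserving aspect ratio. */
    static BufferedImage scaleToWidth(BufferedImage source) {
        float ratio = TARGET_WIDTH / (float) source.getWidth();
        int height = Math.max(1, Math.round(source.getHeight() * ratio));
        BufferedImage scaled = new BufferedImage(TARGET_WIDTH, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(source, 0, 0, TARGET_WIDTH, height, null);
        } finally {
            g.dispose();
        }
        return scaled;
    }

    /** Encodes an RGB image as JPEG with an explicit compression quality. */
    static byte[] encodeJpeg(BufferedImage image) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionQuality(JPEG_QUALITY);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }
}
```

At 100 DPI an A4 page renders to roughly 827 × 1169 px, so the thumbnail comes out at 240 × 339 px. Scaling into a TYPE_INT_RGB target also drops any alpha channel, which the JDK's JPEG writer would otherwise reject.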

Alternatives Considered

| Alternative | Why rejected |
|---|---|
| Delegate to ocr-service (PyMuPDF) | Adds a network hop and a failure mode to every document upload. ocr-service is not guaranteed healthy at upload time (model-loading start period is 60 s). PDFBox is already a backend dependency, so delegating is a net complexity increase. |
| Render on the frontend with pdfjs-dist at display time | Would work for PDFs but not for scans / images; list pages would need to render dozens of PDFs on first paint; no server-side caching. |
| Thumbor / imaginary / a dedicated thumbnail service | Overkill for a single-operator household tool; new container to operate and secure. |

Consequences

Easier:

  • Zero new infrastructure. thumbnails/ is a prefix in the existing MinIO bucket — production migration to Hetzner Object Storage works identically.
  • Backfill is a plain sequential loop; no inter-service retry semantics.
  • The integration test runs against a real MinIO instance without needing ocr-service to be healthy.

Harder:

  • PDFBox is a parser attack surface. Mitigated by a 30-second watchdog timeout in ThumbnailAsyncRunner and by the fire-and-forget contract (failures never break upload).
  • Memory ceiling: the thumbnailExecutor is capped at 2 threads on memory-constrained hosts. A busy backfill alongside OCR can approach the 3 GB heap on an 8 GB server — acceptable but not comfortable. The current production server (64 GB) has ample headroom. Streaming via FileService.downloadFileStream keeps this bounded for PDFs up to 50 MB.
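
The two mitigations above, a watchdog timeout and a small fixed pool, can be sketched with java.util.concurrent alone. This is an illustration under assumptions, not the actual ThumbnailAsyncRunner: the class ThumbnailWatchdog and the method runGuarded are hypothetical, and the timeout is a constructor parameter so the sketch can be exercised quickly (the ADR's value would be 30 000 ms).

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of a bounded pool plus watchdog; not the project's code. */
final class ThumbnailWatchdog {
    // Two worker threads, mirroring the cap described for memory-constrained hosts.
    // Daemon threads so a stuck render cannot keep the JVM alive on shutdown.
    private final ExecutorService thumbnailExecutor = Executors.newFixedThreadPool(2, r -> {
        Thread t = new Thread(r, "thumbnail-worker");
        t.setDaemon(true);
        return t;
    });
    private final long timeoutMillis;

    ThumbnailWatchdog(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Runs the job on the pool and waits at most timeoutMillis.
     * Returns true on success and false on timeout or failure; it never throws,
     * which is what lets the caller treat thumbnailing as fire-and-forget.
     */
    boolean runGuarded(Runnable renderJob) {
        Future<?> future = thumbnailExecutor.submit(renderJob);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // requests interruption of the worker thread
            return false;
        } catch (ExecutionException e) {
            return false;        // parser or decoding error inside the job
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        }
    }

    void shutdown() {
        thumbnailExecutor.shutdownNow();
    }
}
```

Cancellation in Java is cooperative: cancel(true) only interrupts the worker, so a render loop that never checks the interrupt flag keeps its thread busy until it finishes. With a pool of two, that is one more reason to keep the timeout short.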

Operational caveats (intentional)

Backfill state is in-memory and single-node. ThumbnailBackfillService.currentStatus is a volatile reference updated on the thumbnail executor thread. Restarting the backend mid-run loses progress and the next runBackfillAsync() starts over. This mirrors MassImportService.ImportStatus and is acceptable because the household archive runs as a single Spring Boot process, backfill is a rare one-shot admin action, and re-running the backfill is idempotent (findByFilePathIsNotNullAndThumbnailKeyIsNull() naturally skips completed documents).
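
The restart behaviour described above can be illustrated with a small in-memory model. It is a sketch under stated assumptions, not ThumbnailBackfillService itself: BackfillSketch, its Status class, and the map standing in for DocumentRepository are hypothetical, and findMissing plays the role of findByFilePathIsNotNullAndThumbnailKeyIsNull().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical model of an idempotent backfill with an in-memory status snapshot. */
final class BackfillSketch {
    /** Immutable snapshot; readers always see a consistent triple. */
    static final class Status {
        final boolean running;
        final int processed;
        final int total;

        Status(boolean running, int processed, int total) {
            this.running = running;
            this.processed = processed;
            this.total = total;
        }
    }

    // Stand-in for the repository: docId -> thumbnailKey, null meaning "no thumbnail yet".
    private final Map<String, String> thumbnailKeyByDocId;

    // Volatile reference: written by the backfill thread, read by status requests.
    private volatile Status currentStatus = new Status(false, 0, 0);

    BackfillSketch(Map<String, String> thumbnailKeyByDocId) {
        this.thumbnailKeyByDocId = thumbnailKeyByDocId;
    }

    /** Documents that still lack a thumbnail key. */
    List<String> findMissing() {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> entry : thumbnailKeyByDocId.entrySet()) {
            if (entry.getValue() == null) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }

    /** Plain sequential loop; returns how many documents were processed in this run. */
    int runBackfill() {
        List<String> pending = findMissing();
        int done = 0;
        currentStatus = new Status(true, done, pending.size());
        for (String docId : pending) {
            thumbnailKeyByDocId.put(docId, "thumbnails/" + docId + ".jpg");
            done++;
            currentStatus = new Status(true, done, pending.size());
        }
        currentStatus = new Status(false, done, pending.size());
        return done;
    }

    Status currentStatus() {
        return currentStatus;
    }
}
```

Because each run queries for missing keys again, a run that starts after a restart only touches documents that still lack a thumbnail. Only the progress counters are lost, not the completed work.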

ThumbnailService and ThumbnailBackfillService inject DocumentRepository directly. This is a deliberate exception to the project's "services never reach into another domain's repository" rule. Treating thumbnails as a cross-cutting aspect of Document rather than a sub-domain avoids a circular dependency (DocumentService → ThumbnailAsyncRunner → DocumentService would close the loop). If thumbnail state grows beyond two columns into its own domain model, extract a proper ThumbnailRepository at that point, not before.

Future Direction

  • If a second image-processing job (OCR region crops, sharing previews) arrives, revisit moving all image work to ocr-service so the two share a single PyMuPDF instance.
  • If thumbnails ever need to be generated at multiple sizes, switch the key pattern from thumbnails/{docId}.jpg to thumbnails/{docId}/{width}.jpg — the endpoint and cache-bust URL are already structured to accommodate that.
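
The two key patterns named in the last bullet can be written as a pair of helpers. This is only a sketch: the class ThumbnailKeys and its method names are hypothetical, and docId is typed as String here because the ADR does not state the identifier type.

```java
/** Hypothetical key builders for the two object-storage layouts discussed above. */
final class ThumbnailKeys {
    private static final String PREFIX = "thumbnails/";

    /** Current layout: one thumbnail per document. */
    static String singleSize(String docId) {
        return PREFIX + docId + ".jpg";
    }

    /** Possible future layout: one object per document and width. */
    static String forWidth(String docId, int width) {
        return PREFIX + docId + "/" + width + ".jpg";
    }
}
```

Both layouts share the thumbnails/ prefix, so existing objects and width-specific ones could coexist in the bucket while a migration runs.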