Captures why thumbnails render in-process rather than being delegated to ocr-service. Prevents a future reviewer from rehashing the decision or moving it to the Python side without knowing the trade-offs.

Refs #307
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ADR-004: In-Process PDFBox Thumbnails (not ocr-service)
Status
Accepted
Context
The archive lists documents as text-only rows everywhere (home search, person detail, conversation timeline, Chronik). For a fundamentally visual archive — letters, scans, handwritten pages — this is a real discoverability problem. Issue #307 introduces a small JPEG thumbnail for every document.
A viable alternative to rendering in Spring Boot is delegating to the existing ocr-service (Python), which already has PyMuPDF/PIL available and is the project's designated place for PDF pixel work. The comparison is not obvious: either place works.
Decision
Render thumbnails in-process in Spring Boot using Apache PDFBox 3.0.4 (already a dependency for training-data export). A dedicated `thumbnailExecutor` pool isolates the work from the shared task pool used by OCR.
- PDF first page rendered via `PDFRenderer.renderImageWithDPI(0, 100, ImageType.RGB)`, scaled to 240 px width (bilinear) and encoded as JPEG quality 85.
- Non-PDF image types (JPEG, PNG, TIFF) decoded via `javax.imageio`; TIFF requires the `twelvemonkeys-imageio-tiff` plugin on the classpath.
- Upload paths fire-and-forget via `ThumbnailAsyncRunner.dispatchAfterCommit(docId)`; a `ThumbnailBackfillService` covers anything the async task missed or that pre-dates this feature.
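
The scale-and-encode step described above can be sketched with JDK classes alone. This is a minimal illustration, not the project's actual code: the class and method names (`ThumbnailSketch`, `scaleToWidth`, `encodeJpeg`) are hypothetical, and the PDFBox render call that would supply the input image is only mentioned in a comment so the block runs without extra dependencies.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

// Hypothetical helper; names are illustrative, values come from the ADR.
final class ThumbnailSketch {
    static final int TARGET_WIDTH = 240;      // px, per the Decision section
    static final float JPEG_QUALITY = 0.85f;  // "JPEG quality 85"

    // In the described pipeline the source image would come from PDFBox:
    //   new PDFRenderer(document).renderImageWithDPI(0, 100, ImageType.RGB)

    /** Scales to the target width with bilinear interpolation, keeping the aspect ratio. */
    static BufferedImage scaleToWidth(BufferedImage src, int targetWidth) {
        float ratio = targetWidth / (float) src.getWidth();
        int targetHeight = Math.max(1, Math.round(src.getHeight() * ratio));
        BufferedImage dst = new BufferedImage(targetWidth, targetHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, targetWidth, targetHeight, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    /** Encodes an RGB image as JPEG with an explicit quality setting (0.0 to 1.0). */
    static byte[] encodeJpeg(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = ImageIO.createImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Blank stand-in for a rendered page (A4 at 100 DPI is roughly 827 x 1169 px).
        BufferedImage page = new BufferedImage(827, 1169, BufferedImage.TYPE_INT_RGB);
        BufferedImage thumb = scaleToWidth(page, TARGET_WIDTH);
        byte[] jpeg = encodeJpeg(thumb, JPEG_QUALITY);
        System.out.println(thumb.getWidth() + "x" + thumb.getHeight() + ", " + jpeg.length + " bytes");
    }
}
```

The same two methods would serve the non-PDF path, since `ImageIO.read` also yields a `BufferedImage`.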
Alternatives Considered
| Alternative | Why rejected |
|---|---|
| Delegate to `ocr-service` (PyMuPDF) | Adds a network hop and a failure mode to every document upload. `ocr-service` is not guaranteed healthy at upload time (model-loading start period is 60 s). PDFBox is already a backend dependency, so delegating is a net complexity increase. |
| Render on the frontend with `pdfjs-dist` at display time | Would work for PDFs but not for scans / images; list pages would need to render dozens of PDFs on first paint; no server-side caching. |
| Thumbor / imaginary / a dedicated thumbnail service | Overkill for a single-operator household tool; new container to operate and secure. |
Consequences
Easier:
- Zero new infrastructure. `thumbnails/` is a prefix in the existing MinIO bucket; production migration to Hetzner Object Storage works identically.
- Backfill is a plain sequential loop; no inter-service retry semantics.
- Integration test runs against real MinIO without needing `ocr-service` to be healthy.
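
To illustrate what "a plain sequential loop" might look like, here is a rough sketch. Everything in it is an assumption made for illustration: the `BackfillSketch` class, the use of string document IDs, and the `hasThumbnail` / `generate` callbacks standing in for whatever storage lookup and render call the real `ThumbnailBackfillService` uses.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical sketch of a sequential backfill; not the project's service.
final class BackfillSketch {

    /**
     * Visits each document once, in order, and generates a thumbnail where none exists.
     * A failure on one document is swallowed so the loop continues; there is no retry.
     * Returns the number of thumbnails generated.
     */
    static int backfill(List<String> docIds,
                        Predicate<String> hasThumbnail,   // assumed storage lookup
                        Consumer<String> generate) {      // assumed render + upload
        int generated = 0;
        for (String docId : docIds) {
            if (hasThumbnail.test(docId)) {
                continue; // already covered by the async path or an earlier run
            }
            try {
                generate.accept(docId);
                generated++;
            } catch (RuntimeException e) {
                // A real service would log here; the next run picks the document up again.
            }
        }
        return generated;
    }

    public static void main(String[] args) {
        int n = backfill(List.of("doc-1", "doc-2"), id -> id.equals("doc-1"), id -> { });
        System.out.println("generated " + n);
    }
}
```

Because skipped and failed documents are simply found again on the next pass, rerunning the loop is the only recovery mechanism this sketch needs.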
Harder:
- PDFBox is a parser attack surface. Mitigated by a 30-second watchdog timeout in `ThumbnailAsyncRunner` and by the fire-and-forget contract (failures never break upload).
- Memory ceiling: the `thumbnailExecutor` is capped at 2 threads on the CX32 (8 GB). A busy backfill alongside OCR can approach the 3 GB heap, which is acceptable but not comfortable. Streaming via `FileService.downloadFileStream` keeps this bounded for PDFs up to 50 MB.
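
The two mitigations above (a bounded pool plus a watchdog timeout that never propagates failures) can be sketched with `java.util.concurrent` alone. This is one possible shape under stated assumptions, not the actual `ThumbnailAsyncRunner`: the `WatchdogSketch` class and `runWithWatchdog` method are hypothetical, and in a real runner the waiting would happen off the upload thread.

```java
import java.time.Duration;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; names are illustrative.
final class WatchdogSketch {

    /**
     * Runs the task on the given pool and waits at most `timeout`.
     * Returns true only if the task completed normally in time.
     * Timeouts and task failures are reported as false, never thrown.
     */
    static boolean runWithWatchdog(ExecutorService pool, Runnable task, Duration timeout) {
        Future<?> future = pool.submit(task);
        try {
            future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // requests interruption of the worker thread
            return false;
        } catch (ExecutionException e) {
            return false;        // task threw; the caller is not affected
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();
            return false;
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2); // mirrors the 2-thread cap
        try {
            boolean fast = runWithWatchdog(pool, () -> { }, Duration.ofSeconds(30));
            boolean slow = runWithWatchdog(pool, () -> sleepQuietly(5_000), Duration.ofMillis(100));
            System.out.println("fast=" + fast + " slow=" + slow);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that `Future.cancel(true)` only interrupts the worker; code that never checks the interrupt flag keeps running, so a timeout of this kind bounds how long the caller waits rather than guaranteeing the render stops.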
Future Direction
- If a second image-processing job (OCR region crops, sharing previews) arrives, revisit moving all image work to `ocr-service` so the two share a single PyMuPDF instance.
- If thumbnails ever need to be generated at multiple sizes, switch the key pattern from `thumbnails/{docId}.jpg` to `thumbnails/{docId}/{width}.jpg`; the endpoint and cache-bust URL are already structured to accommodate that.
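
As a small illustration of the key-pattern change, the two layouts could be produced by helpers like the following. The `ThumbnailKeys` class is hypothetical, and treating `docId` as a string is an assumption; only the two key patterns come from this ADR.

```java
// Hypothetical helper showing the current and the multi-size object keys.
final class ThumbnailKeys {

    /** Current layout: one thumbnail per document. */
    static String singleSizeKey(String docId) {
        return "thumbnails/" + docId + ".jpg";
    }

    /** Possible future layout: one object per document and width. */
    static String sizedKey(String docId, int width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive: " + width);
        }
        return "thumbnails/" + docId + "/" + width + ".jpg";
    }

    public static void main(String[] args) {
        System.out.println(singleSizeKey("42"));
        System.out.println(sizedKey("42", 240));
    }
}
```

Both patterns stay under the same `thumbnails/` prefix, so a migration would only need to copy or regenerate objects within the existing bucket.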