diff --git a/docs/adr/004-pdfbox-thumbnails.md b/docs/adr/004-pdfbox-thumbnails.md new file mode 100644 index 00000000..e102a8b7 --- /dev/null +++ b/docs/adr/004-pdfbox-thumbnails.md @@ -0,0 +1,43 @@ +# ADR-004: In-Process PDFBox Thumbnails (not ocr-service) + +## Status + +Accepted + +## Context + +The archive lists documents as text-only rows everywhere (home search, person detail, conversation timeline, Chronik). For a fundamentally visual archive — letters, scans, handwritten pages — this is a real discoverability problem. Issue #307 introduces a small JPEG thumbnail for every document. + +A viable alternative to rendering in Spring Boot is delegating to the existing `ocr-service` (Python), which already has PyMuPDF/PIL available and is the project's designated place for PDF pixel work. The comparison is not obvious: either place works. + +## Decision + +Render thumbnails in-process in Spring Boot using **Apache PDFBox 3.0.4** (already a dependency for training-data export). A dedicated `thumbnailExecutor` pool isolates the work from the shared task pool used by OCR. + +- PDF first page rendered via `PDFRenderer.renderImageWithDPI(0, 100, ImageType.RGB)`, scaled to 240 px width (bilinear) and encoded as JPEG quality 85. +- Non-PDF image types (JPEG, PNG, TIFF) decoded via `javax.imageio` — TIFF requires the `twelvemonkeys-imageio-tiff` plugin on the classpath. +- Upload paths fire-and-forget via `ThumbnailAsyncRunner.dispatchAfterCommit(docId)`; a `ThumbnailBackfillService` covers anything the async task missed or that pre-dates this feature. + +## Alternatives Considered + +| Alternative | Why rejected | +|---|---| +| Delegate to `ocr-service` (PyMuPDF) | Adds a network hop and a failure mode to every document upload. `ocr-service` is not guaranteed healthy at upload time (model-loading start period is 60 s). PDFBox is already a backend dependency — delegating is a net complexity increase. | +| Render on the frontend with `pdfjs-dist` at display time | Would work for PDFs but not for scans / images; list pages would need to render dozens of PDFs on first paint; no server-side caching. | +| Thumbor / imaginary / a dedicated thumbnail service | Overkill for a single-operator household tool; new container to operate and secure. | + +## Consequences + +**Easier:** +- Zero new infrastructure. `thumbnails/` is a prefix in the existing MinIO bucket — production migration to Hetzner Object Storage works identically. +- Backfill is a plain sequential loop; no inter-service retry semantics. +- Integration test runs against real MinIO without needing `ocr-service` to be healthy. + +**Harder:** +- PDFBox is a parser attack surface. Mitigated by a 30-second watchdog timeout in `ThumbnailAsyncRunner` and by the fire-and-forget contract (failures never break upload). +- Memory ceiling: the `thumbnailExecutor` is capped at 2 threads on the CX32 (8 GB). A busy backfill alongside OCR can approach the 3 GB heap — acceptable but not comfortable. Streaming via `FileService.downloadFileStream` keeps this bounded for PDFs up to 50 MB. + +## Future Direction + +- If a second image-processing job (OCR region crops, sharing previews) arrives, revisit moving all image work to `ocr-service` so the two share a single PyMuPDF instance. +- If thumbnails ever need to be generated at multiple sizes, switch the key pattern from `thumbnails/{docId}.jpg` to `thumbnails/{docId}/{width}.jpg` — the endpoint and cache-bust URL are already structured to accommodate that.