docs(adr): record ADR-004 — PDFBox thumbnails stay in Spring Boot
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m45s
CI / OCR Service Tests (push) Successful in 34s
CI / Backend Unit Tests (push) Failing after 2m57s
CI / Unit & Component Tests (pull_request) Failing after 2m37s
CI / OCR Service Tests (pull_request) Successful in 33s
CI / Backend Unit Tests (pull_request) Failing after 2m50s
Some checks failed
CI / Unit & Component Tests (push) Failing after 2m45s
CI / OCR Service Tests (push) Successful in 34s
CI / Backend Unit Tests (push) Failing after 2m57s
CI / Unit & Component Tests (pull_request) Failing after 2m37s
CI / OCR Service Tests (pull_request) Successful in 33s
CI / Backend Unit Tests (pull_request) Failing after 2m50s
Captures why thumbnails render in-process rather than being delegated to ocr-service. Prevents a future reviewer from rehashing the decision or moving it to the Python side without knowing the trade-offs. Refs #307 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
43
docs/adr/004-pdfbox-thumbnails.md
Normal file
43
docs/adr/004-pdfbox-thumbnails.md
Normal file
@@ -0,0 +1,43 @@
|
||||
# ADR-004: In-Process PDFBox Thumbnails (not ocr-service)
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The archive lists documents as text-only rows everywhere (home search, person detail, conversation timeline, Chronik). For a fundamentally visual archive — letters, scans, handwritten pages — this is a real discoverability problem. Issue #307 introduces a small JPEG thumbnail for every document.
|
||||
|
||||
A viable alternative to rendering in Spring Boot is delegating to the existing `ocr-service` (Python), which already has PyMuPDF/PIL available and is the project's designated place for PDF pixel work. The comparison is not obvious: either place works.
|
||||
|
||||
## Decision
|
||||
|
||||
Render thumbnails in-process in Spring Boot using **Apache PDFBox 3.0.4** (already a dependency for training-data export). A dedicated `thumbnailExecutor` pool isolates the work from the shared task pool used by OCR.
|
||||
|
||||
- PDF first page rendered via `PDFRenderer.renderImageWithDPI(0, 100, ImageType.RGB)`, scaled to 240 px width (bilinear) and encoded as JPEG quality 85.
|
||||
- Non-PDF image types (JPEG, PNG, TIFF) decoded via `javax.imageio` — TIFF requires the `twelvemonkeys-imageio-tiff` plugin on the classpath.
|
||||
- Upload paths fire-and-forget via `ThumbnailAsyncRunner.dispatchAfterCommit(docId)`; a `ThumbnailBackfillService` covers anything the async task missed or that pre-dates this feature.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
| Alternative | Why rejected |
|
||||
|---|---|
|
||||
| Delegate to `ocr-service` (PyMuPDF) | Adds a network hop and a failure mode to every document upload. `ocr-service` is not guaranteed healthy at upload time (model-loading start period is 60 s). PDFBox is already a backend dependency — delegating is a net complexity increase. |
|
||||
| Render on the frontend with `pdfjs-dist` at display time | Would work for PDFs but not for scans / images; list pages would need to render dozens of PDFs on first paint; no server-side caching. |
|
||||
| Thumbor / imaginary / a dedicated thumbnail service | Overkill for a single-operator household tool; new container to operate and secure. |
|
||||
|
||||
## Consequences
|
||||
|
||||
**Easier:**
|
||||
- Zero new infrastructure. `thumbnails/` is a prefix in the existing MinIO bucket — production migration to Hetzner Object Storage works identically.
|
||||
- Backfill is a plain sequential loop; no inter-service retry semantics.
|
||||
- Integration test runs against real MinIO without needing `ocr-service` to be healthy.
|
||||
|
||||
**Harder:**
|
||||
- PDFBox is a parser attack surface. Mitigated by a 30-second watchdog timeout in `ThumbnailAsyncRunner` and by the fire-and-forget contract (failures never break upload).
|
||||
- Memory ceiling: the `thumbnailExecutor` is capped at 2 threads on the CX32 (8 GB). A busy backfill alongside OCR can approach the 3 GB heap — acceptable but not comfortable. Streaming via `FileService.downloadFileStream` keeps this bounded for PDFs up to 50 MB.
|
||||
|
||||
## Future Direction
|
||||
|
||||
- If a second image-processing job (OCR region crops, sharing previews) arrives, revisit moving all image work to `ocr-service` so the two share a single PyMuPDF instance.
|
||||
- If thumbnails ever need to be generated at multiple sizes, switch the key pattern from `thumbnails/{docId}.jpg` to `thumbnails/{docId}/{width}.jpg` — the endpoint and cache-bust URL are already structured to accommodate that.
|
||||
Reference in New Issue
Block a user