marcel/familienarchiv

Marcel 39eaa10d85

docs(adr): record ADR-004 — PDFBox thumbnails stay in Spring Boot

Captures why thumbnails render in-process rather than being delegated
to ocr-service. Prevents a future reviewer from rehashing the decision
or moving it to the Python side without knowing the trade-offs.

Refs #307

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-22 22:43:27 +02:00


ADR-004: In-Process PDFBox Thumbnails (not ocr-service)

Status

Accepted

Context

The archive lists documents as text-only rows everywhere (home search, person detail, conversation timeline, Chronik). For a fundamentally visual archive — letters, scans, handwritten pages — this is a real discoverability problem. Issue #307 introduces a small JPEG thumbnail for every document.

A viable alternative to rendering in Spring Boot is delegating to the existing ocr-service (Python), which already has PyMuPDF/PIL available and is the project's designated place for PDF pixel work. The choice is not clear-cut: both options are technically workable.

Decision

Render thumbnails in-process in Spring Boot using Apache PDFBox 3.0.4 (already a dependency for training-data export). A dedicated thumbnailExecutor pool isolates the work from the shared task pool used by OCR.

The first page of a PDF is rendered via PDFRenderer.renderImageWithDPI(0, 100, ImageType.RGB), scaled to a width of 240 px (bilinear), and encoded as JPEG at quality 85.
Non-PDF image types (JPEG, PNG, TIFF) are decoded via javax.imageio; TIFF requires the twelvemonkeys-imageio-tiff plugin on the classpath.
Upload paths dispatch thumbnail generation fire-and-forget via ThumbnailAsyncRunner.dispatchAfterCommit(docId); a ThumbnailBackfillService covers anything the async task missed or that predates this feature.
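
The scale-and-encode step described above can be sketched with the JDK alone. This is a minimal illustration, not the project's actual code: the class name ThumbnailSketch and both method names are hypothetical, and the input image is assumed to be whatever PDFRenderer.renderImageWithDPI(0, 100, ImageType.RGB) or javax.imageio produced upstream.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

/** Hypothetical helper; names are illustrative and not taken from the codebase. */
public final class ThumbnailSketch {

    private ThumbnailSketch() {}

    /** Scales to the target width with bilinear interpolation, keeping the aspect ratio. */
    public static BufferedImage scaleToWidth(BufferedImage source, int targetWidth) {
        int targetHeight = Math.max(1,
                Math.round(source.getHeight() * (float) targetWidth / source.getWidth()));
        // TYPE_INT_RGB (no alpha channel) so the result can be written as JPEG.
        BufferedImage scaled =
                new BufferedImage(targetWidth, targetHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(source, 0, 0, targetWidth, targetHeight, null);
        } finally {
            g.dispose();
        }
        return scaled;
    }

    /** Encodes as JPEG; quality is in [0, 1], so "quality 85" corresponds to 0.85f. */
    public static byte[] encodeJpeg(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = ImageIO.createImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }
}
```

Under these assumptions a caller would chain the two steps as ThumbnailSketch.encodeJpeg(ThumbnailSketch.scaleToWidth(page, 240), 0.85f) and store the resulting bytes under the thumbnails/ prefix.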

Alternatives Considered

| Alternative | Why rejected |
| --- | --- |
| Delegate to `ocr-service` (PyMuPDF) | Adds a network hop and a failure mode to every document upload. `ocr-service` is not guaranteed healthy at upload time (model-loading start period is 60 s). PDFBox is already a backend dependency; delegating is a net complexity increase. |
| Render on the frontend with `pdfjs-dist` at display time | Would work for PDFs but not for scans / images; list pages would need to render dozens of PDFs on first paint; no server-side caching. |
| Thumbor / imaginary / a dedicated thumbnail service | Overkill for a single-operator household tool; new container to operate and secure. |

Consequences

Easier:

Zero new infrastructure. thumbnails/ is a prefix in the existing MinIO bucket — production migration to Hetzner Object Storage works identically.
Backfill is a plain sequential loop; no inter-service retry semantics.
Integration test runs against real MinIO without needing ocr-service to be healthy.
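
To illustrate what "a plain sequential loop" can look like, here is a minimal sketch. It is not the ThumbnailBackfillService itself: the class BackfillSketch, the String document IDs, and the two callbacks are hypothetical stand-ins, and skipping a failed document instead of retrying it is an assumption made for the example.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical sketch of a sequential backfill; not the project's service. */
public final class BackfillSketch {

    private BackfillSketch() {}

    /**
     * Walks the documents one at a time and generates any missing thumbnail.
     * Returns how many thumbnails were generated in this pass.
     */
    public static int backfill(List<String> docIds,
                               Predicate<String> hasThumbnail,
                               Consumer<String> generate) {
        int generated = 0;
        for (String docId : docIds) {
            if (hasThumbnail.test(docId)) {
                continue; // already covered by the upload path or an earlier pass
            }
            try {
                generate.accept(docId);
                generated++;
            } catch (RuntimeException e) {
                // Assumption: one unreadable document must not stop the whole pass.
                System.err.println("thumbnail backfill skipped " + docId + ": " + e.getMessage());
            }
        }
        return generated;
    }
}
```

Because the loop is idempotent under these assumptions, a later pass simply picks up whatever an earlier one skipped.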

Harder:

PDFBox is a parser attack surface. Mitigated by a 30-second watchdog timeout in ThumbnailAsyncRunner and by the fire-and-forget contract (failures never break upload).
Memory ceiling: the thumbnailExecutor is capped at 2 threads on the CX32 (8 GB). A busy backfill alongside OCR can approach the 3 GB heap — acceptable but not comfortable. Streaming via FileService.downloadFileStream keeps this bounded for PDFs up to 50 MB.
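
The two mitigations above, a small dedicated pool and a watchdog timeout, can be sketched with java.util.concurrent. This is an illustration under stated assumptions, not the code in ThumbnailAsyncRunner: the class WatchdogSketch and its method names are hypothetical, and the caller blocks here only to keep the example short, whereas a fire-and-forget dispatch would run the wait off the request thread.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: a 2-thread pool plus a watchdog timeout around each job. */
public final class WatchdogSketch {

    // Capped at 2 threads, mirroring the limit stated in the ADR.
    private final ExecutorService thumbnailExecutor = Executors.newFixedThreadPool(2);

    /**
     * Runs the job on the dedicated pool and waits at most the given timeout.
     * Returns true on success; returns false on timeout or failure and never throws,
     * so a broken document cannot break the caller.
     */
    public boolean runWithWatchdog(Runnable job, long timeout, TimeUnit unit) {
        Future<?> future = thumbnailExecutor.submit(job);
        try {
            future.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // requests interruption of the worker thread
            return false;
        } catch (ExecutionException e) {
            return false; // the job itself threw
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public void shutdown() {
        thumbnailExecutor.shutdownNow();
    }
}
```

With the ADR's value the call would be runWithWatchdog(job, 30, TimeUnit.SECONDS). Note that Future.cancel(true) only interrupts the worker; code that never checks the interrupt flag keeps occupying its thread until it finishes on its own.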

Future Direction

If a second image-processing job (OCR region crops, sharing previews) arrives, revisit moving all image work to ocr-service so the two share a single PyMuPDF instance.
If thumbnails ever need to be generated at multiple sizes, switch the key pattern from thumbnails/{docId}.jpg to thumbnails/{docId}/{width}.jpg — the endpoint and cache-bust URL are already structured to accommodate that.
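
The two key patterns mentioned above differ only in how the object key is built. A small sketch, assuming String document IDs and a hypothetical ThumbnailKeys helper that does not exist in the codebase as far as this document states:

```java
/** Hypothetical helper showing the current and the possible multi-size key layout. */
public final class ThumbnailKeys {

    private ThumbnailKeys() {}

    /** Current pattern: thumbnails/{docId}.jpg */
    public static String singleSize(String docId) {
        return "thumbnails/" + docId + ".jpg";
    }

    /** Possible future pattern: thumbnails/{docId}/{width}.jpg */
    public static String forWidth(String docId, int width) {
        return "thumbnails/" + docId + "/" + width + ".jpg";
    }
}
```

Both layouts stay under the same thumbnails/ prefix, so bucket-level configuration would not need to change if the second pattern were adopted.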
