security(import): validate PDF magic bytes in MassImportService before S3 upload #529

Open
opened 2026-05-11 20:13:32 +02:00 by marcel · 0 comments
Owner

Type: Security hardening (input validation)
Priority: P1-high — prod-reachable file ingestion with no content-type check
Source: review of #526 by Nora "NullX" Steiner — comment #8648, finding 5
Parent PR: #526 (mass-import bind mount → first prod-reachable use)

Summary

Before uploading any file to S3, MassImportService MUST read the first 4 bytes and reject anything that doesn't carry the PDF magic header (%PDF). Rejections MUST be counted and surfaced in ImportStatus.

Context

MassImportService reads filenames from the ODS spreadsheet, walks /import for matching files, and uploads whatever it finds to S3 via S3Client.putObject (MassImportService.java:153, :381). Currently no content-type validation: a file named Letter-1947.pdf could be a renamed .zip (potentially exploiting downstream PDF parsers), a .html (XSS via signed S3 URL preview), or arbitrary bytes — and the OCR/thumbnail pipeline would then process the garbage.

With #526 making this endpoint prod-reachable, the upload-anything behaviour is now an issue.

Required

  1. Before each s3Client.putObject(...) call in MassImportService, read the first 4 bytes of the file and reject anything not matching the PDF magic header (25 50 44 46 = %PDF).
  2. Reject and log the rejection with originalFilename and the actual first bytes (sanitized via SLF4J {} placeholder, per Nora's Log4Shell point).
  3. Surface the rejection in the ImportStatus — counter of skipped files alongside the existing processed counter, so the operator sees what happened.
  4. Permanent regression test: temp dir with one valid PDF (header %PDF-) and one renamed text file (txt-as-pdf.pdf containing not a pdf); assert one upload + one skip + status message lists the rejected filename.

Out of scope

  • MIME-type sniffing beyond the PDF case. The ODS payload is well-defined: ODS spreadsheet + PDFs. Extending to images (JPEG/PNG magic) is a separate decision.

Acceptance criteria

  • Failing-then-passing regression test as above
  • ImportStatus exposes a skipped count and a skippedFiles list (or similar)
  • Frontend admin status card surfaces the skipped count (per Leonie's follow-up #533 too)
  • Rejection log line uses SLF4J {} placeholders, not string concatenation

Linked NFRs

  • Security: File uploads MUST be validated against an expected magic byte signature before being persisted to S3.
  • Observability: Skipped files MUST be visible in ImportStatus (count + filename list).
  • Usability (admin): Operator MUST be able to tell at a glance from the admin status card how many files were skipped and why.

Definition of Ready

  • Problem statement clear
  • Magic byte spec (%PDF) defined
  • Status surface change identified
  • Acceptance criteria testable

🤖 Generated with Claude Code during /implement on #526

**Type:** Security hardening (input validation) **Priority:** P1-high — prod-reachable file ingestion with no content-type check **Source:** review of #526 by Nora "NullX" Steiner — comment [#8648](https://git.raddatz.cloud/marcel/familienarchiv/pulls/526#issuecomment-8648), finding 5 **Parent PR:** #526 (mass-import bind mount → first prod-reachable use) ## Summary Before uploading any file to S3, `MassImportService` MUST read the first 4 bytes and reject anything that doesn't carry the PDF magic header (`%PDF`). Rejections MUST be counted and surfaced in `ImportStatus`. ## Context `MassImportService` reads filenames from the ODS spreadsheet, walks `/import` for matching files, and uploads whatever it finds to S3 via `S3Client.putObject` (`MassImportService.java:153`, `:381`). Currently no content-type validation: a file *named* `Letter-1947.pdf` could be a renamed `.zip` (potentially exploiting downstream PDF parsers), a `.html` (XSS via signed S3 URL preview), or arbitrary bytes — and the OCR/thumbnail pipeline would then process the garbage. With #526 making this endpoint prod-reachable, the upload-anything behaviour is now an issue. ## Required 1. Before each `s3Client.putObject(...)` call in `MassImportService`, read the first 4 bytes of the file and reject anything not matching the PDF magic header (`25 50 44 46` = `%PDF`). 2. Reject and log the rejection with `originalFilename` and the actual first bytes (sanitized via SLF4J `{}` placeholder, per Nora's Log4Shell point). 3. Surface the rejection in the `ImportStatus` — counter of skipped files alongside the existing `processed` counter, so the operator sees what happened. 4. Permanent regression test: temp dir with one valid PDF (header `%PDF-`) and one renamed text file (`txt-as-pdf.pdf` containing `not a pdf`); assert one upload + one skip + status message lists the rejected filename. ## Out of scope - MIME-type sniffing beyond the PDF case. The ODS payload is well-defined: ODS spreadsheet + PDFs. Extending to images (JPEG/PNG magic) is a separate decision. ## Acceptance criteria - [ ] Failing-then-passing regression test as above - [ ] `ImportStatus` exposes a `skipped` count and a `skippedFiles` list (or similar) - [ ] Frontend admin status card surfaces the skipped count (per Leonie's follow-up #533 too) - [ ] Rejection log line uses SLF4J `{}` placeholders, not string concatenation ## Linked NFRs - **Security:** File uploads MUST be validated against an expected magic byte signature before being persisted to S3. - **Observability:** Skipped files MUST be visible in `ImportStatus` (count + filename list). - **Usability (admin):** Operator MUST be able to tell at a glance from the admin status card how many files were skipped and why. ## Definition of Ready - [x] Problem statement clear - [x] Magic byte spec (`%PDF`) defined - [x] Status surface change identified - [x] Acceptance criteria testable 🤖 Generated with [Claude Code](https://claude.com/claude-code) during /implement on #526
marcel added the P1-highfile-uploadsecurity labels 2026-05-11 20:36:40 +02:00
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#529