fix(security): validate file upload MIME type from magic bytes, not client header #84

Open
opened 2026-03-27 09:24:00 +01:00 by marcel · 1 comment
Owner

Security Issue — HIGH

Found in: backend/src/main/java/org/raddatz/familienarchiv/controller/DocumentController.java (lines ~122–143)

The vulnerable pattern

private static final Set<String> ALLOWED_CONTENT_TYPES = Set.of(
        "application/pdf", "image/jpeg", "image/png", "image/tiff");

for (MultipartFile file : files) {
    if (!ALLOWED_CONTENT_TYPES.contains(file.getContentType())) {
        errors.add(new UploadError(file.getOriginalFilename(), "UNSUPPORTED_FILE_TYPE"));
        continue;
    }

file.getContentType() reads the Content-Type value from the multipart HTTP header — which the client controls entirely. An attacker can upload an HTML file containing <script>alert(document.cookie)</script> with the header Content-Type: image/jpeg and it passes the check. The file is stored in MinIO and later served back to users.

Attack path:

  1. Upload evil.html with Content-Type: image/jpeg in the multipart header.
  2. Share the MinIO download link.
  3. If the browser receives it with Content-Type: text/html (or if the stored content-type is later lost), it executes the script — stored XSS via file upload.

The fix

Use Apache Tika to detect the real MIME type from the file's magic bytes, independent of what the client claims. Move the validation into FileService so it cannot be bypassed at the controller level.

Add dependency to backend/pom.xml:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.9.2</version>
</dependency>

FileService.java:

private static final Set<String> ALLOWED_MIME_TYPES = Set.of(
        "application/pdf", "image/jpeg", "image/png", "image/tiff");

private String detectMimeType(byte[] bytes) throws IOException {
    Tika tika = new Tika();
    return tika.detect(bytes);
}

public UploadResult uploadFile(MultipartFile file, String originalFilename) throws IOException {
    byte[] bytes = file.getBytes();

    String detectedType = detectMimeType(bytes);
    if (!ALLOWED_MIME_TYPES.contains(detectedType)) {
        throw DomainException.internal(ErrorCode.FILE_UPLOAD_FAILED,
                "File type not allowed: " + detectedType);
    }

    // Use the server-detected type — not the client-supplied one
    String s3Key = "documents/" + UUID.randomUUID() + "_" + originalFilename;
    PutObjectRequest putObjectRequest = PutObjectRequest.builder()
            .bucket(bucketName)
            .key(s3Key)
            .contentType(detectedType)   // ← server-detected, not file.getContentType()
            .build();
    // ...
}

Remove the duplicate content-type check from DocumentController — validation now lives in the service.

Why

Magic bytes are the first few bytes of a file that identify its real format — they cannot be faked without corrupting the file itself. A PDF always starts with %PDF, a JPEG with FF D8 FF. The HTTP Content-Type header is just a string the client types — it has no binding relationship to the actual bytes.

Detection in CI

@Test
void rejectsHtmlFileDisguisedAsJpeg() throws Exception {
    byte[] htmlBytes = "<script>alert(1)</script>".getBytes();
    MockMultipartFile f = new MockMultipartFile("files", "evil.jpg", "image/jpeg", htmlBytes);
    mockMvc.perform(multipart("/api/documents/{id}/files", docId).file(f))
            .andExpect(status().isBadRequest());
}

Priority

HIGH — exploitable without authentication bypass, leads to stored XSS.

## Security Issue — HIGH **Found in:** `backend/src/main/java/org/raddatz/familienarchiv/controller/DocumentController.java` (lines ~122–143) ### The vulnerable pattern ```java private static final Set<String> ALLOWED_CONTENT_TYPES = Set.of( "application/pdf", "image/jpeg", "image/png", "image/tiff"); for (MultipartFile file : files) { if (!ALLOWED_CONTENT_TYPES.contains(file.getContentType())) { errors.add(new UploadError(file.getOriginalFilename(), "UNSUPPORTED_FILE_TYPE")); continue; } ``` `file.getContentType()` reads the `Content-Type` value from the multipart HTTP header — which the client controls entirely. An attacker can upload an HTML file containing `<script>alert(document.cookie)</script>` with the header `Content-Type: image/jpeg` and it passes the check. The file is stored in MinIO and later served back to users. **Attack path:** 1. Upload `evil.html` with `Content-Type: image/jpeg` in the multipart header. 2. Share the MinIO download link. 3. If the browser receives it with `Content-Type: text/html` (or if the stored content-type is later lost), it executes the script — stored XSS via file upload. ### The fix Use **Apache Tika** to detect the real MIME type from the file's magic bytes, independent of what the client claims. Move the validation into `FileService` so it cannot be bypassed at the controller level. **Add dependency** to `backend/pom.xml`: ```xml <dependency> <groupId>org.apache.tika</groupId> <artifactId>tika-core</artifactId> <version>2.9.2</version> </dependency> ``` **FileService.java:** ```java private static final Set<String> ALLOWED_MIME_TYPES = Set.of( "application/pdf", "image/jpeg", "image/png", "image/tiff"); private String detectMimeType(byte[] bytes) throws IOException { Tika tika = new Tika(); return tika.detect(bytes); } public UploadResult uploadFile(MultipartFile file, String originalFilename) throws IOException { byte[] bytes = file.getBytes(); String detectedType = detectMimeType(bytes); if (!ALLOWED_MIME_TYPES.contains(detectedType)) { throw DomainException.internal(ErrorCode.FILE_UPLOAD_FAILED, "File type not allowed: " + detectedType); } // Use the server-detected type — not the client-supplied one String s3Key = "documents/" + UUID.randomUUID() + "_" + originalFilename; PutObjectRequest putObjectRequest = PutObjectRequest.builder() .bucket(bucketName) .key(s3Key) .contentType(detectedType) // ← server-detected, not file.getContentType() .build(); // ... } ``` Remove the duplicate content-type check from `DocumentController` — validation now lives in the service. ### Why Magic bytes are the first few bytes of a file that identify its real format — they cannot be faked without corrupting the file itself. A PDF always starts with `%PDF`, a JPEG with `FF D8 FF`. The HTTP `Content-Type` header is just a string the client types — it has no binding relationship to the actual bytes. ### Detection in CI ```java @Test void rejectsHtmlFileDisguisedAsJpeg() throws Exception { byte[] htmlBytes = "<script>alert(1)</script>".getBytes(); MockMultipartFile f = new MockMultipartFile("files", "evil.jpg", "image/jpeg", htmlBytes); mockMvc.perform(multipart("/api/documents/{id}/files", docId).file(f)) .andExpect(status().isBadRequest()); } ``` ### Priority **HIGH — exploitable without authentication bypass, leads to stored XSS.**
marcel added the security label 2026-03-27 12:29:50 +01:00
Author
Owner

Audit confirmation (2026-05-07)

Pre-prod audit confirms this is still present. backend/src/main/java/org/raddatz/familienarchiv/filestorage/FileService.java:46-66 reads the client-supplied Content-Type header from MultipartFile without server-side magic-byte detection. DocumentController.java:179-180 validates against an allowlist (application/pdf, image/jpeg, image/png, image/tiff) but trusts the header.

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
</dependency>
String detected = new Tika().detect(file.getInputStream(), file.getOriginalFilename());
if (!ALLOWED_TYPES.contains(detected)) {
    throw DomainException.badRequest(ErrorCode.INVALID_FILE_TYPE,
        "Detected: " + detected);
}

Tika reads the magic bytes (PDF: %PDF-, JPEG: FF D8 FF, PNG: 89 50 4E 47, TIFF: 49 49 2A 00 / 4D 4D 00 2A) and ignores the client-sent header. Pair with Content-Disposition: attachment and the X-Content-Type-Options: nosniff header (already set by Spring Security defaults on /api/*) to prevent browser MIME sniffing on download.

Suggested AC tightening

  • Tika check happens before the file is streamed to MinIO (not after).
  • Test fixture: a .exe renamed to evil.pdf with Content-Type: application/pdf → rejected with 400 + INVALID_FILE_TYPE code.
  • Test fixture: a real PDF accepted regardless of the client-sent Content-Type.

Tracked in audit doc as F-11 (High). See docs/audits/2026-05-07-pre-prod-architectural-review.md.

## Audit confirmation (2026-05-07) Pre-prod audit confirms this is **still present**. `backend/src/main/java/org/raddatz/familienarchiv/filestorage/FileService.java:46-66` reads the client-supplied `Content-Type` header from `MultipartFile` without server-side magic-byte detection. `DocumentController.java:179-180` validates against an allowlist (`application/pdf`, `image/jpeg`, `image/png`, `image/tiff`) but trusts the header. ### Recommended fix — Apache Tika ```xml <dependency> <groupId>org.apache.tika</groupId> <artifactId>tika-core</artifactId> </dependency> ``` ```java String detected = new Tika().detect(file.getInputStream(), file.getOriginalFilename()); if (!ALLOWED_TYPES.contains(detected)) { throw DomainException.badRequest(ErrorCode.INVALID_FILE_TYPE, "Detected: " + detected); } ``` Tika reads the magic bytes (PDF: `%PDF-`, JPEG: `FF D8 FF`, PNG: `89 50 4E 47`, TIFF: `49 49 2A 00` / `4D 4D 00 2A`) and ignores the client-sent header. Pair with `Content-Disposition: attachment` and the `X-Content-Type-Options: nosniff` header (already set by Spring Security defaults on `/api/*`) to prevent browser MIME sniffing on download. ### Suggested AC tightening - [ ] Tika check happens **before** the file is streamed to MinIO (not after). - [ ] Test fixture: a `.exe` renamed to `evil.pdf` with `Content-Type: application/pdf` → rejected with 400 + `INVALID_FILE_TYPE` code. - [ ] Test fixture: a real PDF accepted regardless of the client-sent Content-Type. Tracked in audit doc as **F-11** (High). See `docs/audits/2026-05-07-pre-prod-architectural-review.md`.
Sign in to join this conversation.
No Label security
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#84