security(import): validate PDF magic bytes before S3 upload #618

Merged
marcel merged 10 commits from worktree-feat+issue-529-pdf-magic-bytes into main 2026-05-19 09:45:04 +02:00

10 Commits

Author SHA1 Message Date
Marcel
2f30e847a7 fix(import): address non-blocking review feedback — touch target, glossary, edge-case test
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m27s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Successful in 3m9s
CI / Semgrep Security Scan (pull_request) Successful in 19s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s
CI / fail2ban Regex (pull_request) Successful in 1m35s
- Add min-h-[44px] py-2 to <summary> in ImportStatusCard for 44 px touch target
- Add SkippedFile and skipped count entries to docs/GLOSSARY.md
- Add MassImportServiceTest case: ALREADY_EXISTS fires before file I/O when doc is UPLOADED and file is present on disk

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 08:45:39 +02:00
Marcel
9443d8e668 fix(import): surface S3 failures + already-exists in skippedFiles, a11y + max-height
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m21s
CI / OCR Service Tests (pull_request) Successful in 21s
CI / Backend Unit Tests (pull_request) Successful in 3m6s
CI / fail2ban Regex (pull_request) Successful in 41s
CI / Semgrep Security Scan (pull_request) Successful in 19s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
- Change importSingleDocument return type from boolean to Optional<String>
  so callers in processRows receive the skip reason on every non-success path.
  S3 upload failures now surface as "S3_UPLOAD_FAILED" and already-imported
  documents as "ALREADY_EXISTS" in the skippedFiles list shown in the admin UI.
- Add two new tests: runImportAsync_addsS3UploadFailed_toSkippedFiles and
  runImportAsync_addsAlreadyExists_toSkippedFiles; update
  importSingleDocument_skips_whenDocumentAlreadyUploadedNotPlaceholder and
  the S3-failure test to assert on the Optional return value.
- Add i18n keys for S3_UPLOAD_FAILED and ALREADY_EXISTS in de/en/es messages.
- Svelte ImportStatusCard: add aria-hidden="true" to SVG chevron, wrap
  conditional warning section in aria-live="polite" div, add max-h-64
  overflow-y-auto to skipped-files <ul> to cap height on large batches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 07:44:12 +02:00
Marcel
f144b025b0 fix(import): address round-3 review concerns
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m13s
CI / OCR Service Tests (pull_request) Successful in 19s
CI / Backend Unit Tests (pull_request) Successful in 3m8s
CI / fail2ban Regex (pull_request) Successful in 41s
CI / Semgrep Security Scan (pull_request) Successful in 19s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s
- Add comment to openFileStream() explaining package-private visibility
  is intentional (Mockito spy seam for IOException test)
- Key {#each} skippedFiles by filename instead of array index
- Add test: skipped section hidden when state is FAILED
- Add test: reasonLabel returns raw code for unknown reason strings

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 19:04:06 +02:00
Marcel
b38b8c39b8 fix(admin): address round-2 review concerns on ImportStatusCard
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m9s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Successful in 3m1s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 19s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s
- Use loop index as each key (handles duplicate filenames)
- Increase skipped filename font from text-xs to text-sm
- Add motion-safe guard to details chevron transition
- Replace text-warning with text-amber-900 to meet WCAG AA contrast

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 16:31:30 +02:00
Marcel
93da5914a8 test(admin): add regression test for skipped section hidden during RUNNING
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 16:30:43 +02:00
Marcel
53dba772ae fix(import): add @Schema annotations and fix IOException test coverage
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m2s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Successful in 3m14s
CI / fail2ban Regex (pull_request) Successful in 44s
CI / Semgrep Security Scan (pull_request) Successful in 19s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s
- Add @Schema(requiredMode = REQUIRED) to SkippedFile and ImportStatus
  record components so TypeScript codegen produces non-optional fields
  when generate:api is next run
- Extract openFileStream(File) as package-private method so the
  IOException path can be tested deterministically without relying on
  OS-level file permissions (which are bypassed when running as root)
- Replace assumeTrue-based IOException test with Mockito spy that stubs
  openFileStream — test now runs in CI unconditionally (45 tests, 0 skipped)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 15:12:02 +02:00
Marcel
e770b81ea5 fix(test): skip IOException test when running as root
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m4s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Successful in 3m14s
CI / fail2ban Regex (pull_request) Successful in 47s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
setReadable(false) silently no-ops as root; check canRead() to guard
the assumption correctly so the test is skipped in Docker CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 14:58:34 +02:00
Marcel
50441f558f fix(import): address PR review concerns
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 3m3s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Failing after 3m2s
CI / fail2ban Regex (pull_request) Successful in 39s
CI / Semgrep Security Scan (pull_request) Successful in 19s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
- remove duplicate List import in AdminControllerTest
- derive skipped() from skippedFiles.size() — drop redundant int field
- use machine codes for SkippedFile.reason (INVALID_PDF_SIGNATURE, FILE_READ_ERROR)
- map reason codes to i18n strings in ImportStatusCard (de/en/es)
- replace raw amber Tailwind classes with warning semantic token
- fix <summary> accessibility: replace list-none with rotating chevron SVG
- replace <p> with <span> inside <summary> (phrasing content rule)
- extract setupOneValidOneFakeImport() helper — remove 3x copy-paste
- add lenient mock to short-file test for defensive coverage
- add IOException path test for isPdfMagicBytes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 14:45:21 +02:00
Marcel
22589e4729 feat(admin): surface skipped file count in ImportStatusCard
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m4s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Successful in 3m2s
CI / fail2ban Regex (pull_request) Successful in 39s
CI / Semgrep Security Scan (pull_request) Successful in 18s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m3s
Adds SkippedFile to the local ImportStatus type and updates
ImportStatusCard to show an amber skipped-count section with a
collapsible filename list in the DONE state. Only rendered when
skipped > 0. i18n keys added for de/en/es.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 13:14:32 +02:00
Marcel
e124c68cf4 feat(import): validate PDF magic bytes before S3 upload
Reads first 4 bytes of each candidate file before upload; rejects any
file whose header does not match %PDF (0x25 0x50 0x44 0x46). Skipped
files are counted and collected in ImportStatus.skippedFiles so
operators can see what was rejected without querying Loki.

Breaking: ImportStatus record gains skipped + skippedFiles fields.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 12:51:29 +02:00