[bug] Scanned PDFs with CCITT/JBIG2 images render blank — pdf.js 5.x wasmUrl not configured #708
Summary
In the document viewer, some scanned PDFs render blank (no page image) while others render fine. The document thumbnail/preview still shows in both cases, which masks the problem. Root cause: pdf.js 5.x moved the JBIG2 + CCITTFax + JPEG2000 image decoders into WebAssembly, but our renderer never configures the `wasmUrl` option, so those decoders fail to initialize and the page paints nothing.
This blocks the archive's core read journey for an entire class of documents (bi-level black-and-white scans): ~16% of all documents, roughly 1,200 letters (see Blast radius below).
Symptoms
- `usePdfRenderer.svelte.ts` (the `renderCurrentPage` catch at the `RenderingCancelledException` block just `return`s without setting `error`), so the user sees a blank canvas rather than any message.
Root cause
`frontend/src/lib/document/viewer/usePdfRenderer.svelte.ts` sets `GlobalWorkerOptions.workerSrc` but calls `getDocument(src)` with no `wasmUrl`. In pdf.js 5.5.207:
- `getDocument({ … })` accepts a `wasmUrl` option (see `pdfjs-dist/build/pdf.mjs:14439`, error guard at `pdf.mjs:9202`: "Ensure that the `wasmUrl` API parameter is provided.").
- The shared wasm module (`jbig2.wasm`) decodes BOTH JBIG2 and CCITTFax: see `JBig2CCITTFaxWasmImage.decode()` in `pdf.worker.mjs:4156`; the `CCITTOptions` branch (`pdf.worker.mjs:4182`) calls `_ccitt_decode` inside the same module.
- With `wasmUrl` unset, `#instantiateWasm` (`pdf.worker.mjs:4134`) fails, `decode()` resolves the module to `null`, and `pdf.worker.mjs:4174-4175` throws `JBig2Error: JBig2 failed to initialize`. The JS-module fallback path builds `${wasmUrl}jbig2_nowasm_fallback.js`, which with `null` becomes the broken bare specifier `nulljbig2_nowasm_fallback.js`, so it can't rescue either.
Why only some PDFs are affected (verified against staging)
The differentiator is the image codec inside each PDF, not the scan/import workflow. The scanner/converter picks compression per page by content:
- `C-2703` (`d8f9bb15-cb67-4c72-85f4-cc5a0c4e3dab`): `DCTDecode` (JPEG), 8-bit RGB, ~504 KB
- `C-3224` (`bd895525-34f5-4ee4-9a5c-37ceecd7bb37`): `CCITTFaxDecode` (G4 fax), 1-bit DeviceGray, ~29 KB
So JPEG (contone/photo) pages display; CCITT G4 / JBIG2 (bi-level B&W text) pages are blank. Both docs are `%PDF-1.3` from the same scanner.
Evidence gathered
- `%PDF` header, `%%EOF` trailer, correct `application/pdf`, correct sizes. Backend `/api/documents/{id}/file` and DB rows are healthy.
- `pdfjs-dist` 5.5.207 legacy build loads & renders `C-3224` in Node only because it silently falls back to an in-tree JS CCITT decoder; the modern browser build does not fall back (confirmed by the live console error above).
- `node_modules/pdfjs-dist/wasm/` ships: `jbig2.wasm`, `openjpeg.wasm`, `openjpeg_nowasm_fallback.js`, `qcms_bg.wasm` (+ licenses). There is no `jbig2_nowasm_fallback.js`: JBIG2/CCITT have no JS fallback, so the wasm is mandatory.
Blast radius (random sample, n=200 of 7,534 PDFs)
- `DCTDecode` (JPEG)
- `CCITTFaxDecode` (G4 fax)
- `JBIG2Decode` / `JPXDecode` / unclassified
~16% of documents affected → roughly 1,200 letters archive-wide (95% CI ≈ 11–21%). About 1 in 6. The sample found zero true JBIG2 docs: the `JBig2 failed to initialize` console wording is a red herring. pdf.js routes CCITT through the shared `JBig2CCITTFaxImage` wasm module, so a CCITT failure surfaces as a `JBig2Error`. The affected class is entirely CCITT (G4 fax). Nothing was ever lost: affected docs always had a working download link + server thumbnail.
Decisions (resolved — best-practice defaults)
After multi-persona review (see comments), the three open decisions are resolved as:
- `vite-plugin-static-copy` (new devDependency). Chosen over a `prebuild` script or committing blobs because this bug is a dev/prod parity failure, and the plugin guarantees parity: it serves `/pdfjs-wasm/` via dev middleware and emits it to the build output from one config line, reads from `node_modules` (always version-matched), and fails the build loudly if the source dir is absent. It is a devDependency only, never shipped to the runtime image.
- `/pdfjs-wasm/` with the adapter's default revalidating cache, NOT `immutable`. `immutable` on a non-content-hashed URL would serve a stale `.wasm` against a new worker after a future pdfjs bump. A `304` on a ~105 KB file is a rounding error at our scale. Revisit version-stamped + immutable only if profiling ever justifies it.
Proposed fix
1. Serve the wasm at `/pdfjs-wasm/` via `vite-plugin-static-copy`, sourced from `node_modules/pdfjs-dist/wasm/`. Include all files (`jbig2.wasm`, `openjpeg.wasm`, `openjpeg_nowasm_fallback.js`, `qcms_bg.wasm`): `openjpeg.wasm` covers JPEG2000/`JPXDecode` scans for free and pre-empts a sequel issue. Verify the assets land in `build/client/` and are served by the production Docker image, not just `npm run dev`.
2. Pass `wasmUrl` to `getDocument` in `usePdfRenderer.svelte.ts`, configured once next to `workerSrc` in `init()` (single source of truth, no repeated literal). `wasmUrl` must be a directory URL with a trailing slash; pdf.js appends `jbig2.wasm` etc.
3. In `renderCurrentPage`, when `task.promise` rejects with anything other than `RenderingCancelledException`, set `error` to a localized message (new `doc_render_failed` key in `messages/{de,en,es}.json`), never the raw pdf.js `e.message`. This routes into the existing error UI (message + download link).
4. Add `rel="noopener noreferrer"` to the download `<a target="_blank">` in `DocumentViewer.svelte` (CWE-1022).
5. Add a comment in `infra/caddy/Caddyfile` `(security_headers)` noting that any future `Content-Security-Policy` must include `script-src 'wasm-unsafe-eval'` and `worker-src 'self' blob:`, or PDF rendering breaks again. Reference this issue.
Acceptance criteria
User-visible outcomes:
- The CCITT document (`C-3224`) renders a visible page image (canvas contains non-background pixels above a sampled threshold).
- The JPEG document (`C-2703`) still renders, with no regression.
- A `JPXDecode` scan renders (covered by `openjpeg.wasm`): assert if a fixture exists, else explicitly note none was found (none in archive sample).
Implementation/ops signals:
- No `wasmUrl` / `JBig2 failed to initialize` warnings in the browser console for affected docs.
- Verified on the production path (`node build`), not just `npm run preview`: `curl -I /pdfjs-wasm/jbig2.wasm` → `200` + `Content-Type: application/wasm`.
Tests (TDD)
- Fake `LibLoader` (via `testHelpers.makeFakeLibLoader`) → assert `getDocument` is called with a non-null `wasmUrl` ending in `/`. Red first (currently called with a bare `src` string).
- Browser test (`PdfViewer.svelte.test.ts`): render committed fixtures and assert the canvas is non-blank (sample pixel count, mirror the repro). Fixtures: CCITT (`C-3224`, 29 KB), JBIG2, DCT (no-regression). Fixtures are committed as hermetic test assets; do not fetch from staging at test time.
- Render failure: assert `error` is set, the localized message renders, and the download link is present.
- Assert no `JBig2 failed to initialize` / `wasmUrl` warnings appear.
Scope boundaries
- `standardFontDataUrl` / `iccUrl` (also new in pdf.js 5.x) are out of scope unless a specific affected document is found. Do not gold-plate; file a follow-up only if evidence appears.
- Setting `VITE_SENTRY_DSN` for the frontend (so client-side decode/render failures surface in GlitchTip instead of dying in the console) is a worthwhile separate observability issue, not part of this fix.
🏛️ Markus Keller — Application Architect
Observations
- `frontend/Dockerfile` production stage copies only `/app/build` and then runs `npm ci --omit=dev --ignore-scripts`. The wasm files in `node_modules/pdfjs-dist/wasm/` are not web-served and won't be reachable at `/pdfjs-wasm/` at runtime. So the fix must emit the wasm into the SvelteKit client build output (`build/client/...`). A `wasmUrl` that points at node_modules works in `npm run dev` and silently 404s in the Docker image: exactly the kind of dev/prod drift that makes a "fix" look done while staging stays broken.
- `pdfjs-dist` is a runtime `dependency` (`^5.5.207`), so it survives `--omit=dev`, but that's irrelevant since node_modules isn't on the web path.
- The silent `return` in `renderCurrentPage` (lines 94-103) is a reliability smell I care about independent of this bug: the system fails invisibly. Push failures up loudly. Proposed fix item #3 is correct and should not be dropped as "nice to have."
Recommendations
- Place the wasm files in `frontend/static/pdfjs-wasm/` (SvelteKit copies `static/` verbatim into `build/client/`), or copy them there in a `prebuild` npm script sourced from `node_modules/pdfjs-dist/wasm/`. Either way the asset is in `build/` and served by adapter-node at `/pdfjs-wasm/`. This avoids a new build-plugin dependency (see Decision Queue for the tradeoff vs. `vite-plugin-static-copy`).
- `workerSrc` is wired in `init()`; put `wasmUrl` next to it, not buried in `loadDocument`. One place configures pdf.js's external assets.
- Verify in the built Docker image, not just `npm run preview`. The acceptance criterion should read "renders in the built Docker image," because preview and the Node adapter image resolve static assets differently enough to matter here.
- Add a note to `CONTRIBUTING.md` that bumping `pdfjs-dist` requires re-copying wasm; otherwise the next upgrade reintroduces this exact bug. Cheap insurance that keeps the reason on record.
Open Decisions
- `prebuild` npm script copying `node_modules/pdfjs-dist/wasm/*` → `static/pdfjs-wasm/`. No new dependency; one script to maintain; drift risk on version bump (mitigated by a CONTRIBUTING note or a build-time existence check).
- `vite-plugin-static-copy` (new devDependency). Auto-tracks the source dir, fails the build if absent; one more plugin in the maintenance surface.
- Commit the files to `static/pdfjs-wasm/`. Simplest runtime; binary blobs in git; must remember to update on every pdfjs bump (worst drift profile).
👨💻 Felix Brandt — Senior Fullstack Developer
Observations
- The unit test (`usePdfRenderer.svelte.test.ts`) explicitly can't cover `init()` / `loadDocument()`; its own comment says "require pdfjsLib (browser module)". It only tests pure state (clamping, zoom). So a true "page renders" assertion belongs in browser mode (`PdfViewer.svelte.test.ts`), which runs in CI's Playwright container. Good, that path exists.
- `PdfViewer.svelte` already injects a `libLoader` prop and `testHelpers.ts` exposes `makeFakeLibLoader`. That's the seam: I can assert `getDocument` is called with `{ url, wasmUrl }` without a real browser, using a fake loader that records the call. That's a fast, deterministic red test for fix item #2.
- The catch block just `return`s after distinguishing `RenderingCancelledException`. The `error` string today is set in `loadDocument` from `e.message`, i.e. a raw pdf.js English string ("Failed to load PDF"). If we start surfacing render failures, we should not leak raw pdf.js text to users (see Leonie/Nora).
Recommendations
- Fake the `LibLoader` → assert `getDocument` receives a non-null `wasmUrl` ending in `/`. Red first (currently it's called with a bare `src` string), then green. This guards the regression cheaply and runs outside the browser.
- Render a CCITT fixture (the `C-3224` scan) through `PdfViewer` and assert the canvas is non-blank (sample pixels, like the repro did: count non-white). Add a JBIG2 fixture too, since that's the codec from the actual console error.
- When `task.promise` rejects with anything other than `RenderingCancelledException`, set `error` to a localized string (reuse `getErrorMessage` / a new `doc_render_failed` message in `messages/{de,en,es}.json`), not `e.message`. The viewer already renders `{error}` + the download link.
- Define `wasmUrl` once (next to `workerSrc` in `init`); don't repeat the `/pdfjs-wasm/` literal across files.
- Keep the `getDocument` signature change minimal: `getDocument({ url: src, wasmUrl })`. Note `src` here is the file URL string from `useFileLoader`; confirm the object form doesn't break the existing data-URL/blob path, if any.
Open Decisions (none)
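The fake-loader test recommended above could look roughly like the sketch below. The real `makeFakeLibLoader` in `testHelpers.ts` is not shown in this issue, so `makeRecordingLoader`, `openDocument`, `hasDirectoryWasmUrl`, and the `PdfLib` shape are hypothetical stand-ins rather than the project's actual API; only the `{ url, wasmUrl }` call form and the `/pdfjs-wasm/` value come from the thread.

```typescript
// Hypothetical minimal shape of the pdf.js surface the renderer touches.
type GetDocumentParams = { url: string; wasmUrl?: string | null };

interface PdfLib {
  getDocument(params: GetDocumentParams | string): { promise: Promise<unknown> };
}

// Assumed single constant for the decoder directory (trailing slash required).
const WASM_URL = '/pdfjs-wasm/';

// Hypothetical fake loader: records every getDocument argument instead of loading a PDF.
function makeRecordingLoader(): { lib: PdfLib; calls: Array<GetDocumentParams | string> } {
  const calls: Array<GetDocumentParams | string> = [];
  const lib: PdfLib = {
    getDocument(params) {
      calls.push(params);
      return { promise: Promise.resolve({ numPages: 1 }) };
    },
  };
  return { lib, calls };
}

// Stand-in for the renderer's load step: object form with wasmUrl, never a bare string.
function openDocument(lib: PdfLib, src: string): { promise: Promise<unknown> } {
  return lib.getDocument({ url: src, wasmUrl: WASM_URL });
}

// The check a unit test would apply to the recorded call.
function hasDirectoryWasmUrl(arg: GetDocumentParams | string): boolean {
  return typeof arg === 'object' && typeof arg.wasmUrl === 'string' && arg.wasmUrl.endsWith('/');
}
```

Run against the current code path (a bare `src` string), `hasDirectoryWasmUrl` returns `false`, which is the red state the thread asks for before the fix goes green.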
🛡️ Nora Steiner ("NullX") — Application Security Engineer
Observations
- The wasm is same-origin (`/pdfjs-wasm/*.wasm` served by our own adapter-node). No SSRF, no third-party CDN, no SRI concern. Serving the wasm from our own origin is the correct security posture. Do not be tempted to point `wasmUrl` at `unpkg` / `cdnjs` to "save a copy step"; that would add a remote-code-execution-via-supply-chain surface for a core viewer path.
- `infra/caddy/Caddyfile` `(security_headers)` ships HSTS, `X-Content-Type-Options: nosniff`, `Referrer-Policy`, `Permissions-Policy`, but no `Content-Security-Policy`. So nothing blocks wasm today. That's why it'll work, but it's also a latent trap.
- `DocumentViewer.svelte`: the failure-state download link is `target="_blank"` with no `rel="noopener noreferrer"` (CWE-1022, reverse tabnabbing). Same-origin so low severity, but it's a one-token fix and this issue already touches that component.
- `nosniff` is set globally, so the wasm must be served with `Content-Type: application/wasm` or `WebAssembly.instantiateStreaming` will refuse it. adapter-node's static handler sets this from the `.wasm` extension; verify it survives the Docker build.
Recommendations
- Keep `wasmUrl` pointed at our own origin (`/pdfjs-wasm/`). Never a public CDN for a decoder on the main read path.
- Any future CSP must include `script-src ... 'wasm-unsafe-eval'` (and `worker-src 'self' blob:`), or this exact decoder breaks again. Drop a comment in the Caddyfile `(security_headers)` block referencing this issue so the future CSP author doesn't silently re-break PDF rendering.
- Add `rel="noopener noreferrer"` to the download `<a target="_blank">` while you're in the file.
- Verify `Content-Type: application/wasm` on `/pdfjs-wasm/jbig2.wasm` in the built image (`curl -I`). Add it to the acceptance checklist.
Open Decisions (none — all concrete)
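If a `Content-Security-Policy` is ever added, the CSP requirement above could be guarded by a small check such as the following sketch. `parseCsp` and `cspAllowsPdfWasm` are hypothetical helper names, and the check is deliberately strict: it looks only for the exact sources named in this thread and ignores fallbacks such as `default-src` or `'unsafe-eval'`.

```typescript
// Parse a Content-Security-Policy header value into directive name -> source list.
function parseCsp(header: string): Map<string, string[]> {
  const directives = new Map<string, string[]>();
  for (const part of header.split(';')) {
    const tokens = part.trim().split(/\s+/).filter((t) => t.length > 0);
    const name = tokens[0];
    if (name === undefined) continue; // empty segment, e.g. a trailing ';'
    // Directive names are case-insensitive; a repeated directive is ignored after its first occurrence.
    const key = name.toLowerCase();
    if (!directives.has(key)) directives.set(key, tokens.slice(1));
  }
  return directives;
}

// True when the policy explicitly carries the sources this issue says pdf.js needs.
// Simplification (assumption): no fallback to default-src, no credit for 'unsafe-eval'.
function cspAllowsPdfWasm(header: string): boolean {
  const csp = parseCsp(header);
  const scriptSrc = csp.get('script-src') ?? [];
  const workerSrc = csp.get('worker-src') ?? [];
  return (
    scriptSrc.includes("'wasm-unsafe-eval'") &&
    workerSrc.includes("'self'") &&
    workerSrc.includes('blob:')
  );
}
```

A test could feed this the header value the proxy would emit and fail when it returns `false`, which turns the Caddyfile comment into something a pipeline can enforce.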
🧪 Sara Holt — QA Engineer & Test Strategist
Observations
- CI runs browser tests in `mcr.microsoft.com/playwright` (`infra/gitea/workflows/ci.yml`), so a real-render assertion is viable in CI even though browser tests are unreliable locally.
Recommendations — test matrix
Cover the codec axis explicitly; one fixture per decode path:
- `C-3224` (CCITT G4)
- `C-2703` (DCT/JPEG)
Beyond the fixtures:
- Inject a failing `libLoader` and assert `error` becomes set + the localized message + download link render. Today that path is invisible; lock it.
- Fake the `LibLoader`, assert `getDocument` called with non-null `wasmUrl`. Fast canary against accidental removal.
- Assert no `JBig2 failed to initialize` / `wasmUrl` warnings appear. The warnings are the cheapest oracle we have.
Open Decisions (none)
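The non-blank canvas oracle in the matrix above can be reduced to a pure function over pixel bytes, sketched below. The function names, the tolerance of 8, and the default threshold of 100 pixels are illustrative assumptions; the thread only specifies "count non-white pixels above a sampled threshold".

```typescript
// Count pixels that differ from white by more than `tolerance` on any colour channel.
// `rgba` uses the ImageData.data layout: 4 bytes per pixel (R, G, B, A).
// Fully transparent pixels are treated as background and skipped.
function countNonWhitePixels(rgba: Uint8ClampedArray, tolerance = 8): number {
  let count = 0;
  for (let i = 0; i + 3 < rgba.length; i += 4) {
    const r = rgba[i] ?? 255;
    const g = rgba[i + 1] ?? 255;
    const b = rgba[i + 2] ?? 255;
    const a = rgba[i + 3] ?? 0;
    if (a === 0) continue;
    if (255 - r > tolerance || 255 - g > tolerance || 255 - b > tolerance) count++;
  }
  return count;
}

// Oracle for "the page rendered": enough non-background pixels are present.
function isCanvasNonBlank(rgba: Uint8ClampedArray, minPixels = 100): boolean {
  return countNonWhitePixels(rgba) >= minPixels;
}
```

In a browser test the input would typically come from `canvas.getContext('2d').getImageData(0, 0, width, height).data`; keeping the counting logic pure means it can also be unit-tested outside the browser.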
🎨 Leonie Voss — UI/UX Design Lead & Accessibility Advocate
Observations
- The silent `return` (lines 94-103) is an accessibility failure, not just a code smell.
- `DocumentViewer.svelte` already has a decent error state (message + "Try direct download" link, keys `doc_download_link` exist in de/en/es ✅); it's just never reached on render failure, only on load failure.
- `error` is set from pdf.js's raw `e.message`, an untranslated English string ("Failed to load PDF") shown to German-first users. That violates our i18n baseline.
Recommendations
- Add `doc_render_failed` to `messages/{de,en,es}.json`, e.g. DE: "Dieser Scan konnte nicht angezeigt werden. Sie können die Datei direkt herunterladen." The download link is already the right escape hatch; make sure it's the focal point of that state.
- Confirm the download `<a>` is reachable by Tab and has a visible focus ring (our links should already, but this state is rarely seen, so confirm it). It's the only recovery action; it must be operable without a mouse.
- Show a brief rendering state until the first paint (the `loading` flag flips on load, not on first paint); otherwise we trade "blank forever" for "blank, but it looks done." A brief "rendering…" state is honest.
- The error text is `text-ink-3` on `bg-pdf-bg`: check it clears WCAG AA (4.5:1) in that dark viewer chrome; `ink-3` is a muted tone and this is critical recovery copy, not decoration.
Open Decisions (none — these are concrete fixes)
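The WCAG AA check on the error text can be computed rather than eyeballed. The sketch below follows the WCAG 2.x relative-luminance and contrast-ratio definitions; the helper names are hypothetical, and the actual hex values behind `text-ink-3` and `bg-pdf-bg` are not given in this thread, so they would need to be read from the theme.

```typescript
// Convert one 8-bit sRGB channel to linear light (WCAG 2.x relative-luminance definition).
function linearize(channel8bit: number): number {
  const c = channel8bit / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a 6-digit hex colour such as "#1a2b3c" (leading '#' optional).
function relativeLuminance(hex: string): number {
  const digits = hex.replace(/^#/, '');
  if (!/^[0-9a-f]{6}$/i.test(digits)) {
    throw new Error(`expected a 6-digit hex colour, got "${hex}"`);
  }
  const r = linearize(parseInt(digits.slice(0, 2), 16));
  const g = linearize(parseInt(digits.slice(2, 4), 16));
  const b = linearize(parseInt(digits.slice(4, 6), 16));
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio as defined by WCAG: (lighter + 0.05) / (darker + 0.05).
function contrastRatio(foreground: string, background: string): number {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

// WCAG AA threshold for normal-size body text.
function meetsAA(foreground: string, background: string): boolean {
  return contrastRatio(foreground, background) >= 4.5;
}
```

Calling `meetsAA` with the resolved colour values of the two tokens answers the question directly and could run as a unit test so a later theme change cannot silently regress the recovery copy.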
🔧 Tobias Wendt — DevOps & Platform Engineer
Observations
- Production runs `node build` from `frontend/Dockerfile`'s final stage, which copies only `/app/build` and prunes to prod deps with `--ignore-scripts`. Anything not emitted into `build/` at build time does not exist at runtime. A `wasmUrl` that resolves in `npm run dev` (Vite serves from node_modules) will 404 in staging. This is exactly how we got here with the worker being fine but wasm missing.
- `archiv-staging-frontend-1` is `familienarchiv/frontend:nightly`, `node build`, `ORIGIN=https://staging.raddatz.cloud`, `NODE_ENV=production`. No frontend Sentry DSN, so these decode failures were never reported; they died as browser-console `warn`s. That's an observability gap of its own.
- Caddy (`infra/caddy/Caddyfile`) just reverse-proxies `127.0.0.1:3001` → the Node server; it does not serve or rewrite assets, so `/pdfjs-wasm/` will pass straight through. Good: no Caddy change needed for the happy path.
Recommendations
- Emit the wasm into `build/client/` at build time. `static/pdfjs-wasm/` (SvelteKit copies `static/` into the client build) is the lowest-failure-mode option: no runtime node_modules dependency, no new moving part in the image. If sourced via a `prebuild` copy script, add a build-time assertion that the files landed (fail the build loudly if `jbig2.wasm` is absent); a silent-missing-asset is what bit us.
- Verify in the built image: `curl -I https://.../pdfjs-wasm/jbig2.wasm` → expect `200` + `Content-Type: application/wasm`. `npm run preview` is not the same code path as `node build`.
- `/pdfjs-wasm/` is not content-hashed (unlike `_app/immutable/`). Do not slap `immutable, max-age=31536000` on it, or a future pdfjs bump serves a stale wasm against a new worker → silent breakage. Either version the directory (`/pdfjs-wasm/5.5.207/`) and cache-bust on bump, or use a short `max-age` + revalidation. Cheap and avoids a nasty upgrade-day incident.
- Added image size (`jbig2.wasm` 105 KB, `openjpeg.wasm`, `qcms_bg.wasm`): negligible. Ship all of them; `openjpeg.wasm` covers JPEG2000/JPX scans for free and avoids a sequel issue.
- Set `VITE_SENTRY_DSN` for the frontend so client-side decode/render failures surface in GlitchTip instead of dying in the console. We were blind to this for an entire class of documents.
Open Decisions
- Versioned path (`/pdfjs-wasm/<version>/`, immutable, bust on bump) vs. unversioned path with `max-age=3600, must-revalidate`. Versioned is more correct but adds a step on every pdfjs upgrade; unversioned-short is simpler but pays a small revalidation cost per load. (Raised here; low stakes, but pick one deliberately.)
📋 Elicit — Requirements Engineer / Business Analyst
Observations
- "No `wasmUrl` warnings in console" is an implementation signal; the user-facing requirement is "the page image is visible." Keep both, but label them.
- How many documents are affected is unknown; an `mc` sweep over the bucket sampling image filters would tell us. Recommend capturing it in the issue.
Recommendations — tighten acceptance criteria
- Define "renders" as "the `<canvas>` contains non-background pixels (sampled count > N)", not "no error." Sara's pixel oracle is the testable form; adopt that wording in the AC.
- Cover the edge cases: JPEG2000 (`openjpeg.wasm`; assert or explicitly defer), and multi-page mixed-codec PDF (a doc where page 1 is JPEG and page 2 is fax: does paging between them work?).
- `standardFontDataUrl` / `iccUrl` (also new in pdf.js 5.x) are out of scope for this issue unless a specific affected document is found. Note it so the PR isn't blocked on gold-plating, and file a follow-up only if evidence appears.
Open Decisions
🗳️ Decision Queue — Action Required
3 decisions need your input before implementation starts. Everything else is concrete recommendation — no need to respond to it.
Build / Architecture
- How to get the wasm into `build/client/`: the prod Docker image only ships `/build`, so a `node_modules`-relative `wasmUrl` works in `dev` and 404s in staging.
  - `prebuild` npm script copying `pdfjs-dist/wasm/*` → `static/pdfjs-wasm/` + a build-time existence assertion. No new dependency; small drift risk on pdfjs bump. (Markus's lean, Tobias concurs)
  - `vite-plugin-static-copy`: new devDependency, auto-tracks the source, fails build if missing.
  - Commit to `static/pdfjs-wasm/`: simplest runtime, binary blobs in git, worst drift profile.
(Raised by: Markus, Tobias)
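For the `prebuild` option, the build-time existence assertion could be as small as the sketch below. The function names and the injected `exists` predicate are hypothetical; the file list is the one reported earlier in this issue for `node_modules/pdfjs-dist/wasm/` (license files omitted). In a real script the predicate would presumably be `existsSync` from `node:fs`.

```typescript
// Assets this issue lists as shipped in node_modules/pdfjs-dist/wasm/ (licenses omitted).
const REQUIRED_WASM_ASSETS: readonly string[] = [
  'jbig2.wasm',
  'openjpeg.wasm',
  'openjpeg_nowasm_fallback.js',
  'qcms_bg.wasm',
];

// Return the required assets that are absent from `dir`.
// `exists` is injected so the check stays pure and testable without touching the disk.
function missingWasmAssets(
  dir: string,
  exists: (path: string) => boolean,
  required: readonly string[] = REQUIRED_WASM_ASSETS,
): string[] {
  const base = dir.replace(/\/+$/, ''); // tolerate a trailing slash on the directory
  return required.filter((name) => !exists(`${base}/${name}`));
}

// Fail loudly (throw) when any asset is missing, so the build stops instead of shipping a 404.
function assertWasmAssets(dir: string, exists: (path: string) => boolean): void {
  const missing = missingWasmAssets(dir, exists);
  if (missing.length > 0) {
    throw new Error(`pdf.js wasm assets missing from ${dir}: ${missing.join(', ')}`);
  }
}
```

Called as `assertWasmAssets('static/pdfjs-wasm', existsSync)` at the end of the copy step, a missing `jbig2.wasm` would abort the build with a named file rather than surfacing later as a blank page.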
Infrastructure
- Cache policy for `/pdfjs-wasm/` (path is not content-hashed, unlike `_app/immutable/`):
  - `/pdfjs-wasm/<version>/` + `immutable`: correct, bust on every pdfjs upgrade.
  - `max-age=3600, must-revalidate`: simpler, tiny per-load revalidation cost.
Avoid plain `immutable` on an unversioned path; a future bump would serve stale wasm against a new worker. (Raised by: Tobias)
Product / Communication
Cross-cutting themes the whole panel converged on (not decisions — just do them):
- Verify in the built Docker image, not `npm run preview` (Markus, Sara, Tobias): this is a dev/prod asset-resolution bug; preview won't reproduce it.
- Ship `openjpeg.wasm` too (Tobias, Elicit): covers JPEG2000/JPX scans for free and pre-empts a sequel issue.
✅ Decisions resolved + blast radius measured
The three open decisions are resolved with best-practice defaults and folded into the issue body:
- `vite-plugin-static-copy` (dev/prod parity; devDependency only).
- `/pdfjs-wasm/` with default revalidating cache (no `immutable` on a non-hashed URL).
Blast radius (random sample, n=200 of 7,534 PDFs)
- `DCTDecode` (JPEG)
- `CCITTFaxDecode` (G4 fax)
- `JBIG2Decode`
- `JPXDecode` (JPEG2000)
pdf.js routes CCITT through the shared `JBig2CCITTFaxImage` wasm module, so a CCITT decode failure surfaces as a `JBig2Error`. The entire affected class is CCITT (G4 fax); the JBIG2 wording was a red herring from the shared decoder name.
Method: 200 randomly-sampled PDFs, codec read from the image XObject `/Filter` in the first 250 KB. Full 7,534-doc scan was deliberately avoided (≈2 GB of transfer); a sample gives a ±5% estimate at negligible cost.
✅ Implemented — PR #713
Branch `feat/issue-708-pdfjs-wasmurl` (worktree off `main`). TDD red→green, 8 atomic commits:
- `8d2ef97f`: `node_modules/pdfjs-dist/wasm/*` at `/pdfjs-wasm/` via `vite-plugin-static-copy` (devDep), emitted into `build/client/`
- `be42e1f0`: `{ url, wasmUrl: '/pdfjs-wasm/' }` to `getDocument` (single constant)
- `5a4b55e3`: `doc_render_failed` (de/en/es)
- `aa1e89c2`: `renderCurrentPage` surfaces non-cancellation render failures (no more silent blank)
- `e0eedc70`: `PdfViewer` error message + download link
- `6690e137`: `rel="noopener noreferrer"` on `DocumentViewer` download link (CWE-1022)
- `cf860193`
- `688d3812`: `'wasm-unsafe-eval'` + `worker-src 'self' blob:`
Acceptance criteria
- … `wasmUrl` is removed.
- … `jbig2.wasm` module (no `jbig2enc` locally; zero true JBIG2 docs in the archive sample, per the blast-radius study). Documented in the test.
- `openjpeg.wasm` is shipped; no fixture asserted (none in archive sample; `openjpeg` decodes natively in Node so no hermetic synth available). Explicitly noted, not gold-plated.
Ops signals
- `node build` (adapter-node, the prod path) serves `/pdfjs-wasm/jbig2.wasm` → 200 + `application/wasm`.
- `npm run build` emits `jbig2.wasm` / `openjpeg.wasm` / `qcms_bg.wasm` into `build/client/pdfjs-wasm/`.
- … `frontend/Dockerfile`'s production stage, run it, `curl -I /pdfjs-wasm/jbig2.wasm`.
Notes
- … `svelte-check` error baseline; this PR introduces zero new type errors in touched files.
Next: multi-persona PR review on #713.