How to Extract Images from a PDF (Without Re-encoding Them)

PDFs store images as embedded objects inside the file. Here's how to pull them out without quality loss, no upload required.

PDFs hold images as embedded binary objects inside the file. When you want those images back out, many tools take the lazy route: render the page to a bitmap and save the result. You lose quality, you lose format information, and you get a PNG at whatever DPI the renderer decided to use. The right approach reads the embedded objects directly and writes them out as-is.

The re;file labs PDF image extractor does that. Drop in a PDF, and it pulls the embedded images out of the file, preserving the original encoding where possible. Everything runs locally in your browser via WebAssembly — nothing is uploaded anywhere.

How PDFs Actually Store Images

A PDF isn't a flat canvas. It's a structured document containing objects: fonts, content streams, and image streams. Each image in a PDF is a separate object with its own stream of bytes, compression filter, color space, and dimensions.

The three most common encodings you'll encounter:

  • DCTDecode — JPEG. The JPEG bitstream is stored directly in the PDF. When you extract it, you get the original JPEG bytes back, with no quality loss.
  • FlateDecode — Zlib-compressed raw pixel data. Common for PNG-style images. The extractor decompresses it and converts to RGBA for export.
  • JPXDecode — JPEG 2000. Less common, mostly in PDFs generated by Adobe products. Stored as a raw JP2 bitstream.
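As a rough illustration of the FlateDecode path described above, here is a minimal Python sketch. It is not the extractor's actual code (which is Rust); the function name `flate_rgb_to_rgba` is hypothetical, and it assumes the simplest case: DeviceRGB, 8 bits per component, and no predictor applied before compression.

```python
import zlib


def flate_rgb_to_rgba(stream: bytes, width: int, height: int) -> bytes:
    """Decompress a FlateDecode stream and expand 8-bit RGB pixels to RGBA.

    Assumption: DeviceRGB, 8 bits per component, no PNG/TIFF predictor.
    """
    raw = zlib.decompress(stream)
    expected = width * height * 3
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes of RGB data, got {len(raw)}")

    rgba = bytearray(width * height * 4)
    rgba[0::4] = raw[0::3]                   # red channel
    rgba[1::4] = raw[1::3]                   # green channel
    rgba[2::4] = raw[2::3]                   # blue channel
    rgba[3::4] = b"\xff" * (width * height)  # fully opaque alpha
    return bytes(rgba)


# A 2x1 test image: one red pixel followed by one blue pixel.
pixels = bytes([255, 0, 0, 0, 0, 255])
stream = zlib.compress(pixels)
print(list(flate_rgb_to_rgba(stream, width=2, height=1)))
# [255, 0, 0, 255, 0, 0, 255, 255]
```

DCTDecode and JPXDecode streams need none of this: their bytes are already a complete JPEG or JPEG 2000 file and can be written out unchanged.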

Scanned documents are a special case. They're typically stored as a single large image per page (often DCTDecode), so extracting "images from a scanned PDF" gives you one image per page at whatever resolution the scan was done. That's expected behavior, not a bug.

Using the Extractor

Open the PDF image extractor and drop in a PDF. The tool immediately scans the file for image objects, and each extracted image appears in a grid with its page number, index on that page, format (JPEG, JP2, or PNG), and pixel dimensions.

Download options:

  • Individual download — click any image in the grid, or hover it for the download button
  • Download all individually — triggers browser downloads for every image at once
  • ZIP (flat) — everything in a single folder, named page1_image1.jpg, page1_image2.jpg, etc.
  • ZIP (by page) — organized into subfolders: page-1/image1.jpg, page-2/image1.jpg

The ZIP option is the practical one for PDFs with many images. Triggering 40 individual browser downloads for a product catalog is not something you want to do.
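The two ZIP layouts differ only in how each entry is named. A small Python sketch of that naming scheme follows; the helper names `archive_name` and `build_zip` are hypothetical, and the sample image bytes are placeholders rather than real image data.

```python
import io
import zipfile


def archive_name(page: int, index: int, ext: str, by_page: bool) -> str:
    """Build the entry name for one image (page and index are 1-based)."""
    if by_page:
        return f"page-{page}/image{index}.{ext}"
    return f"page{page}_image{index}.{ext}"


def build_zip(images, by_page: bool = False) -> bytes:
    """Pack (page, index, ext, data) tuples into an in-memory ZIP archive."""
    buf = io.BytesIO()
    # ZIP_STORED: JPEG and PNG data is already compressed, so deflating it
    # again would cost time for little or no size reduction.
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for page, index, ext, data in images:
            zf.writestr(archive_name(page, index, ext, by_page), data)
    return buf.getvalue()


images = [
    (1, 1, "jpg", b"placeholder"),
    (1, 2, "jpg", b"placeholder"),
    (2, 1, "png", b"placeholder"),
]
print(zipfile.ZipFile(io.BytesIO(build_zip(images))).namelist())
# ['page1_image1.jpg', 'page1_image2.jpg', 'page2_image1.png']
print(zipfile.ZipFile(io.BytesIO(build_zip(images, by_page=True))).namelist())
# ['page-1/image1.jpg', 'page-1/image2.jpg', 'page-2/image1.png']
```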

What Happens to CMYK Images

PDF images can use a range of color spaces: DeviceGray, DeviceRGB, DeviceCMYK, CalGray, CalRGB, Lab, and ICCBased among them. Browsers work in RGB, so anything outside gray and RGB has to be converted before display. JPEG images stored in CMYK (common in print-workflow PDFs) get converted to RGBA on export, since browsers don't handle CMYK JPEGs consistently.

The conversion formula is straightforward, with C, M, Y, and K each normalized to the range 0 to 1:

R = (1 - C) × (1 - K) × 255
G = (1 - M) × (1 - K) × 255
B = (1 - Y) × (1 - K) × 255

Grayscale images get expanded to RGBA too, with the same value in R, G, and B channels. If you're doing print work and need the raw CMYK data, this extractor won't give you that — you'd need a tool that preserves ICC profiles and CMYK channels end-to-end.
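The formula and the grayscale expansion above can be sketched in a few lines of Python. This is an illustration only, with hypothetical function names; it assumes 8-bit channel values where 0 means no ink, and it ignores ICC profiles entirely.

```python
def cmyk_to_rgba(c: int, m: int, y: int, k: int) -> tuple[int, int, int, int]:
    """Convert one 8-bit CMYK pixel to RGBA using the naive formula.

    Assumption: 0 means no ink and 255 means full ink on every channel.
    """
    cn, mn, yn, kn = (v / 255 for v in (c, m, y, k))
    r = round((1 - cn) * (1 - kn) * 255)
    g = round((1 - mn) * (1 - kn) * 255)
    b = round((1 - yn) * (1 - kn) * 255)
    return (r, g, b, 255)


def gray_to_rgba(v: int) -> tuple[int, int, int, int]:
    """Expand one 8-bit gray value to an opaque RGBA pixel."""
    return (v, v, v, 255)


print(cmyk_to_rgba(0, 0, 0, 0))      # (255, 255, 255, 255), bare paper is white
print(cmyk_to_rgba(255, 0, 0, 0))    # (0, 255, 255, 255), full cyan
print(cmyk_to_rgba(0, 0, 0, 255))    # (0, 0, 0, 255), full black
print(gray_to_rgba(128))             # (128, 128, 128, 255)
```

Because this mapping is not profile-aware, the resulting colors are an approximation of what a color-managed print workflow would produce.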

Why Images Sometimes Look Wrong or Go Missing

A few common failure modes:

Mask images. PDFs can embed soft masks or stencil masks as separate image objects. These show up as extracted images but look like noise or solid blocks. They're not bugs in the extractor — they're legitimate embedded objects that only make visual sense when composited with the image they mask. The PDF spec has about 800 pages of this energy.

Tiled patterns. Some PDFs use tiling pattern color spaces where what appears as a large image is actually a small tile repeated across the page. Each tile is a tiny image object. You'll get many small extracts instead of one large one.

Encrypted or protected PDFs. Password-protected PDFs need to be unlocked before any object extraction can happen. The extractor won't attempt to brute-force or bypass encryption, so decrypt the file first.

Unusual filters. LZWDecode, CCITTFaxDecode (used in fax-style scans), and JBIG2Decode (black-and-white bitmaps, common in scanned legal documents) aren't handled, and images using those filters are skipped. They appear in only a small fraction of real-world PDFs, but they do exist.

How It Runs Locally

The extraction logic is a Rust function compiled to WebAssembly. It uses the lopdf crate to parse the PDF structure and iterate over page image objects, then handles decompression and color space conversion before passing the result back to JavaScript.

No server is involved at any point. The PDF bytes are read into a Uint8Array in the browser, passed directly to the WASM module, and the extracted image data comes back as typed arrays. The whole process happens in memory on your device.

That matters for PDFs containing sensitive content — internal documents, contracts, reports. Sending those to a third-party server to extract a logo or chart is a bad trade. The local approach sidesteps that entirely.

Comparing to Other Approaches

Adobe Acrobat (Pro, not Reader) has an Export Images tool that does essentially the same thing. It handles more edge cases and preserves ICC profiles. If you have Acrobat Pro and need production-quality extraction for print workflows, use that.

pdfimages (part of Poppler, available via brew install poppler on macOS or through most Linux package managers) is the command-line equivalent:

# Extract all images, preserving native format where possible
pdfimages -all input.pdf output/prefix

# List what's in the PDF without extracting
pdfimages -list input.pdf

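The -all flag writes JPEG, JPEG 2000, JBIG2, and CCITT images in their native formats, so embedded JPEGs come out byte-for-byte. Without it, pdfimages decodes everything to PPM (or PBM for monochrome images), the format nobody asked for but everyone eventually receives. For batch processing or scripted workflows, pdfimages is the right tool.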
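For scripted use, a thin wrapper around the command is usually enough. The following Python sketch runs pdfimages -all over every PDF in a folder; the function names and the directory layout are assumptions made for illustration, and it expects pdfimages to be on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf: Path, out_dir: Path) -> list[str]:
    """Build the pdfimages call, using the PDF's base name as output prefix."""
    prefix = out_dir / pdf.stem
    return ["pdfimages", "-all", str(pdf), str(prefix)]


def extract_all(src_dir: Path, out_dir: Path) -> int:
    """Run pdfimages on every *.pdf in src_dir; return how many were processed."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found; install Poppler first")
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for pdf in sorted(src_dir.glob("*.pdf")):
        # check=True raises CalledProcessError if pdfimages exits non-zero.
        subprocess.run(pdfimages_command(pdf, out_dir), check=True)
        count += 1
    return count


if __name__ == "__main__":
    # Show the command that would run for a single file.
    print(pdfimages_command(Path("input.pdf"), Path("output")))
```

Calling extract_all(Path("pdfs"), Path("images")) would then write each document's images next to each other, prefixed with the source file's name.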

Online tools like Smallpdf, ILovePDF, and PDF24 all upload your file to their servers. That's fine for non-sensitive PDFs where convenience matters more. For anything confidential, it's worth using a local option.

The re;file labs extractor sits in the middle: browser-based convenience, local-only processing. No install, no upload, no account.

Try the PDF image extractor and see what's actually embedded in your PDFs.