Extract Images From PDF File using Poppler pdfimages

Working with PDF documents often means dealing with more than just text: many files also contain embedded images. Instead of taking screenshots or relying on online converters, we can use pdfimages, a Poppler command-line tool that extracts embedded images directly from a PDF file. This tutorial explains how to extract images from a PDF file using Poppler pdfimages.

Prepare environment

First, make sure the Poppler utilities are installed on the system. If you are using Ubuntu, you can follow the installation guide; the tools ship in the poppler-utils package.
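Before continuing, it can help to confirm that pdfimages is actually on the PATH. The following is a minimal sketch; the helper name check_pdfimages is made up for this example and is not part of Poppler.

```shell
# Hypothetical helper: report whether the pdfimages binary is on PATH.
check_pdfimages() {
    if command -v pdfimages >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
    fi
}

if [ "$(check_pdfimages)" = "found" ]; then
    # pdfimages -v prints its version banner on stderr; show the first line.
    pdfimages -v 2>&1 | head -n 1
else
    echo "pdfimages not found; install the Poppler utilities first" >&2
fi
```

If the check reports that the tool is missing, install the Poppler utilities and run it again.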

Extracting images from PDF

Let's start by downloading a sample file:

curl -sSo test.pdf https://raw.githubusercontent.com/lazyFrogLOL/llmdocparser/master/llmdocparser/example/attention_is_all_you_need.pdf

To extract all images embedded in this document and save them as PNG files, run:

pdfimages -png test.pdf images

The last argument, images, is the prefix used for the output file names. The command generates a series of sequentially numbered files, for example:

images-000.png
images-001.png
images-002.png
images-003.png
images-004.png
images-005.png
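To see which images a document contains without writing any files, pdfimages also accepts a -list option that prints a table of the embedded images, with details such as page number, dimensions and encoding. The sketch below wraps it in a small function; the name list_images is an assumption made for this example, and the exact columns may vary between Poppler versions.

```shell
# Hypothetical wrapper: print the embedded-image table for a PDF file.
list_images() {
    pdf="$1"
    if [ ! -f "$pdf" ]; then
        echo "no such file: $pdf" >&2
        return 1
    fi
    # -list prints image metadata instead of extracting the images.
    pdfimages -list "$pdf"
}

# Example call; prints a message and carries on if test.pdf
# or pdfimages is not available.
list_images test.pdf || true
```

Comparing the number of rows in this listing with the number of extracted files is a quick way to check that nothing was missed.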

Extracting images from page range

Sometimes you only need images from a section of the PDF rather than the entire document. The -f and -l options let you specify the first and last pages to process. For example, to get images only from pages 2 through 4, run:

pdfimages -png -f 2 -l 4 test.pdf images

This extracts images only from the chosen page range; images on all other pages are skipped.
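When the page range comes from a script, it may be worth validating it before calling the tool. The sketch below is one possible wrapper; extract_range and its argument order are assumptions made for this example. It also passes the -p option, which adds the page number to each output file name so that every image can be traced back to its page.

```shell
# Hypothetical wrapper: extract PNG images from pages first..last of a PDF.
extract_range() {
    pdf="$1"; first="$2"; last="$3"; prefix="$4"
    if [ "$first" -gt "$last" ]; then
        echo "first page ($first) must not exceed last page ($last)" >&2
        return 2
    fi
    # -p includes the page number in each output file name.
    pdfimages -png -p -f "$first" -l "$last" "$pdf" "$prefix"
}

# Example call; runs only when pdfimages and test.pdf are both available.
if command -v pdfimages >/dev/null 2>&1 && [ -f test.pdf ]; then
    extract_range test.pdf 2 4 images
fi
```

Because of -p, the generated names differ from the plain sequential names shown earlier: the page number appears between the prefix and the image counter.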
