Extract Text From PDF File using Poppler pdftotext

Working with PDF files is common, but sometimes you don't need the entire file, just the text inside it. Copying and pasting from a PDF can be frustrating, especially if the file is large or contains multiple pages. Instead of uploading confidential files to online tools, we can use the pdftotext tool, one of the command-line utilities included in Poppler. It runs locally, so the file never leaves your machine. This tutorial explains how to extract text from a PDF file using Poppler pdftotext.

Prepare environment

Before proceeding, confirm that the Poppler utilities are installed on the system. If you are using Ubuntu, you can follow the installation guide.
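A quick way to confirm this from the terminal is to check whether the pdftotext command can be found on the PATH. The snippet below is a minimal sketch; have_tool is a hypothetical helper name introduced only for this example, not something Poppler provides.

```shell
# have_tool is a hypothetical helper for this example, not part of Poppler.
# It returns success if the named command can be found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool pdftotext; then
    echo "pdftotext is installed"
else
    # On Ubuntu the Poppler command-line tools ship in the poppler-utils package.
    echo "pdftotext not found; install the poppler-utils package first"
fi
```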

Converting entire PDF to text

Download a small example file:

curl -sSo test.pdf https://raw.githubusercontent.com/py-pdf/sample-files/master/004-pdflatex-4-pages/pdflatex-4-pages.pdf

Run the following command to extract the text from every page and write it to a new file:

pdftotext test.pdf test.txt

After running it, you'll find all the text from the PDF saved in test.txt.
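If you only want a quick look at the text without creating a file, pdftotext writes to standard output when the output name is a single dash (-). The sketch below wraps that in a small function; pdf_preview and its default of 10 lines are assumptions made for this example.

```shell
# pdf_preview is a hypothetical helper name used only in this example.
# It prints the first N lines (default 10) of a PDF's text.
pdf_preview() {
    local pdf_file=$1
    local lines=${2:-10}
    # "-" as the output file tells pdftotext to write to standard output
    pdftotext "$pdf_file" - | head -n "$lines"
}
```

For example, pdf_preview test.pdf 5 should print the first five lines of text from test.pdf.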

Extracting text from page range

In some cases, you may not need to process the entire file. With the pdftotext command, you can limit extraction to specific pages by adding the -f (first page) and -l (last page) options. Page numbers start at 1, and both ends of the range are included. For example, to pull text only from pages 2 through 4:

pdftotext -f 2 -l 4 test.pdf test.txt

Now, the test.txt file will contain content exclusively from that page range.
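The same two options can be combined in a loop to write each page to its own text file. The following is a minimal sketch rather than a definitive recipe: pdf_split_pages and the output naming scheme are assumptions made for this example, and the page count is read from pdfinfo, another utility included in Poppler.

```shell
# pdf_split_pages is a hypothetical helper name used only in this example.
# It writes each page to its own file: <prefix>-1.txt, <prefix>-2.txt, ...
pdf_split_pages() {
    local pdf_file=$1
    local prefix=$2
    local pages page
    # pdfinfo prints a line such as "Pages:          4"
    pages=$(pdfinfo "$pdf_file" | awk '/^Pages:/ {print $2}')
    # Stop if the page count could not be determined
    [ -n "$pages" ] || return 1
    page=1
    while [ "$page" -le "$pages" ]; do
        # Setting -f and -l to the same number extracts a single page
        pdftotext -f "$page" -l "$page" "$pdf_file" "$prefix-$page.txt"
        page=$((page + 1))
    done
}
```

For example, pdf_split_pages test.pdf page should produce page-1.txt through page-4.txt for the four-page sample file.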
