Extract Text From PDF File using PyPDF2 and Python

May 23, 2023
Python

A common task when working with Python is extracting text from a PDF file. Once extracted, the text can be saved in a format that is easier to work with, such as a plain text file. This is particularly useful if you want to analyze the content of the PDF using natural language processing or other techniques. This tutorial shows how to extract text from a PDF file using Python and the PyPDF2 library.

Prepare environment

Install the following package using pip:

pip install PyPDF2

Code

In the following code, we create a PdfReader object by passing the name of the PDF file from which we want to extract text. Next, we get the total number of pages in the PDF file. We then loop through each page and extract its text. Finally, we print the page number and the extracted text to the console.

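from PyPDF2 import PdfReader

# Open the PDF file; 'sample.pdf' must exist in the working directory
reader = PdfReader('sample.pdf')

# Total number of pages in the document
numPages = len(reader.pages)
for num in range(numPages):
    page = reader.pages[num]
    # Extract the text content of the current page
    text = page.extract_text()

    # Pages are indexed from 0, so add 1 for display
    print('* Page ' + str(num + 1))
    print(text)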

Here's an example of what the output of this code might look like:

* Page 1
Sample text on the first page.
* Page 2
Some more sample text on the second page.
* Page 3
Another page of sample text.
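
As mentioned at the beginning, the extracted text can also be saved to a plain text file instead of being printed. The following is a minimal sketch of that idea, not a definitive implementation. The function names pages_to_text and pdf_to_text_file are our own and are not part of PyPDF2. The sketch also assumes that a page without extractable text (for example, a scanned image) should simply produce an empty section.

```python
from pathlib import Path


def pages_to_text(pages):
    """Join the text of each page under a '* Page N' header.

    `pages` is any iterable of objects with an extract_text() method,
    such as reader.pages from a PyPDF2 PdfReader.
    """
    chunks = []
    for number, page in enumerate(pages, start=1):
        # Fall back to an empty string if no text could be extracted
        text = page.extract_text() or ''
        chunks.append(f'* Page {number}\n{text}')
    return '\n'.join(chunks)


def pdf_to_text_file(pdf_path, txt_path):
    """Extract the text of every page in pdf_path and write it to txt_path."""
    # Imported here so pages_to_text can be used without PyPDF2 installed
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    Path(txt_path).write_text(pages_to_text(reader.pages), encoding='utf-8')
```

Calling pdf_to_text_file('sample.pdf', 'sample.txt') would then write the same page-by-page output shown above into sample.txt. Keeping the formatting in a separate function makes it easy to change the page header or skip empty pages later.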
