Install Tesseract OCR 5 on Ubuntu 20.04

Tesseract is an optical character recognition (OCR) engine originally developed at Hewlett-Packard and later sponsored by Google. It recognizes text in images and supports more than 100 languages. Tesseract is an open-source project released under the Apache License 2.0. We can run Tesseract directly from the command line, and many wrappers make it usable from various programming languages.

At the time of writing, Tesseract OCR 5 is still in development and has not been released yet. However, it can be installed from the tesseract-ocr-devel PPA repository. This tutorial shows how to install Tesseract OCR 5 on Ubuntu 20.04.

Install Tesseract OCR

Add the Tesseract OCR repository:

sudo add-apt-repository -y ppa:alex-p/tesseract-ocr-devel

Install Tesseract OCR 5:

sudo apt install -y tesseract-ocr

When the installation is finished, we can check the Tesseract OCR version:

tesseract --version
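To confirm that the PPA build, and not an older distribution package, is the one on the PATH, the major version can be checked in a script. The following is a minimal sketch; the helper name tesseract_major is hypothetical, and it assumes the first line of the version output has the form "tesseract 5.x...".

```shell
# Print the major version number found on the first line of
# `tesseract --version` output. The text is read from stdin, so the
# helper can be tried without Tesseract installed.
# Assumption: the first line looks like "tesseract 5.x..." (optionally "v5.x").
tesseract_major() {
  sed -n -E '1s/^tesseract[[:space:]]+v?([0-9]+).*/\1/p'
}

if command -v tesseract >/dev/null 2>&1; then
  # 2>&1 because some older releases print version info to stderr
  major=$(tesseract --version 2>&1 | tesseract_major)
  if [ "$major" = "5" ]; then
    echo "Tesseract 5 is installed"
  else
    echo "Found Tesseract major version: $major"
  fi
else
  echo "tesseract not found in PATH"
fi
```

If the reported major version is not 5, the package most likely came from the default Ubuntu repository rather than the PPA.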

Testing Tesseract OCR

Download a test image from the Internet:

wget https://raw.githubusercontent.com/madmaze/pytesseract/master/tests/data/test.png

Now run the tesseract command to recognize the text in the image. The first argument is the image filename. The second argument is the base name of the output file that will hold the recognized text. We don’t need to specify a file extension because a .txt extension is appended automatically.

tesseract test.png result
cat result.txt
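The same two-argument form extends naturally to a folder of images. The sketch below is one possible way to do it; the function name ocr_dir is hypothetical, and it assumes all images are PNG files in a single directory.

```shell
# Run OCR on every PNG in a directory, writing NAME.txt next to NAME.png.
# Assumption: images end in .png; the directory defaults to the current one.
ocr_dir() {
  dir=${1:-.}
  for img in "$dir"/*.png; do
    # Skip the unexpanded pattern when the directory has no PNG files
    [ -e "$img" ] || continue
    # Second argument is the output base name; Tesseract appends .txt
    tesseract "$img" "${img%.png}"
  done
}
```

For example, running ocr_dir . in the directory that holds test.png would produce test.txt alongside it.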

We can write the results to standard output by passing stdout as the output argument.

tesseract test.png stdout
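Tesseract recognizes English by default. Other languages are selected with the -l option, and the installed ones are listed with tesseract --list-langs. The sketch below checks whether a language is available before using it. The helper name has_lang is hypothetical, German (deu) is chosen only for illustration, and the package name assumes Ubuntu's tesseract-ocr-<code> naming.

```shell
# Succeed if a language code appears in `tesseract --list-langs` output.
# The listing is read from stdin: the first line is a header and each
# following line is one language code.
has_lang() {
  tail -n +2 | grep -x "$1" >/dev/null
}

lang=deu  # assumption: German, picked only as an example

if command -v tesseract >/dev/null 2>&1; then
  if tesseract --list-langs 2>&1 | has_lang "$lang"; then
    echo "$lang is available; use it with: tesseract IMAGE stdout -l $lang"
  else
    echo "$lang is missing; try: sudo apt install -y tesseract-ocr-$lang"
  fi
fi
```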

Uninstall Tesseract OCR

If you decide to completely remove Tesseract OCR and its related dependencies, run the following command:

sudo apt purge --autoremove -y tesseract-ocr

Remove the GPG key and the repository:

sudo rm -rf /etc/apt/trusted.gpg.d/alex-p_ubuntu_tesseract-ocr-devel.gpg
sudo rm -rf /etc/apt/sources.list.d/alex-p-ubuntu-tesseract-ocr-devel-focal.list
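To double-check the cleanup, a short script can look for the two files removed above. This is only a sketch; the function name ppa_leftovers is hypothetical, and its optional argument (the apt configuration root) exists only to make the check easy to try against a scratch directory.

```shell
# Print any PPA files that still exist under the given apt config root.
# The two paths are the ones removed by the commands above.
ppa_leftovers() {
  root=${1:-/etc/apt}
  for f in \
    "$root/trusted.gpg.d/alex-p_ubuntu_tesseract-ocr-devel.gpg" \
    "$root/sources.list.d/alex-p-ubuntu-tesseract-ocr-devel-focal.list"; do
    if [ -e "$f" ]; then
      echo "$f"
    fi
  done
}

leftovers=$(ppa_leftovers)
if [ -z "$leftovers" ]; then
  echo "PPA key and source list are gone"
else
  echo "Still present:"
  echo "$leftovers"
fi
```

Once both files are gone, running sudo apt update refreshes the package index so that apt no longer refers to the PPA.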
