Tesseract is an optical character recognition (OCR) engine that recognizes text in images. It was originally developed at Hewlett-Packard, and its development was later sponsored by Google. It supports more than 100 languages. Tesseract is an open-source project released under the Apache License 2.0. We can run Tesseract directly from the command line, and many wrappers make it usable from various programming languages.
This tutorial shows how to install Tesseract OCR 5 on Ubuntu 22.04.
Install Tesseract OCR
Add the Tesseract OCR repository:
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
Install Tesseract OCR 5:
sudo apt install -y tesseract-ocr
When the installation is finished, we can check the Tesseract OCR version:
tesseract --version
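If the PPA was not added successfully, apt may install an older Tesseract release from the default Ubuntu repositories, so it can be worth confirming the major version. The following is a minimal sketch; the function name is_tesseract5 is a hypothetical helper, and it assumes the first line of the version output starts with "tesseract" followed by the version number.

```shell
# Hypothetical helper: succeeds only if the tesseract on PATH reports major version 5.
# Assumes the first line of the output looks like "tesseract 5.x.y".
is_tesseract5() {
    # 2>&1 also captures the version text if a build prints it to stderr
    tesseract --version 2>&1 | head -n 1 | grep -Eq '^tesseract v?5\.'
}

if is_tesseract5; then
    echo "Tesseract 5 is installed"
else
    echo "Tesseract 5 not found" >&2
fi
```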
Testing Tesseract OCR
Download a test image from the Internet:
wget https://raw.githubusercontent.com/madmaze/pytesseract/master/tests/data/test.png
Now run the tesseract command to recognize the text in the image. The image filename is provided as the first argument. The second argument is the output filename, which will hold the recognized text. We don't need to specify a file extension: a .txt extension is appended automatically.
tesseract test.png result
cat result.txt
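Because Tesseract appends the .txt extension itself, the same pattern extends to several images by passing each image name without its extension as the output name. The sketch below assumes PNG input files; ocr_dir is a hypothetical helper name, and the optional second argument is passed to Tesseract's -l option, which selects the language data (eng by default).

```shell
# Hypothetical helper: OCR every PNG in a directory, writing <name>.txt next to each image.
# Usage: ocr_dir [directory] [language]
ocr_dir() {
    dir="${1:-.}"
    lang="${2:-eng}"                   # language data must already be installed
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue      # skip when the glob matched nothing
        base="${img%.png}"             # strip the extension; tesseract appends .txt
        tesseract "$img" "$base" -l "$lang" || echo "failed: $img" >&2
    done
}
```

Running ocr_dir . in the directory that contains test.png should produce test.txt beside it.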
We can write the results to standard output by passing stdout as the output argument:
tesseract test.png stdout
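Writing to standard output makes it possible to pipe the recognized text into other tools without creating an intermediate file. As a small sketch, the hypothetical helper count_words below counts the words Tesseract recognizes in an image; diagnostics on standard error are discarded so that only the recognized text reaches the pipe.

```shell
# Hypothetical helper: print the number of words recognized in an image.
count_words() {
    # "stdout" as the output name sends the text to standard output;
    # 2>/dev/null hides any diagnostic messages written to standard error
    tesseract "$1" stdout 2>/dev/null | wc -w
}
```

For example, count_words test.png prints a single number for the downloaded test image.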
Uninstall Tesseract OCR
If you decide to completely remove Tesseract OCR and its related dependencies, run the following command:
sudo apt purge --autoremove -y tesseract-ocr
Remove the GPG key and the repository:
sudo rm -rf /etc/apt/trusted.gpg.d/alex-p-ubuntu-tesseract-ocr5.gpg*
sudo rm -rf /etc/apt/sources.list.d/alex-p-ubuntu-tesseract-ocr5-jammy.list