
OCR with Tesseract and OpenCV

Posted by Fabian Goßner

Motivation

Although we live in a largely digitalised world, there are still cases where we are forced to communicate on paper. This is not only tedious for us, but also forces the recipients to use complicated and error-prone optical character recognition (OCR) to process the information digitally.

The easiest way to get the OCR job done is to rely on one of the various cloud providers such as AWS, Azure or Google Cloud Platform. Each of them provides ready-to-use APIs for optical character recognition, but this of course involves sending documents into the cloud, which may not be desirable.

Tesseract, in contrast, is an open source solution for optical character recognition that was originally developed by HP in the 1980s. Since 2006 it has been maintained and developed by Google. Nowadays Tesseract is the most popular OCR engine in the open source world, although it is far from perfect out of the box, as it expects a clean, already preprocessed image. Real-world documents will rarely be perfectly aligned or clean, which can make it impossible to retrieve a document's content with Tesseract alone. To work around that, I opted to use OpenCV as a preprocessing step to optimize the image for Tesseract.

Image preprocessing with OpenCV

OpenCV is a free library for computer vision and machine learning which provides various algorithms for working with images. Some interesting use cases for OpenCV in the context of OCR include detecting a document within an image or dewarping scanned book pages. To showcase the power of Tesseract in combination with OpenCV, imagine you want to extract the text from the following image.

Skewed Text Image

Applying Tesseract directly to the above image does not return the expected result, as you can see below.

»Sei gi, Selbst die Verände, u, ‚die Au dir WÜRSchs; für Oiesa Weit «. Gandhj;
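The post does not say how Tesseract was invoked, so the following is only a minimal sketch of one way to run it directly on an image, using the Tesseract command line tool from Python. The function names, the default language `deu` (German, which assumes the German language data is installed) and the file name in the usage comment are assumptions for illustration. The `--psm` flag is spelled this way in Tesseract 4 and later.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang="deu", psm=3):
    """Build the argument list for a Tesseract CLI call that prints to stdout."""
    # "stdout" as the output base tells Tesseract to print the recognised text
    # instead of writing it to a file. PSM 3 is fully automatic page segmentation.
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def run_tesseract(image_path, lang="deu", psm=3):
    """Run Tesseract on an image file and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout.strip()


# Hypothetical usage; "skewed_text.png" is a placeholder file name:
# print(run_tesseract("skewed_text.png"))
```

Calling the command line tool keeps the sketch free of extra Python dependencies; a wrapper library would work just as well.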

To extract the text from the image nonetheless, we can use OpenCV to detect the rotation angle of the text and then rotate the image so that the lines of text run horizontally. The preprocessed image and the final output of Tesseract are shown below.

Skew Corrected Image

„sei du selbst die Veränderung, die du dir wünschst für diese Welt.“ - Mahatma Gandhi

Conclusion

Tesseract in combination with OpenCV has a lot of potential, as OpenCV can be used for various optimization steps before an image is passed to Tesseract. In general, preprocessing with OpenCV involves some combination of rescaling, blurring, thresholding or Canny edge detection. Unfortunately there is no guideline for which technique to use in which scenario, but starting with one of the aforementioned techniques has proven worthwhile.

Image sources

The cover image used in this post was created by Engin Akyurt under the following license. All other images on this page were created by eXXcellent solutions under the terms of the Creative Commons Attribution 4.0 International License.