OCR with Tesseract and OpenCV
Motivation
Although we live in a largely digitalised world, there are still cases where we are forced to use paper-based communication. This is not only tedious for us, but also forces the recipients to rely on complicated and error-prone optical character recognition (OCR) to process the information digitally.
The easiest way to get the OCR job done is to rely on one of the various cloud providers like AWS, Azure or Google Cloud Platform. Each of them provides ready-to-use APIs for performing optical character recognition, but of course this involves sending documents into the cloud, which may not be desired.
Tesseract, in contrast, is an open source solution for optical character recognition which was originally developed by HP in the 1980s. Its development was sponsored by Google from 2006 onwards. Nowadays Tesseract is the most popular OCR engine in the open source world, although it is far from perfect out of the box, as it expects a clean and already preprocessed image. Real-world documents will rarely be perfectly aligned or clean, which can make it hard or even impossible to retrieve the document's content with Tesseract alone. To work around that, I opted to use OpenCV as a preprocessing step to optimize the image for Tesseract.
Image preprocessing with OpenCV
OpenCV is a free library for computer vision and machine learning which provides various algorithms to work with images. Some interesting use cases for OpenCV in the context of OCR include the detection of a document within an image or dewarping scanned book pages. To showcase the power of Tesseract in combination with OpenCV, imagine you want to extract the text of the following image.
Applying Tesseract directly on the above image does not return the expected result as you can see below.
»Sei gi, Selbst die Verände, u, ‚die Au dir WÜRSchs; für Oiesa Weit «. Gandhj;
To still extract the text from the image, we can leverage the power of OpenCV to detect the rotation angle of the text and to rotate the image back so that the text lines run horizontally. The preprocessed image and the final output of Tesseract are shown below.
„sei du selbst die Veränderung, die du dir wünschst für diese Welt.“ - Mahatma Gandhi
Conclusion
Tesseract in combination with OpenCV has a lot of potential, as OpenCV can be used for various optimization steps before an image is passed to Tesseract. Most preprocessing pipelines with OpenCV combine some form of rescaling, blurring, thresholding or Canny edge detection. Unfortunately there is no general guideline for which technique to use in which scenario, but starting with one of the aforementioned techniques has proven worthwhile.
Image sources
The cover image used in this post was created by Engin Akyurt under the following license. All other images on this page were created by eXXcellent solutions under the terms of the Creative Commons Attribution 4.0 International License.