When it comes to recognizing text in document images from Python, there are precious few options, and there are a couple of good reasons why.
Tesseract is arguably the best open-source OCR engine available, and is currently maintained by Google. Unlike other solutions, it comes prepackaged with trained data for a number of languages, so the machine-learning aspects of OCR don’t have to concern you unless you need to recognize an unsupported language, an unusual font, a particular set of distortions, and so on.
However, Tesseract is a C++ library, which basically rules out using it through Python’s ctypes. This isn’t a fault of ctypes, but rather of the lack of a standard name-mangling scheme among C++ compilers: each compiler encodes a function’s namespace, class, and argument types into the exported symbol in its own way, so there’s no reliable way to work out a symbol’s name in the library from Python.
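To make the contrast concrete, here is a minimal sketch of the kind of lookup ctypes is good at: calling a function from a plain C library by its exported name. It assumes a POSIX system with a discoverable C library, and it uses `strlen` purely as an illustration; nothing here touches Tesseract.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX system. find_library() may return None, in which case
# CDLL(None) falls back to the symbols already loaded into the process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# C symbols are exported under the same name used in the source code,
# so ctypes can find "strlen" simply by asking for it.
strlen = getattr(libc, "strlen")

# Declare the C signature so ctypes converts arguments and results correctly.
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t

length = strlen(b"tesseract")
print(length)
```

A C++ method has no such predictable exported name, which is why a thin C wrapper with plain, unmangled function names is the usual bridge between a C++ library and ctypes.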
There is an existing Python solution: python-tesseract, a very heavy wrapper built on SWIG. It also requires a couple of extra libraries, such as OpenCV and NumPy, even if you never use them.
Even if you decide to go the python-tesseract route, you will only be able to get the complete document back as text, because its support for iterating through the parts of the document is broken (see the bug).
So, with all of that said, we accomplished lightweight access to Tesseract from Python by first building CTesseract (which produces a C wrapper for Tesseract; see here), and then writing TightOCR (for Python) around that.
This is the result:
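```python
from tightocr.adapters.api_adapter import TessApi
from tightocr.adapters.lept_adapter import pix_read
from tightocr.constants import RIL_PARA

# Initialize Tesseract for English and read the image to recognize.
t = TessApi(None, 'eng')
p = pix_read('receipt.png')

t.set_image_pix(p)
t.recognize()

# Bail out if the overall recognition confidence is too low.
if t.mean_text_confidence() < 85:
    raise Exception("Too much error.")

# Walk the recognized document one paragraph at a time.
for block in t.iterate(RIL_PARA):
    print(block)
```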
Of course, you can still recognize the document in one pass, too:
```python
from tightocr.adapters.api_adapter import TessApi
from tightocr.adapters.lept_adapter import pix_read

# Initialize Tesseract for English and read the image to recognize.
t = TessApi(None, 'eng')
p = pix_read('receipt.png')

t.set_image_pix(p)
t.recognize()

# Bail out if the overall recognition confidence is too low.
if t.mean_text_confidence() < 85:
    raise Exception("Too much error.")

# Fetch the whole document as a single string.
print(t.get_utf8_text())
```
With the exception of renaming “mean_text_conf” to “mean_text_confidence”, the library keeps most of the names from the original Tesseract API. So, if you’re comfortable with that, you should have no problem with this (if you even have to do more than the above).
I should mention that the original Tesseract library, though a universal and popular OCR solution, is dismally documented. As a result, there are many functions that I’ve left scaffolding for in the project without being entirely sure how to use or test them, and without having any need for them myself. So, I could use help in that area. Just submit issues or pull requests if you want to contribute.