Tesseract is a terrific, trainable (optionally) OCR library currently maintained by Google. However, the only currently-sufficient way to use it from Python is via python-tesseract (a third-party library), and it has two flaws.
The first flaw is that python-tesseract is based on SWIG, and it introduces a lot more code. The second is that the functions may not be functionally compatible. For example, Tesseract will let you iterate through a document by “level” (word, line, paragraph, block, etc..), and allow you to incorporate its layout analysis into your application. This is useful if you need to extract parts of a document based on proximity (or, possibly, location). However, python-tesseract does not currently let you iterate through parts of the document: GetIterator() does not accept a level argument.
So, as a first step to producing a leaner and more analogous Python library, I just released CTesseract: a C-based adapter shared-library that connects to the C++ Tesseract shared-library.
One thought on “New Tesseract OCR C Library”
Comments are closed.