Use TightOCR for Easy OCR from Python

When it comes to recognizing documents from images in Python, there are precious few options, and a couple of good reasons why.

Tesseract is arguably the best open-source OCR engine available, and is currently maintained by Google. Unlike other solutions, it comes prepackaged with trained data for a bunch of languages, so the machine-learning aspects of OCR don’t have to concern you unless you need to recognize an unsupported language, an unusual font, a particular set of distortions, etc.

However, Tesseract comes as a C++ library, which basically takes it out of the running for use with Python’s ctypes. This isn’t a fault of ctypes, but rather of a lack of standardization in symbol naming (“name mangling”) among C++ compilers: there’s no reliable way to determine, from Python, the name under which a given symbol was exported by the library.

There is an existing Python solution: a very heavy wrapper called python-tesseract, which is built on SWIG. It also requires a couple of extra libraries, like OpenCV and numpy, even if you never use them.

Even if you decide to go the python-tesseract route, you will only be able to return the complete document as text, as its support for iterating through the parts of the document is broken (see the bug).

So, with all of that said, we got lightweight access to Tesseract from Python by first building CTesseract (a C wrapper for Tesseract), and then writing TightOCR (for Python) around that.

This is the result:

from tightocr.adapters.api_adapter import TessApi
from tightocr.adapters.lept_adapter import pix_read
from tightocr.constants import RIL_PARA

t = TessApi(None, 'eng')
p = pix_read('receipt.png')
t.set_image_pix(p)
t.recognize()

if t.mean_text_confidence() < 85:
    raise Exception("Too much error.")

for block in t.iterate(RIL_PARA):
    print(block)

Of course, you can still recognize the document in one pass, too:

from tightocr.adapters.api_adapter import TessApi
from tightocr.adapters.lept_adapter import pix_read

t = TessApi(None, 'eng')
p = pix_read('receipt.png')
t.set_image_pix(p)
t.recognize()

if t.mean_text_confidence() < 85:
    raise Exception("Too much error.")

print(t.get_utf8_text())

With the exception of renaming “mean_text_conf” to “mean_text_confidence”, the library keeps most of the names from the original Tesseract API. So, if you’re comfortable with that, you should have no problem with this (if you even have to do more than the above).

I should mention that the original Tesseract library, though a universal and popular OCR solution, is dismally documented. As a result, I’ve left scaffolding in the project for many functions that I’m not entirely sure how to use or test, and that I have no need for myself. I could use help in that area: just submit issues or pull requests if you want to contribute.

OCR a Document by Section

A previous post described how to use Tesseract to OCR a document to a single string. This post will describe how to take advantage of Tesseract’s internal layout processing to iterate through the document’s sections (as determined by Tesseract).

This is the core logic to iterate through the document. Each section is referred to as a “paragraph”:

if(api->Recognize(NULL) < 0)
{
    api->End();
    pixDestroy(&image);

    return 3;
}

tesseract::ResultIterator *it = api->GetIterator();

do 
{
    if(it->Empty(tesseract::RIL_PARA))
        continue;

    char *para_text = it->GetUTF8Text(tesseract::RIL_PARA);

    // para_text is the recognized text. It [usually] has a
    // newline on the end.

    delete[] para_text;
} while (it->Next(tesseract::RIL_PARA));

delete it;

To add some validation to the recognized content, we’ll also check the recognition confidence. In my experience, it seems like any recognition scoring less than 80% turned out as gibberish. In that case, you’ll have to do some preprocessing to remove some of the risk.

int confidence = api->MeanTextConf();
printf("Confidence: %d\n", confidence);
if(confidence < 80)
    printf("Confidence is low!\n");

The whole program:

#include <stdio.h>

#include <tesseract/baseapi.h>
#include <tesseract/resultiterator.h>

#include <leptonica/allheaders.h>

int main()
{
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

    // Initialize tesseract-ocr with English, without specifying tessdata path.
    if (api->Init(NULL, "eng")) 
    {
        api->End();
        delete api;

        return 1;
    }

    // Open input image with leptonica library
    Pix *image;
    if((image = pixRead("receipt.png")) == NULL)
    {
        api->End();
        delete api;

        return 2;
    }
    
    api->SetImage(image);
    
    if(api->Recognize(NULL) < 0)
    {
        api->End();
        delete api;
        pixDestroy(&image);

        return 3;
    }

    int confidence = api->MeanTextConf();
    printf("Confidence: %d\n", confidence);
    if(confidence < 80)
        printf("Confidence is low!\n");

    tesseract::ResultIterator *it = api->GetIterator();

    do 
    {
        printf("=================\n");
        if(it->Empty(tesseract::RIL_PARA))
            continue;

        char *para_text = it->GetUTF8Text(tesseract::RIL_PARA);
        printf("%s", para_text);
        delete[] para_text;
    } while (it->Next(tesseract::RIL_PARA));
  
    delete it;

    // Release the API object itself, not just its internal state.
    api->End();
    delete api;
    pixDestroy(&image);

    return 0;
}

Applying this routine to an invoice (randomly found with Google), it is far easier to identify the address, tax, total, etc. than with the previous method (which was naive about layout):

Confidence: 89
=================
Invoice |NV0010 '
To
Jackie Kensington
18 High St
Wickham
 Electrical
Sevices Limited

=================
Certificate Number CER/123-34 From
  1”” E'e°‘''°‘'’‘'Se”'°‘*’
17 Harold Grove, Woodhouse, Leeds,

=================
West Yorkshire, LS6 2EZ

=================
Email: info@ mj-e|ectrcia|.co.uk

=================
Tel: 441132816781

=================
Due Date : 17th Mar 2012

=================
Invoice Date : 16th Feb 2012

=================
Item Quantity Unit Price Line Price
Electrical Labour 4 £33.00 £132.00

=================
Installation carried out on flat 18. Installed 3 new
brass effect plug fittings. Checked fuses.
Reconnected light switch and fitted dimmer in living
room. 4 hours on site at £33 per hour.

=================
Volex 13A 2G DP Sw Skt Wht Ins BB Round Edge Brass Effect 3 £15.57 £46.71
Volex 4G 1W 250W Dimmer Brushed Brass Round Edge 1 £32.00 £32.00
Subtotal £210.71

=================
VAT £42.14

=================
Total £252.85

=================
Thank you for your business — Please make all cheques payable to ‘Company Name’. For bank transfers ‘HSBC’, sort code
00-00-00, Account no. 01010101.
MJ Electrical Services, Registered in England & Wales, VAT number 9584 158 35
 Q '|'.~..a::

OCR a Document with C++

Tesseract is an OCR library originally developed by HP and now maintained by Google. It works out of the box, and does not require training.

Building it is trivial, but installing it from packages is easier still:

$ sudo apt-get install libtesseract3 libtesseract-dev
$ sudo apt-get install liblept3 libleptonica-dev
$ sudo apt-get install tesseract-ocr-eng

Note that this installs the data for recognizing English.

Now, go and get the example code from the Google Code wiki for the project, and paste it into a file called ocr-test.cpp. Also, right-click and save the example document image (a random image I found with Google). You don’t have to use this particular document, as long as the one you use is sufficiently clear and at a high enough resolution (the example is about 1500×2000).

Now, change the path of the image that the example code reads:

Pix *image = pixRead("letter.jpg");

Compile/link it:

$ g++ -o ocr-test ocr-test.cpp -ltesseract -llept

Run the example:

$ ./ocr-test

You’re done. The following will be displayed:

OCR output:
fie’  1?/2440
Brussels,
BARROSO (2012) 1300171
BARROSO (2012)

Dear Lord Tugendhat.

Thank you for your letter of 29 October and for inviting the European Commission to

contribute in the context of the Economic Aflairs Committee's inquiry into "The

Economic lmplicationsfirr the United Kingdom of Scottish Independence ".

The Committee will understand that it is not the role of the European Commission to

express a position on questions of internal organisation related to the constitutional

arrangements of a particular Member State.

Whilst refraining from comment on possible fitture scenarios. the European Commission

has expressed its views in general in response to several parliamentary questions from

Members of the European Parliament. In these replies the European Commission has 
noted that scenarios such as the separation of one part of a Member State or the creation 
of a new state would not be neutral as regards the EU Treaties. The European 
Commission would express its opinion on the legal consequences under EU law upon ;
requestfiom a Member State detailing a precise scenario. :
The EU is founded on the Treaties which apply only to the Member States who have 3
agreed and ratified them. if part of the territory of a Member State would cease to be ,
part of that state because it were to become a new independent state, the Treaties would

no longer apply to that territory. In other words, a new independent state would, by the
fact of its independence, become a third country with respect to the E U and the Treaties

would no longer apply on its territory. ‘

../.

The Lord TUGENDHAT
Acting Chairman

House of Lords q
Committee Oflice
E-mail: economicaflairs@par1igment.ttk

Displaying C++ vtables

A vtable is a mapping that allows your C++ application to properly reconcile the function pointers for base classes that have virtual methods and the child classes that override those methods (or do not override them). A class that does not have virtual methods will not have a vtable.

A vtable is pointed to by a pointer (“vpointer”) stored in each object, usually at the very top. The vtable itself is shared by all objects of a particular class.

Though you can derive the pointer yourself, or use gdb, ddd, etc. to display it, you can also have the compiler dump the layout:

Source code:

class BaseClass
{
    public:

    virtual int call_me1()
    {
        return 5;
    }

    virtual int call_me2()
    {
        return 10;
    }

    int call_me3()
    {
        return 15;
    }
};

class ChildClass : public BaseClass
{
    public:

    int call_me1()
    {
        return 20;
    }

    int call_me2()
    {
        return 25;
    }
};

Compile this with (on GCC 8 and later, the option is spelled -fdump-lang-class instead):

g++ -fdump-class-hierarchy -o vtable_example vtable_example.cpp

This emits a “.class” file that has the following (I’ve skipped some irrelevant information at the top, about other types):

Vtable for BaseClass
BaseClass::_ZTV9BaseClass: 4u entries
0     (int (*)(...))0
4     (int (*)(...))(& _ZTI9BaseClass)
8     (int (*)(...))BaseClass::call_me1
12    (int (*)(...))BaseClass::call_me2

Class BaseClass
   size=4 align=4
   base size=4 base align=4
BaseClass (0x0xb6a09230) 0 nearly-empty
    vptr=((& BaseClass::_ZTV9BaseClass) + 8u)

Vtable for ChildClass
ChildClass::_ZTV10ChildClass: 4u entries
0     (int (*)(...))0
4     (int (*)(...))(& _ZTI10ChildClass)
8     (int (*)(...))ChildClass::call_me1
12    (int (*)(...))ChildClass::call_me2

Class ChildClass
   size=4 align=4
   base size=4 base align=4
ChildClass (0x0xb76fdc30) 0 nearly-empty
    vptr=((& ChildClass::_ZTV10ChildClass) + 8u)
  BaseClass (0x0xb6a092a0) 0 nearly-empty
      primary-for ChildClass (0x0xb76fdc30)

Compiling your C/C++ and Obj C/C++ Code with “clang” (LLVM)

clang is a recent addition to the landscape of development within the C family. Though GCC is a household name (well, in my household), clang is built on LLVM, a modular and versatile compiler platform. In fact, because it’s built on LLVM, clang can emit LLVM’s intermediate representation in a human-readable form (the textual counterpart of its bitcode):

Source:

#include <stdio.h>

int main()
{
    printf("Testing.\n");

    return 0;
}

Command:

clang -emit-llvm -S main.c -o -

Output:

; ModuleID = 'main.c'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:32:32-n8:16:32-S128"
target triple = "i386-pc-linux-gnu"

@.str = private unnamed_addr constant [10 x i8] c"Testing.\0A\00", align 1

define i32 @main() nounwind {
  %1 = alloca i32, align 4
  store i32 0, i32* %1
  %2 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([10 x i8]* @.str, i32 0, i32 0))
  ret i32 0
}

declare i32 @printf(i8*, ...)

Benefits of using clang versus gcc include the following:

  • Considerably better error messages (a very popular feature).
  • Considerably faster compilation and lower resource usage, across the board, according to the clang project (http://clang.llvm.org/features.html#performance). Discussions on Stack Overflow suggest that this might not always be the case, though.
  • Its ASTs and code are allegedly simpler and more straightforward for those who would like to study them.
  • It’s a single parser for the C family of languages (including Objective-C/C++, but not C#), and is designed to be extended further.
  • It’s built as an API, so it can be bound by other tools.

clang is not nearly as mature as GCC, though I haven’t seen (as a casual observer) much negative feedback due to this.

To do a two-part build like you would with GCC, the basic parameters are similar, though there are six pages of parameters available:

clang -emit-llvm -o foo.bc -c foo.c
clang -o foo foo.bc

It’s important to mention that clang comes bundled with a static analyzer. This means that checking your code for bugs at a deeper level than the compiler is concerned with is that much more accessible. For example, if we adjust the code above to do an allocation, but neglect to free it:

#include <stdio.h>
#include <stdlib.h>

int main()
{
    printf("Testing.\n");

    void *new = malloc((size_t)2000);

    return 0;
}

We can build, while also telling clang to invoke the static-analyzer:

clang --analyze main.c -o main
main.c:8:11: warning: Value stored to 'new' during its initialization is never read
    void *new = malloc((size_t)2000);
          ^~~   ~~~~~~~~~~~~~~~~~~~~
main.c:10:5: warning: Memory is never released; potential leak of memory pointed to by 'new'
    return 0;
    ^
2 warnings generated.

In truth, I don’t know how clang’s static analyzer compares with Valgrind, the standard, heavyweight, open-source tool for tracking down memory problems. The two work differently, though: Valgrind is a dynamic-analysis tool that actually runs your program and watches to make sure that your allocations are managed properly, whereas clang’s static analyzer only inspects the source code and never executes it.

Vectors in C

I’ve implemented a vector type called “list” in C. It uses contiguous blocks of memory and grows in the same way as C++’s STL vectors.

This is the example that’s bundled with it:

#include <stdio.h>
#include <inttypes.h> // For the PRIu8/PRIu32 format macros.

#include "list.h"

static bool enumerate_cb(list_t *list, 
                         uint32_t index, 
                         void *value, 
                         void *context)
{
    char *text = (char *)value;
    printf("Item (%" PRIu32 "): [%s]\n", index, text);

    // Return false to stop enumeration (enumeration will return successful).
    return true;
}

int main()
{
    list_t list;
    const uint32_t entry_width = 20;
    
    if(list_init(&list, entry_width) != 0)
    {
        printf("Could not initialize list.\n");
        return 1;
    }

    char text[20];
    const uint8_t count = 10;
    uint8_t i = 0;
    while(i < count)
    {
        snprintf(text, 20, "Test: %" PRIu8, i);
        printf("Pushing: %s\n", text);

        if(list_push(&list, text) != 0)
        {
            printf("Could not push item.\n");
            return 2;
        }
    
        i++;
    }

    printf("\n");

    // NOTE: For efficiency, this is a reference to within the list. If you
    //       want a copy, make a copy. If you want to make sure this is thread-
    //       safe, use a lock.
    void *retrieved;
    if((retrieved = list_get(&list, 5)) == NULL)
    {
        printf("Could not retrieve item.\n");
        return 3;
    }

    printf("Retrieved: %s\n", (char *)retrieved);
    printf("Removing.\n");

    if(list_remove(&list, 5) != 0)
    {
        printf("Could not remove item.\n");
        return 4;
    }

    printf("\n");
    printf("Enumerating:\n");

    if(list_enumerate(&list, enumerate_cb, NULL) != 0)
    {
        printf("Could not enumerate list.\n");
        return 5;
    }

    if(list_destroy(&list) != 0)
    {
        printf("Could not destroy list.\n");
        return 6;
    }

    return 0;
}

Output:

$ ./example 
Pushing: Test: 0
Pushing: Test: 1
Pushing: Test: 2
Pushing: Test: 3
Pushing: Test: 4
Pushing: Test: 5
Pushing: Test: 6
Pushing: Test: 7
Pushing: Test: 8
Pushing: Test: 9

Retrieved: Test: 5
Removing.

Enumerating:
Item (0): [Test: 0]
Item (1): [Test: 1]
Item (2): [Test: 2]
Item (3): [Test: 3]
Item (4): [Test: 4]
Item (5): [Test: 6]
Item (6): [Test: 7]
Item (7): [Test: 8]
Item (8): [Test: 9]

Using “dialog” for Nice, Easy, C-Based Console Dialogs

dialog is a great command-line-based dialog tool that lets you construct twenty-three types of dialog screens that rival the best of the available dialog utilities.

It’s as simple as running the following from the command-line:

dialog --yesno "Yes or no, please." 6 30

Very few of the users of dialog probably know that it can be statically linked to provide the same functionality in a C application. It doesn’t help that there is almost no documentation on the subject.

This is an example of how to create a “yesno” dialog:

#include <curses.h>
#include <dialog.h>

int main()
{
    int rc;
    init_dialog(stdin, stderr);
    rc = dialog_yesno("title", "message", 0, 0);
    end_dialog();

    return rc;
}

I explicitly pre-include curses.h so dialog.h won’t go looking in the wrong place. It might be different in your situation.

To build:

gcc -o example example.c -L dialogpath -I dialogpath -ldialog -lncurses -lm

Just configure and build your dialog sources, and then use that path in the build command, above.

This program will return an integer representing which button was pressed (0 for yes, 1 for no), or 255 if the dialog was cancelled with ESC.