OCR a Document by Section

A previous post described how to use Tesseract to OCR a document into a single string. This post describes how to take advantage of Tesseract’s internal layout analysis to iterate through the document’s sections (as determined by Tesseract).

This is the core logic to iterate through the document. Each section is referred to as a “paragraph”:

if(api->Recognize(NULL) < 0)
{
    api->End();
    pixDestroy(&image);

    return 3;
}

tesseract::ResultIterator *it = api->GetIterator();

do 
{
    if(it->Empty(tesseract::RIL_PARA))
        continue;

    char *para_text = it->GetUTF8Text(tesseract::RIL_PARA);

    // para_text is the recognized text. It [usually] has a
    // newline on the end.

    delete[] para_text;
} while (it->Next(tesseract::RIL_PARA));

delete it;
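
Since the paragraph text usually arrives with a newline on the end, it can help to copy it into a std::string and trim it before doing anything else with it. This is only a minimal sketch: trim_paragraph is a hypothetical helper name of my own, not part of the Tesseract API, and it assumes you still delete[] the original buffer yourself afterwards.

```cpp
#include <string>

// Hypothetical helper (not part of Tesseract): copy the text returned by
// GetUTF8Text() and strip trailing spaces, tabs, and line breaks.
// A NULL pointer or a whitespace-only paragraph yields an empty string.
std::string trim_paragraph(const char *para_text)
{
    std::string text(para_text ? para_text : "");

    // Position of the last character that is not trailing whitespace.
    std::size_t last = text.find_last_not_of(" \t\r\n");
    if (last == std::string::npos)
        return std::string();

    text.erase(last + 1);
    return text;
}
```

Inside the loop, this would be called as trim_paragraph(para_text) just before the delete[]. Line breaks inside the paragraph are left alone; only the tail is trimmed.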

To validate the recognized content, we’ll also check the recognition confidence, which Tesseract reports as a value from 0 to 100. In my experience, any result scoring below 80 turned out to be gibberish. In that case, you’ll have to preprocess the image before recognition to reduce that risk.

int confidence = api->MeanTextConf();
printf("Confidence: %d\n", confidence);
if(confidence < 80)
    printf("Confidence is low!\n");
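
MeanTextConf() gives one number for the whole page, so a single bad section can hide behind an otherwise good score. As far as I know, the result iterator also exposes a per-element score through Confidence(tesseract::RIL_PARA), but treat that as an assumption to check against your Tesseract version. The sketch below only shows the filtering step once you have collected text and score pairs: the Paragraph struct and flag_low_confidence are hypothetical names, and the default threshold of 80 is just the rule of thumb from above.

```cpp
#include <string>
#include <vector>

// Hypothetical container for one recognized section and its score (0 to 100).
struct Paragraph
{
    std::string text;
    float confidence;
};

// Hypothetical helper: return the paragraphs scoring below the threshold,
// so they can be reviewed or re-run after preprocessing.
std::vector<Paragraph> flag_low_confidence(const std::vector<Paragraph> &paras,
                                           float threshold = 80.0f)
{
    std::vector<Paragraph> flagged;
    for (const Paragraph &p : paras)
    {
        if (p.confidence < threshold)
            flagged.push_back(p);
    }
    return flagged;
}
```

A paragraph scoring exactly at the threshold is kept, which matches the "less than 80" check used for the page-level score.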

The whole program:

#include <stdio.h>

#include <tesseract/baseapi.h>
#include <tesseract/resultiterator.h>

#include <leptonica/allheaders.h>

int main()
{
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

    // Initialize tesseract-ocr with English, without specifying tessdata path.
    if (api->Init(NULL, "eng")) 
    {
        api->End();

        return 1;
    }

    // Open input image with leptonica library
    Pix *image;
    if((image = pixRead("receipt.png")) == NULL)
    {
        api->End();

        return 2;
    }
    
    api->SetImage(image);
    
    if(api->Recognize(NULL) < 0)
    {
        api->End();
        pixDestroy(&image);

        return 3;
    }

    int confidence = api->MeanTextConf();
    printf("Confidence: %d\n", confidence);
    if(confidence < 80)
        printf("Confidence is low!\n");

    tesseract::ResultIterator *it = api->GetIterator();

    do 
    {
        printf("=================\n");
        if(it->Empty(tesseract::RIL_PARA))
            continue;

        char *para_text = it->GetUTF8Text(tesseract::RIL_PARA);
        printf("%s", para_text);
        delete[] para_text;
    } while (it->Next(tesseract::RIL_PARA));
  
    delete it;

    api->End();
    delete api; // the API object was allocated with new above
    pixDestroy(&image);

    return 0;
}

Applying this routine to an invoice (randomly found with Google), it is far easier to identify the address, tax, total, etc. than with the previous method (which was naive about layout):

Confidence: 89
=================
Invoice |NV0010 '
To
Jackie Kensington
18 High St
Wickham
 Electrical
Sevices Limited

=================
Certificate Number CER/123-34 From
  1”” E'e°‘''°‘'’‘'Se”'°‘*’
17 Harold Grove, Woodhouse, Leeds,

=================
West Yorkshire, LS6 2EZ

=================
Email: info@ mj-e|ectrcia|.co.uk

=================
Tel: 441132816781

=================
Due Date : 17th Mar 2012

=================
Invoice Date : 16th Feb 2012

=================
Item Quantity Unit Price Line Price
Electrical Labour 4 £33.00 £132.00

=================
Installation carried out on flat 18. Installed 3 new
brass effect plug fittings. Checked fuses.
Reconnected light switch and fitted dimmer in living
room. 4 hours on site at £33 per hour.

=================
Volex 13A 2G DP Sw Skt Wht Ins BB Round Edge Brass Effect 3 £15.57 £46.71
Volex 4G 1W 250W Dimmer Brushed Brass Round Edge 1 £32.00 £32.00
Subtotal £210.71

=================
VAT £42.14

=================
Total £252.85

=================
Thank you for your business — Please make all cheques payable to ‘Company Name’. For bank transfers ‘HSBC’, sort code
00-00-00, Account no. 01010101.
MJ Electrical Services, Registered in England & Wales, VAT number 9584 158 35
 Q '|'.~..a::
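
With the output split into paragraphs like the ones above, pulling out a field such as the total can be sketched as a scan for a line that starts with a known label. This is a rough illustration rather than a robust parser: find_amount is a hypothetical helper, it assumes the paragraphs have been collected into a std::vector<std::string>, and it returns the amount as text so that nothing is lost to floating-point rounding.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: look through every line of every paragraph for one
// that begins with `label`, and return the first run of digits and dots
// that follows it. Returns an empty string if nothing matches.
std::string find_amount(const std::vector<std::string> &paragraphs,
                        const std::string &label)
{
    for (const std::string &para : paragraphs)
    {
        std::istringstream lines(para);
        std::string line;
        while (std::getline(lines, line))
        {
            // The line must start with the label, so "Total" does not
            // match a line starting with "Subtotal".
            if (line.compare(0, label.size(), label) != 0)
                continue;

            std::size_t start = line.find_first_of("0123456789", label.size());
            if (start == std::string::npos)
                continue;

            std::size_t end = line.find_first_not_of("0123456789.", start);
            if (end == std::string::npos)
                return line.substr(start);
            return line.substr(start, end - start);
        }
    }
    return "";
}
```

Scanning line by line matters here because some fields, such as the subtotal in the output above, end up as the last line of a larger paragraph rather than in a paragraph of their own.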