Creating and Controlling OS Services from Python

One important deployment task of server software is to not only deploy the software and then start it, but to enable it to be automatically started and monitored by the OS at future reboots. The most modern solution for this type of management is Upstart. You access Upstart every time you call “sudo service apache2 restart”, and whatnot. Upstart is sponsored by Ubuntu (more specifically, Canonical).

Upstart configs are located in /etc/init (we’re slowly, slowly approaching the point where we might one day be able to get rid of the System-V init scripts, in /etc/init.d). To create a job, you drop a “xyz.conf” file into /etc/init, and Upstart should automatically become aware of it via inotify. To query Upstart (including starting and stopping jobs), you emit a D-Bus message.

So, what about elegantly automating the creation of a job for the service from your Python deployment code? There is exactly one solution for doing so, and it’s a Swiss Army Knife for such a task.

We’re going to use the Python upstart library to build a job and then write it (in fact, we’re just going to share one of their examples, for your convenience). The library also allows for listing the jobs on the system, getting statuses, and starting/stopping jobs, among other things, but we’ll leave it to you to experiment with this, when you’re ready.

Build a job that starts and stops on the normal run-levels, respawns when it terminates, and runs a single command (a non-forking process, otherwise we’d have to add the ‘expect’ stanza as well):

from upstart.job import JobBuilder

jb = JobBuilder()

# Build the job to start/stop with default runlevels to call a command.
jb.description('My test job.').\
   author('Dustin Oprea <dustin@nowhere.com>').\
   start_on_runlevel().\
   stop_on_runlevel().\
   run('/usr/bin/my_daemon')

with open('/etc/init/my_daemon.conf', 'w') as f:
    f.write(str(jb))

Remember to run this as root. The job output looks like this:

description "My test job."
author "Dustin Oprea <dustin@nowhere.com>"
start on runlevel [2345]
stop on runlevel [016]
respawn 
exec /usr/bin/my_daemon

Making an APNS Push Certificate

It turns out that producing a certificate-request that Apple will accept in order to authorize you to send notifications to a client’s phone on their behalf is nightmarish, due to the shear lack of information on the subject (Apple provides no documentation).

Behold, csr_to_apns_csr. As long as you have your “MDM vendor certificate” (a P12 certificate that Apple gives you) and the CSR for your client, you’re in business.

$ csr_to_apns_csr -h
usage: csr_to_apns_csr [-h] [-x] csr vendor_p12 vendor_p12_pass

Produce an Apple-formatted APNS 'push' CSR.

positional arguments:
  csr              client CSR (PEM)
  vendor_p12       MDM vendor P12 certificate (DER)
  vendor_p12_pass  passphrase for MDM vendor P12 certificate

optional arguments:
  -h, --help       show this help message and exit
  -x, --xml        show raw XML

Using ssl.wrap_socket for Secure Sockets in Python

Ordinarily, the prospect of having to deal with SSL-encrypted sockets would be enough to make the best of us take a long weekend. However, Python provides some prepackaged functionality to accommodate this. It’s called “wrap_socket”. The only reason that I ever knew about this was from reverse engineering, as I’ve never come upon this in a blog/article.

Here’s an example. Note that I steal the CA bundle from requests, for the purpose of this example. Use whichever bundle you happen to have available (they should all be relatively similar, but will generally be located different places on your system, depending on your OS/distribution).

import ssl
import socket

s_ = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = ssl.wrap_socket(s_, 
                    ca_certs='/usr/local/lib/python2.7/dist-packages/requests/cacert.pem', 
                    cert_reqs=ssl.CERT_REQUIRED)

s.connect(('www.google.com', 443))

# s.cipher() - Returns a tuple: ('RC4-SHA', 'TLSv1/SSLv3', 128)
# s.getpeercert() - Returns a dictionary: 
#
#   {'notAfter': 'May 15 00:00:00 2014 GMT',
#    'subject': ((('countryName', u'US'),),
#                (('stateOrProvinceName', u'California'),),
#                (('localityName', u'Mountain View'),),
#                (('organizationName', u'Google Inc'),),
#                (('commonName', u'www.google.com'),)),
#    'subjectAltName': (('DNS', 'www.google.com'),)}

s.write("""GET / HTTP/1.1\r
Host: www.google.com\r\n\r\n""")

# Read the first part (might require multiple reads depending on size and 
# encoding).
d = s.read()
s.close()

Obviously, your data sits in d, after this code runs.

Using etcd as a Clusterized, Immediately-Consistent Key-Value Storage

The etcd project was one of the first popular, public platforms built on the Raft algorithm (a relatively simple consensus algorithm, used to allow several nodes to remain in sync). Raft represents a shift away from its predecessor, Paxos, which is considerably more difficult to understand, and usually requires shortcuts to implement. As an added bonus, etcd is also implemented in Go.

etcd looks and smells like every other KV store, with three especially-notable differences:

  • You can maintain a heirarchy of keys.
  • You can long-poll for changes on keys.
  • Distributed-locks are built-in.

We’re going to use Python’s etcd package (project is here). This package presents a very intuitive interface that completely manages responses from the server and is built in such a way that future API changes should be backward-compatible (to within reason). These things are important, as other clients have historically allowed the application too much direct access to the actual server requests, and left too much of the interpretation of the responses to the application as well.

To connect the client (assuming the same machine with the default port):

from etcd import Client

c = Client()

To set a value:

c.node.set('/test/key', 5)

To get a value:

r = c.node.get('/test/key')
print(r.node.value)

Which outputs:

5

To wait on a value to change, run this from another terminal:

r = c.node.wait('/test/key')

Try setting the node to something else using a command similar to before. The wait call will return with the same result as the instance of the client that actually made the request.

To work with distributed locks, just wrap the code that needs to be synchronized in a with statements:

with c.module.lock.get_lock('test_lock_1', ttl=10):
    print("In lock 1.")

It’s worth mentioning that the response objects have a consistent and informative interface no matter what the operation. You can see a number properties just by printing it. This is from the set operation above:

<RESPONSE: <NODE(ResponseV2AliveNode) [set] [/test/key] IS_HID=[False] IS_DEL=[False] IS_DIR=[False] IS_COLL=[False] TTL=[None] CI=(2) MI=(2)>>

This is from the get operation:

<RESPONSE: <NODE(ResponseV2AliveNode) [get] [/test/key] IS_HID=[False] IS_DEL=[False] IS_DIR=[False] IS_COLL=[False] TTL=[None] CI=(2) MI=(2)>>

I’ll omit the examples of working with heirarchical keys because the functionality is every bit as intuitive as it should be.

There’s a lot of functionality in the Python etcd package, but it’s built to be lightweight and obvious. The GitHub page is extremely thorough, and the API is also completely documented at ReadTheDocs.

Manager Namespaces for IPC Between Python Process Pools

Arguably, one of the functionalities that best represent why Python has so many multidisciplinary uses is its multiprocessing library. This library allows Python to maintain pools of processes and communication between these processes with most of the simplicity of a standard multithreaded application (asynchronously invoking a function, locking, and IPC). This is not to say that Python can’t do threads, too, but, due to being able to quickly run map/reduce operations or asynchronous tasks using a very simple set of functions combined with the disadvantages of having to consider the GIL when doing multithreaded development, I believe the multiprocess design to be more popular by a landslide.

There are mountains of examples for how to use multiprocessing, along with sufficient documentation for most of the IPC mechanisms that can be used to communicate between processes: queues, pipes, “manager”-based shares and proxy objects, shared ctypes types, multiprocessing-based “client” and “listener” sockets, etc..

There is a very subtle IPC mechanism called a “namespace” (which is actual part of Manager), whose presence in the documentation only speaks for a couple of lines of the thousands that are there. It’s easy, and worth special mention.

from multiprocessing import Pool, Manager
from os import getpid
from time import sleep

def _worker(ns):
    pid = getpid()
    print("%d: Worker started." % (pid))

    while ns.is_running is True:
        sleep(1)

    print("%d: Worker terminating." % (pid))

m = Manager()
ns = m.Namespace()
ns.is_running = True

num_workers = 5
p = Pool(num_workers)

for i in xrange(num_workers):
    p.apply_async(_worker, (ns,))

sleep(10)
print("Shutting down.")

ns.is_running = False
p.close()
p.join()

print("All workers joined.")

The output:

52893: Worker started.
52894: Worker started.
52895: Worker started.
52896: Worker started.
52897: Worker started.
Shutting down.
52894: Worker terminating.
52893: Worker terminating.
52895: Worker terminating.
52896: Worker terminating.
52897: Worker terminating.
All workers joined.

A namespace is very much like a bulletin board, where attributes can be assigned by one process, and read by others. This works for immutable values like strings and primitive types. Otherwise, updates can’t be tracked properly:

from multiprocessing import Manager

m = Manager()
ns = m.Namespace()

ns.test_value = 'original value'
ns.test_list = [5]

print("test_value (master, original): %s" % (ns.test_value))
print("test_list (master, original): %s" % (ns.test_list))

ns.test_value = 'new value'
ns.test_list.append(10)

print("test_value (master, updated): %s" % (ns.test_value))
print("test_list (master, updated): %s" % (ns.test_list))

Output:

test_value (master, original): original value
test_list (master, original): [5]
test_value (master, updated): new value
test_list (master, updated): [5]

Though not working for some types, namespaces are a terrific mechanism for sharing counters and flags between processes.

OCR a Document by Section

A previous post described how to use Tesseract to OCR a document to a single string. This post will describe how to take advantage of Tesseract’s internal layout processing to iterate through the documents sections (as determined by Tesseract).

This is the core logic to iterate through the document. Each section is referred to as a “paragraph”:

if(api->Recognize(NULL) < 0)
{
    api->End();
    pixDestroy(&image);

    return 3;
}

tesseract::ResultIterator *it = api->GetIterator();

do 
{
    if(it->Empty(tesseract::RIL_PARA))
        continue;

    char *para_text = it->GetUTF8Text(tesseract::RIL_PARA);

// para_text is the recognized text. It [usually] has a 
// newline on the end.

    delete[] para_text;
} while (it->Next(tesseract::RIL_PARA));

delete it;

To add some validation to the recognized content, we’ll also check the recognition confidence. In my experience, it seems like any recognition scoring less than 80% turned out as gibberish. In that case, you’ll have to do some preprocessing to remove some of the risk.

int confidence = api->MeanTextConf();
printf("Confidence: %d\n", confidence);
if(confidence < 80)
    printf("Confidence is low!\n");

The whole program:

#include <stdio.h>

#include <tesseract/baseapi.h>
#include <tesseract/resultiterator.h>

#include <leptonica/allheaders.h>

int main()
{
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

    // Initialize tesseract-ocr with English, without specifying tessdata path.
    if (api->Init(NULL, "eng")) 
    {
        api->End();

        return 1;
    }

    // Open input image with leptonica library
    Pix *image;
    if((image = pixRead("receipt.png")) == NULL)
    {
        api->End();

        return 2;
    }
    
    api->SetImage(image);
    
    if(api->Recognize(NULL) < 0)
    {
        api->End();
        pixDestroy(&image);

        return 3;
    }

    int confidence = api->MeanTextConf();
    printf("Confidence: %d\n", confidence);
    if(confidence < 80)
        printf("Confidence is low!\n");

    tesseract::ResultIterator *it = api->GetIterator();

    do 
    {
        printf("=================\n");
        if(it->Empty(tesseract::RIL_PARA))
            continue;

        char *para_text = it->GetUTF8Text(tesseract::RIL_PARA);
        printf("%s", para_text);
        delete[] para_text;
    } while (it->Next(tesseract::RIL_PARA));
  
    delete it;

    api->End();
    pixDestroy(&image);

    return 0;
}

Applying this routine to an invoice (randomly found with Google), it is far easier to identify the address, tax, total, etc.. then with the previous method (which was naive about layout):

Confidence: 89
=================
Invoice |NV0010 '
To
Jackie Kensington
18 High St
Wickham
 Electrical
Sevices Limited

=================
Certificate Number CER/123-34 From
  1”” E'e°‘''°‘'’‘'Se”'°‘*’
17 Harold Grove, Woodhouse, Leeds,

=================
West Yorkshire, LS6 2EZ

=================
Email: info@ mj-e|ectrcia|.co.uk

=================
Tel: 441132816781

=================
Due Date : 17th Mar 2012

=================
Invoice Date : 16th Feb 2012

=================
Item Quantity Unit Price Line Price
Electrical Labour 4 £33.00 £132.00

=================
Installation carried out on flat 18. Installed 3 new
brass effect plug fittings. Checked fuses.
Reconnected light switch and fitted dimmer in living
room. 4 hours on site at £33 per hour.

=================
Volex 13A 2G DP Sw Skt Wht Ins BB Round Edge Brass Effect 3 £15.57 £46.71
Volex 4G 1W 250W Dimmer Brushed Brass Round Edge 1 £32.00 £32.00
Subtotal £210.71

=================
VAT £42.14

=================
Total £252.85

=================
Thank you for your business — Please make all cheques payable to ‘Company Name’. For bank transfers ‘HSBC’, sort code
00-00-00, Account no. 01010101.
MJ Electrical Services, Registered in England & Wales, VAT number 9584 158 35
 Q '|'.~..a::

OCR a Document with C++

There is an OCR library developed by HP and maintained by Google called Tesseract. It works immediately, and does not require training.

Building it is trivial. What’s more trivial is just installing it from packages:

$ sudo apt-get install libtesseract3 libtesseract-dev
$ sudo apt-get install liblept3 libleptonica-dev
$ sudo apt-get install tesseract-ocr-eng

Note that this installs the data for recognizing English.

Now, go and get the example code from the Google Code wiki for the project, and paste it into a file called ocr-test.cpp . Also, right-click and save the example document image (a random image I found with Google). You don’t have to use this particular document, as long as what is used is sufficiently clear at a high-enough resolution (the example is about 1500×2000).

Now, change the location of the file referred-to by the example code:

Pix *image = pixRead("letter.jpg");

Compile/link it:

$ g++ -o ocr-test ocr-test.cpp -ltesseract -llept

Run the example:

./ocr-test

You’re done. The following will be displayed:

OCR output:
fie’  1?/2440
Brussels,
BARROSO (2012) 1300171
BARROSO (2012)

Dear Lord Tugendhat.

Thank you for your letter of 29 October and for inviting the European Commission to

contribute in the context of the Economic Aflairs Committee's inquiry into "The

Economic lmplicationsfirr the United Kingdom of Scottish Independence ".

The Committee will understand that it is not the role of the European Commission to

express a position on questions of internal organisation related to the constitutional

arrangements of a particular Member State.

Whilst refraining from comment on possible fitture scenarios. the European Commission

has expressed its views in general in response to several parliamentary questions from

Members of the European Parliament. In these replies the European Commission has 
noted that scenarios such as the separation of one part of a Member State or the creation 
of a new state would not be neutral as regards the EU Treaties. The European 
Commission would express its opinion on the legal consequences under EU law upon ;
requestfiom a Member State detailing a precise scenario. :
The EU is founded on the Treaties which apply only to the Member States who have 3
agreed and ratified them. if part of the territory of a Member State would cease to be ,
part of that state because it were to become a new independent state, the Treaties would

no longer apply to that territory. In other words, a new independent state would, by the
fact of its independence, become a third country with respect to the E U and the Treaties

would no longer apply on its territory. ‘

../.

The Lord TUGENDHAT
Acting Chairman

House of Lords q
Committee Oflice
E-mail: economicaflairs@par1igment.ttk