Using the “Tig” Git Console UI

At its simplest, Tig lets you navigate your Git projects from the console (internally, it invokes git commands). It has nearly all of the browsing functionality of GitHub while running entirely locally. At its most complicated, it looks to be as flexible as Git itself.

The two simplest ways to run Tig (from within our Git project):

  • Piping: git log | tig
  • Calling directly: tig

In the case of piping, the main benefit is colorized output pushed through a pager. If you call Tig directly, the experience is more interactive. The default “view” is the log.

You can also specify other views:

$ tig -h
tig 1.2.1 (Nov 29 2013)

Usage: tig        [options] [revs] [--] [paths]
   or: tig log    [options] [revs] [--] [paths]
   or: tig show   [options] [revs] [--] [paths]
   or: tig blame  [options] [rev] [--] path
   or: tig stash
   or: tig status
   or: tig <      [git command output]

Options:
  +<number>       Select line <number> in the first view
  -v, --version   Show version and exit
  -h, --help      Show help message and exit

An example of the commit browser. I’ve clicked on a commit to show its diffs:

Git "Tig" Commit Browser

An example of blaming:

Git “Tig” Blame Browser

For more information:

Screenshots
Manual

Using etcd as a Highly Available and Innovative Key-Value Store

etcd was created as the primary building block of CoreOS. It uses the Raft algorithm to keep changes consistent throughout a cluster by electing a leader and distributing a log of operations (“commands”) from the leader to the other systems. These features, among others, allow etcd to be used for robust service discovery and cluster configuration, replacing ZooKeeper. Entries are referred to as “nodes”.


Distributed Locks

Every update automatically increments the “index”, which is a global, monotonically-increasing value:

c.set('/a/b/c', 5).index
# 66
c.set('/a/b/c', 5).index
# 67

The index increases for every operation, not just those with side-effects. Per the mailing list (2013-11-29), the reason for this is:

That’s a side effect of how Raft works. When new commands come in they get sent to Raft immediately which increments the index. We’re not able to check the current value of the key before insert because Raft batches commands so there may be uncommitted changes between the current state and the state at the time when the command is being committed. That’s also why changes that cause errors can increment the index even though no change was made.

etcd also gives us a “CAS” (“compare-and-swap”) call (“test_and_set” in the Python client). This allows us to assign a value to a key, but only when the existing state meets one or more conditions (an example follows the list):

  1. The existing value is set to something specific (a “previous value” condition).
  2. The existing index is set to something specific (a “previous index” condition).
  3. The key either currently exists or doesn’t (a “previously exists” condition).
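
For example, here’s a minimal sketch of a CAS call via python-etcd, assuming the test_and_set(key, value, prev_value) signature (per the note below, only the “prevValue” condition is exposed by the client):

from etcd import Client

c = Client(host='127.0.0.1')
c.set('/config/mode', 'idle')

# Succeeds: the existing value matches the stated previous value.
c.test_and_set('/config/mode', 'busy', prev_value='idle')

try:
    # Fails: the value is now 'busy', so the "prevValue" condition fails.
    c.test_and_set('/config/mode', 'busy', prev_value='idle')
except ValueError:
    print('CAS condition failed, as expected')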

The existence of a monotonic, atomic counter and a CAS function happens to be exactly the set of dependencies required to establish distributed locking. The process might be the following (a Python sketch follows the list):

  1. Initialize a node for the specific lock (“lock node”). Use CAS with a “prevExists” of “false” and a value of “0”.
  2. Assign some value to some dummy key used for the purpose of incrementing and grabbing the index. This index will be used as a unique ID for the current thread/instance (“instance ID”).
  3. Do a CAS on the lock node with a “prevValue” of “0”, a value of the instance-ID, and a TTL of whatever maximum lock time we should allow.
    • If error, watch the lock node. Give the HTTP client a timeout. Try again after long-polling returns or timeout hits.
    • If no error, do whatever logic is required and, to release, use a CAS to set the lock node to “0” with a “prevValue” of the instance ID. If this fails (ValueError), the lock timed out and was re-owned by another instance.
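
Below is a minimal sketch of this recipe using python-etcd. The test_and_set(key, value, prev_value, ttl=None) and watch(key, timeout=None) signatures are assumptions based on the client at the time of writing and, since the client doesn’t expose a “prevExists” condition (see the note below), step (1) falls back to a plain set:

from etcd import Client

c = Client(host='127.0.0.1')

LOCK_NODE = '/locks/demo'
COUNTER_NODE = '/locks/demo_counter'
LOCK_TTL_S = 30

def acquire():
    # (2) Bump a dummy key; the returned index doubles as a unique
    # instance ID.
    instance_id = str(c.set(COUNTER_NODE, 'x').index)

    while True:
        try:
            # (3) Take the lock, but only if it's currently released ("0").
            c.test_and_set(LOCK_NODE, instance_id, prev_value='0',
                           ttl=LOCK_TTL_S)
            return instance_id
        except KeyError:
            # (1) The lock node doesn't exist yet; create it released.
            # Losing a race here just costs one extra retry.
            c.set(LOCK_NODE, '0')
        except ValueError:
            # Held by another instance: long-poll until it changes, then
            # retry. A client-side timeout may surface as an exception,
            # which also just leads to a retry.
            try:
                c.watch(LOCK_NODE, timeout=LOCK_TTL_S)
            except Exception:
                pass

def release(instance_id):
    try:
        # Release only if we still own the lock.
        c.test_and_set(LOCK_NODE, '0', prev_value=instance_id)
    except ValueError:
        # The TTL expired and the lock was re-owned by another instance.
        pass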

It’s important to mention that the “test_and_set” operation in the Python client currently supports only the “prevValue” condition. With that condition, you’ll get a KeyError if the key doesn’t exist, and a ValueError if the actual existing value doesn’t match the stated one (the standard outcome to check for when using this call).


Additional Features

Aside from its strong consistency and easy access to operations via REST, there are two non-traditional operations that you’ll see with etcd but not with [most] other KV solutions:

  1. Entries can be stored in a hierarchy
  2. Long-polling to wait on a change to a key or folder (“watch”)

With (2), you can monitor a key that doesn’t yet exist, or even a folder (in which case, it’ll block until any value inside the folder changes, recursively). You can use this to achieve event-driven scripts (a neat usage mentioned on the mailing list).
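
As a tiny sketch of that event-driven pattern (using the same python-etcd client shown in the Quick Start below):

from etcd import Client

c = Client(host='127.0.0.1')

# Block until anything under /events changes (recursively), then react.
while True:
    r = c.watch('/events')
    print('%s changed to %s' % (r.key, r.value))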

Lastly, before moving on to the example, the cluster should be kept small:

Every command the client sends to the master is broadcast to all of the 
followers. The command is not committed until the majority of the cluster peers 
receive that command.

Because of this majority voting property, the ideal cluster should be kept 
small to keep speed up and be made up of an odd number of peers.

(from the etcd documentation: “what size cluster should I use”)

etcd fills a role similar to Google’s Chubby (which uses Paxos rather than Raft).


Quick Start

For this example, we’re going to establish and interact with etcd using three different terminals on the same system. etcd requires Go 1.1+. You’ll probably have to build it yourself (a Git clone and a build), as it’s not yet available via many package managers (Ubuntu, specifically).

Run etcd:

$ etcd
[etcd] Nov 28 13:02:20.849 INFO      | Wrote node configuration to 'info'
[etcd] Nov 28 13:02:20.849 INFO      | etcd server [name default-name, listen on 127.0.0.1:4001, advertised url http://127.0.0.1:4001]
[etcd] Nov 28 13:02:20.850 INFO      | raft server [name default-name, listen on 127.0.0.1:7001, advertised url http://127.0.0.1:7001]

Creating a cluster is as easy as launching additional instances of the daemon on new hosts. Now, install Python’s python-etcd:

sudo pip install python-etcd

Connect the client:

from etcd import Client
c = Client(host='127.0.0.1')

Set a value (notice that we have to specify a folder, even if it’s only the root):

c.set('/test_entry', 'abc')

EtcdResult(action=u'SET', index=9, key=u'/test_entry', prevValue=None, value=u'abc', expiration=None, ttl=None, newKey=True)
# Attributes available on EtcdResult: action, count, expiration, index, key, newKey, prevValue, ttl, value

Get the value:

r = c.get('/test_entry')
print(r.value)
# Prints "abc"

In a second terminal, connect the client and run the following to block for a change to the given folder (it doesn’t currently exist):

r = c.watch('/test_folder')

Back in the first terminal, run:

c.set('/test_folder/test_inner_folder/deep_test', 'abc')

The command waiting in the second terminal has now returned. Examine “r”:

print(r)
EtcdResult(action=u'SET', index=15, key=u'/test_folder/test_inner_folder/deep_test', prevValue=None, value=u'abc', expiration=None, ttl=None, newKey=True)

Get a listing of children. This may or may not work on “/”, depending on your python-etcd version:

from pprint import pprint
c.set('/test_folder/entry_1', 'test_value_1')
c.set('/test_folder/entry_2', 'test_value_2')
list_ = c.get('/test_folder')
pprint(list_)
#[EtcdResult(action=u'GET', index=4, key=u'/test_folder/entry_1', prevValue=None, value=u'test_value_1', expiration=None, ttl=None, newKey=None),
# EtcdResult(action=u'GET', index=4, key=u'/test_folder/entry_2', prevValue=None, value=u'test_value_2', expiration=None, ttl=None, newKey=None)]

etcd also allows for TTLs (in seconds) on “put” operations:

from time import sleep
c.set('/disappearing_entry', 'inconsequential_value', ttl=5)
sleep(5)
c.get('/disappearing_entry')

You’ll get the following error (a proper KeyError):

Traceback (most recent call last):
  File "", line 1, in 
  File "/Library/Python/2.7/site-packages/etcd/client.py", line 284, in get
    response = self.api_execute(self.key_endpoint + key, self._MGET)
  File "/Library/Python/2.7/site-packages/etcd/client.py", line 357, in api_execute
    raise error_exception(message)
KeyError: u'Key Not Found : get: /disappearing_entry'

Miscellaneous functions:

c.machines
# ['http://127.0.0.1:4001']
c.leader
# 'http://127.0.0.1:7001'

As a final note, you don’t have to choose between raw cURL requests and the Python API. There’s also etcdctl for command-line control:

$ etcdctl set /foo/bar "Hello world"
Hello world


FAQ

A leader is chosen via election. However, there’s a chance that no leader will be elected, and the election will have to be reattempted. From the mailing list (2013-11-29):

Q: What would cause a leader candidate to not receive a majority of votes from nodes, during elections?
A: The common case election failure would be due to either a network partition causing less than a quorum to vote, or another candidate being elected first.

Q: Is there any decision-making involved during elections, such as the consideration of the CPU utilizations of individual machines?
A: Not at this time. It might make sense to add some sort of fitness to the leader proposal decision later.

Using Bitly’s NSQ Job Queue

I’ve recently been impressed by Bitly’s NSQ server, written in Go. Aside from Go capturing my attention, the parts that most interested me were that 1) it’s claimed to achieve 90,000 messages/second (which is decent), and 2) it’s relatively easy to set up and is self-managing.

The topology for NSQ is straightforward: N queue servers (nsqd), 0+ lookup servers (nsqlookupd), and an optional admin (dashboard) server (nsqadmin). The lookup servers are optional, but they allow auto-discovery of which hosts are managing which topics. Bitly recommends running a cluster of three in production. To start multiple instances, just launch them on their respective hosts. You’ll have to pass a list of nsqlookupd hosts to the consumer client, and a list of nsqd hosts to the producer client.

The message pipeline is intuitive: messages are pushed along with topics/classifiers, and consumers listen for topics and channels. A channel is a named grouping of consumers that work on similar tasks, where the “channel” is presented as a string to the consumer instance. NSQ uses the concepts of topics and channels to drive multicast and distributed delivery.
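
To make the channel semantics concrete, here’s a sketch using the pynsq client covered below. Two Readers subscribe to the same (hypothetical) topic on different channels, so each channel receives its own copy of every message, while consumers sharing a channel would split the stream between them:

import nsq

def archive_handler(message):
    # Channel 'archive' receives every message on the topic.
    print 'archive: %s' % (message.body,)
    return True

def mail_handler(message):
    # Channel 'mail' receives its own copy of every message.
    print 'mail: %s' % (message.body,)
    return True

archive_reader = nsq.Reader(message_handler=archive_handler,
        lookupd_http_addresses=['http://127.0.0.1:4161'],
        topic='events', channel='archive')
mail_reader = nsq.Reader(message_handler=mail_handler,
        lookupd_http_addresses=['http://127.0.0.1:4161'],
        topic='events', channel='mail')
nsq.run()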

As far as optimization goes, there are about three dozen parameters for nsqd, but you need not concern yourself with most of them, here.

This example resembles the one from the NSQ website, plus some additional info. All four processes can be run from the same system.

Quick Start

Get and build the primary components. $GOPATH needs to be set either to your Go workspace (mine is ~/.go, below) or to an empty directory that will be used for it. $GOPATH/bin needs to be in your PATH.

go get github.com/kr/godep

godep get github.com/bitly/nsq/nsqd
godep get github.com/bitly/nsq/nsqlookupd
godep get github.com/bitly/nsq/nsqadmin

To start, run each of the following services in a different terminal on the same system.

A lookup server instance:

nsqlookupd

A queue instance:

nsqd --lookupd-tcp-address=127.0.0.1:4160

An admin server instance:

nsqadmin --template-dir=~/.go/src/github.com/bitly/nsq/nsqadmin/templates --lookupd-http-address=127.0.0.1:4161

To push test-items:

curl -d 'hello world 1' 'http://127.0.0.1:4151/put?topic=test'
curl -d 'hello world 2' 'http://127.0.0.1:4151/put?topic=test'
curl -d 'hello world 3' 'http://127.0.0.1:4151/put?topic=test'

The “apps” aren’t built by default, apparently. We’ll need them so we can get a message dumper for testing:

~/.go/src/github.com/bitly/nsq$ make
cd build/apps

To dump data that’s already waiting in the queues:

./nsq_to_file --topic=test --output-dir=/tmp --lookupd-http-address=127.0.0.1:4161

Display queue data:

cat /tmp/test.*.log
hello world 1
hello world 2
hello world 3

Python Library

Matt Reiferson wrote pynsq, a Python client that employs Tornado for its message loops. The gotcha is that the consumers -and- producers both require you to use IOLoop, Tornado’s message loop. This is because pynsq not only allows you to define a “receive” callback, but a post-send callback as well. Though you don’t have to define one, there is an obscure but real chance that a send will fail, per Matt, and this should always be checked for.

Because of this design, you should be prepared to put all of your core loop logic into the Tornado loop.

To install the client:

sudo pip install pynsq tornado

A producer example from the “pynsq” website:

import nsq
import tornado.ioloop
import time

def pub_message():
    writer.pub('test', time.strftime('%H:%M:%S'), finish_pub)

def finish_pub(conn, data):
    print data

writer = nsq.Writer(['127.0.0.1:4150'])
tornado.ioloop.PeriodicCallback(pub_message, 1000).start()
nsq.run()

An asynchronous consumer example from the “pynsq” website (note that it doesn’t correspond to the producer example; it reads a different topic):

import nsq

buf = []

def process_message(message):
    global buf
    message.enable_async()
    # cache the message for later processing
    buf.append(message)
    if len(buf) >= 3:
        for msg in buf:
            print msg
            msg.finish()
        buf = []
    else:
        print 'deferring processing'

r = nsq.Reader(message_handler=process_message,
        lookupd_http_addresses=['http://127.0.0.1:4161'],
        topic='nsq_reader', channel='async', max_in_flight=9)
nsq.run()

Give it a try.

FAQ

(Courtesy of a dialogue with Matt Reiferson)

Q: Most job-queues allow you to send messages without imposing a loop. Is the 
   IOLoop required for both receiving -and- sending in pynsq?
A: Yes. pynsq supports the notion of completion-callbacks to signal when a send 
   finishes. Even if you don't use it, it's accounted-for in the mechanics. If 
   you want to send synchronous messages without the loop, hit the HTTP 
   endpoint. However, facilitating both the receive and send IOLoops allows for 
   the fastest possible dialogue, especially when the writers and readers are 
   paired to the same hosts.

Q: An IOLoop is even required for asynchronous sends?
A: Yes. If you want to simply send one-off asynchronous messages, 
   consider opening a worker process that manages delivery. It can apply its 
   own callback to catch failures, and transmit successes, failures, etc.. to 
   an IPC queue (if you need this info).

Q: Are there any delivery guarantees (like in ZeroMQ)?
A: No. It's considered good-practice by the NSQ guys to always check the 
   results of message-sends in any situation (in any kind of messaging, in 
   general). You'd do this from the callbacks, with pynsq.

    The reasons that a send would fail are the following:

    1: The topic name is not formatted correctly (to character/length 
       restrictions). There is no official documentation of this, however.
    2: The message is too large (this can be set via a parameter to nsqd).
    3: There is a breakdown related to a race-condition with a publish and a 
       delete happening on a specific topic. This is rare.
    4: Client connection-related failures.

Q: In scenario (3) of the potential reasons for a send-failure, can I mitigate 
   the publish/delete phenomenon if I am either not deleting topics or have 
   orchestrated deletions such that writes eliciting topic creations will never 
   be done until a sufficient amount of time has elapsed since a deletion?
A: Largely. Though, if nowhere else, this can also happen internally to NSQ at 
   shutdown.

Q: How are new topics announced to the cluster?
A: The first writer or reader request for a topic will be applied on the 
   upstream nsqd host, and will then propagate to the nsqlookupd hosts. They will 
   eventually spread to the other readers from there. The same thing applies to 
   a new topic, as well as a previously-deleted one.
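
As a closing sketch of the advice above about always checking send results, the check belongs in the producer’s completion callback. How failures are surfaced to the callback is an assumption here (consult your pynsq version’s documentation):

import nsq
import tornado.ioloop
import time

def pub_message():
    writer.pub('test', time.strftime('%H:%M:%S'), finish_pub)

def finish_pub(conn, data):
    # Assumption: failures arrive as Exception instances.
    if isinstance(data, Exception):
        print 'Send failed: %r' % (data,)
    else:
        print 'Send OK: %r' % (data,)

writer = nsq.Writer(['127.0.0.1:4150'])
tornado.ioloop.PeriodicCallback(pub_message, 1000).start()
nsq.run()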

Manager Namespaces for IPC Between Python Process Pools

Arguably, one of the functionalities that best represents why Python has so many multidisciplinary uses is its multiprocessing library. It allows Python to maintain pools of processes, and to communicate between them, with most of the simplicity of a standard multithreaded application (asynchronously invoking a function, locking, and IPC). This is not to say that Python can’t do threads, too. But, between being able to quickly run map/reduce operations or asynchronous tasks with a very simple set of functions, and the disadvantage of having to consider the GIL in multithreaded development, I believe the multiprocess design to be more popular by a landslide.

There are mountains of examples for how to use multiprocessing, along with sufficient documentation for most of the IPC mechanisms that can be used to communicate between processes: queues, pipes, “manager”-based shares and proxy objects, shared ctypes types, multiprocessing-based “client” and “listener” sockets, etc..

There is a very subtle IPC mechanism called a “namespace” (which is actually part of Manager), whose presence in the documentation amounts to only a couple of lines out of the thousands that are there. It’s easy, and worth special mention.

from multiprocessing import Pool, Manager
from os import getpid
from time import sleep

def _worker(ns):
    pid = getpid()
    print("%d: Worker started." % (pid))

    while ns.is_running is True:
        sleep(1)

    print("%d: Worker terminating." % (pid))

m = Manager()
ns = m.Namespace()
ns.is_running = True

num_workers = 5
p = Pool(num_workers)

for i in xrange(num_workers):
    p.apply_async(_worker, (ns,))

sleep(10)
print("Shutting down.")

ns.is_running = False
p.close()
p.join()

print("All workers joined.")

The output:

52893: Worker started.
52894: Worker started.
52895: Worker started.
52896: Worker started.
52897: Worker started.
Shutting down.
52894: Worker terminating.
52893: Worker terminating.
52895: Worker terminating.
52896: Worker terminating.
52897: Worker terminating.
All workers joined.

A namespace is very much like a bulletin board, where attributes can be assigned by one process and read by others. This works for immutable values like strings and primitive types. Mutating a stored value in place, however, only changes a local copy, so the update can’t be tracked properly:

from multiprocessing import Manager

m = Manager()
ns = m.Namespace()

ns.test_value = 'original value'
ns.test_list = [5]

print("test_value (master, original): %s" % (ns.test_value))
print("test_list (master, original): %s" % (ns.test_list))

ns.test_value = 'new value'
ns.test_list.append(10)

print("test_value (master, updated): %s" % (ns.test_value))
print("test_list (master, updated): %s" % (ns.test_list))

Output:

test_value (master, original): original value
test_list (master, original): [5]
test_value (master, updated): new value
test_list (master, updated): [5]
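
The usual workaround is to rebind the attribute rather than mutate it in place, so the updated value is pushed back through the manager proxy:

updated = ns.test_list
updated.append(10)
ns.test_list = updated

print("test_list (master, rebound): %s" % (ns.test_list))
# test_list (master, rebound): [5, 10]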

Even with that caveat, namespaces are a terrific mechanism for sharing counters and flags between processes.

New Tesseract OCR C Library

Tesseract is a terrific, (optionally) trainable OCR library currently maintained by Google. However, the only currently-sufficient way to use it from Python is via python-tesseract (a third-party library), and it has two flaws.

The first flaw is that python-tesseract is based on SWIG, which introduces a lot more code. The second is that its functions aren’t fully equivalent to Tesseract’s. For example, Tesseract will let you iterate through a document by “level” (word, line, paragraph, block, etc..), and allow you to incorporate its layout analysis into your application. This is useful if you need to extract parts of a document based on proximity (or, possibly, location). However, python-tesseract does not currently let you iterate through parts of the document: GetIterator() does not accept a level argument.

So, as a first step to producing a leaner and more analogous Python library, I just released CTesseract: a C-based adapter shared-library that connects to the C++ Tesseract shared-library.

OCR a Document by Section

A previous post described how to use Tesseract to OCR a document to a single string. This post will describe how to take advantage of Tesseract’s internal layout processing to iterate through the document’s sections (as determined by Tesseract).

This is the core logic to iterate through the document. Each section is referred to as a “paragraph”:

if(api->Recognize(NULL) < 0)
{
    api->End();
    pixDestroy(&image);

    return 3;
}

tesseract::ResultIterator *it = api->GetIterator();

do 
{
    if(it->Empty(tesseract::RIL_PARA))
        continue;

    char *para_text = it->GetUTF8Text(tesseract::RIL_PARA);

    // para_text is the recognized text. It [usually] has a
    // newline on the end.

    delete[] para_text;
} while (it->Next(tesseract::RIL_PARA));

delete it;

To add some validation to the recognized content, we’ll also check the recognition confidence. In my experience, anything scoring less than 80% turned out as gibberish. In that case, you’ll have to do some preprocessing of the image to reduce the risk.

int confidence = api->MeanTextConf();
printf("Confidence: %d\n", confidence);
if(confidence < 80)
    printf("Confidence is low!\n");

The whole program:

#include <stdio.h>

#include <tesseract/baseapi.h>
#include <tesseract/resultiterator.h>

#include <leptonica/allheaders.h>

int main()
{
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

    // Initialize tesseract-ocr with English, without specifying tessdata path.
    if (api->Init(NULL, "eng")) 
    {
        api->End();

        return 1;
    }

    // Open input image with leptonica library
    Pix *image;
    if((image = pixRead("receipt.png")) == NULL)
    {
        api->End();

        return 2;
    }
    
    api->SetImage(image);
    
    if(api->Recognize(NULL) < 0)
    {
        api->End();
        pixDestroy(&image);

        return 3;
    }

    int confidence = api->MeanTextConf();
    printf("Confidence: %d\n", confidence);
    if(confidence < 80)
        printf("Confidence is low!\n");

    tesseract::ResultIterator *it = api->GetIterator();

    do 
    {
        printf("=================\n");
        if(it->Empty(tesseract::RIL_PARA))
            continue;

        char *para_text = it->GetUTF8Text(tesseract::RIL_PARA);
        printf("%s", para_text);
        delete[] para_text;
    } while (it->Next(tesseract::RIL_PARA));
  
    delete it;

    api->End();
    pixDestroy(&image);

    return 0;
}

Applying this routine to an invoice (randomly found with Google), it is far easier to identify the address, tax, total, etc.. than with the previous method (which was naive about layout):

Confidence: 89
=================
Invoice |NV0010 '
To
Jackie Kensington
18 High St
Wickham
 Electrical
Sevices Limited

=================
Certificate Number CER/123-34 From
  1”” E'e°‘''°‘'’‘'Se”'°‘*’
17 Harold Grove, Woodhouse, Leeds,

=================
West Yorkshire, LS6 2EZ

=================
Email: info@ mj-e|ectrcia|.co.uk

=================
Tel: 441132816781

=================
Due Date : 17th Mar 2012

=================
Invoice Date : 16th Feb 2012

=================
Item Quantity Unit Price Line Price
Electrical Labour 4 £33.00 £132.00

=================
Installation carried out on flat 18. Installed 3 new
brass effect plug fittings. Checked fuses.
Reconnected light switch and fitted dimmer in living
room. 4 hours on site at £33 per hour.

=================
Volex 13A 2G DP Sw Skt Wht Ins BB Round Edge Brass Effect 3 £15.57 £46.71
Volex 4G 1W 250W Dimmer Brushed Brass Round Edge 1 £32.00 £32.00
Subtotal £210.71

=================
VAT £42.14

=================
Total £252.85

=================
Thank you for your business — Please make all cheques payable to ‘Company Name’. For bank transfers ‘HSBC’, sort code
00-00-00, Account no. 01010101.
MJ Electrical Services, Registered in England & Wales, VAT number 9584 158 35
 Q '|'.~..a::

OCR a Document with C++

There is an OCR library developed by HP and maintained by Google called Tesseract. It works immediately, and does not require training.

Building it is trivial. Even more trivial is just installing it from packages:

$ sudo apt-get install libtesseract3 libtesseract-dev
$ sudo apt-get install liblept3 libleptonica-dev
$ sudo apt-get install tesseract-ocr-eng

Note that this installs the data for recognizing English.

Now, go and get the example code from the project’s Google Code wiki, and paste it into a file called ocr-test.cpp. Also, save the example document image (a random image I found with Google). You don’t have to use this particular document, as long as what you use is sufficiently clear and at a high enough resolution (the example is about 1500×2000).

Now, change the location of the file referred to by the example code:

Pix *image = pixRead("letter.jpg");

Compile/link it:

$ g++ -o ocr-test ocr-test.cpp -ltesseract -llept

Run the example:

./ocr-test

You’re done. The following will be displayed:

OCR output:
fie’  1?/2440
Brussels,
BARROSO (2012) 1300171
BARROSO (2012)

Dear Lord Tugendhat.

Thank you for your letter of 29 October and for inviting the European Commission to

contribute in the context of the Economic Aflairs Committee's inquiry into "The

Economic lmplicationsfirr the United Kingdom of Scottish Independence ".

The Committee will understand that it is not the role of the European Commission to

express a position on questions of internal organisation related to the constitutional

arrangements of a particular Member State.

Whilst refraining from comment on possible fitture scenarios. the European Commission

has expressed its views in general in response to several parliamentary questions from

Members of the European Parliament. In these replies the European Commission has 
noted that scenarios such as the separation of one part of a Member State or the creation 
of a new state would not be neutral as regards the EU Treaties. The European 
Commission would express its opinion on the legal consequences under EU law upon ;
requestfiom a Member State detailing a precise scenario. :
The EU is founded on the Treaties which apply only to the Member States who have 3
agreed and ratified them. if part of the territory of a Member State would cease to be ,
part of that state because it were to become a new independent state, the Treaties would

no longer apply to that territory. In other words, a new independent state would, by the
fact of its independence, become a third country with respect to the E U and the Treaties

would no longer apply on its territory. ‘

../.

The Lord TUGENDHAT
Acting Chairman

House of Lords q
Committee Oflice
E-mail: economicaflairs@par1igment.ttk