How to use Python’s “diff” functionality

Python's standard library (the difflib module) can compare two documents and produce a set of patch instructions to get from one to the other. This is useful to 1) get quick insight into the differences between the two (if any), or 2) send a list of changes from which a later version of a document can be derived, which is typically much lighter than sending the whole updated document when the receiving end already has the original.

This is an example of the few lines of code required to generate a list of those instructions, and how to apply them to derive the updated document.

Our documents, for the purpose of this example:

original = """
line1
line2
line3
line4
line5
"""

updated = """
line3
line4
line5
line6
line7
"""

If all you want is to get a list of adds and removes, use ndiff:

from difflib import ndiff

def get_updates(original, updated):
    """Return a 2-tuple of (adds, removes) describing the changes to get from
    ORIGINAL to UPDATED.
    """

    diff = ndiff(original.split("\n"), updated.split("\n"))

    adds = set()
    deletes = set()
    for row in diff:
        diff_type = row[0]
        if diff_type == ' ':
            continue

        entry = row[2:]

        if diff_type == '+':
            adds.add(entry)
        elif diff_type == '-':
            deletes.add(entry)

    return (list(adds), list(deletes))

Run:

updates = get_updates(original, updated)

updates contains a 2-tuple of adds and removes, respectively (the order within each list is arbitrary, because the lists are built from sets):

(['line7', 'line6'], ['line2', 'line1'])
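
To see what get_updates() is parsing, it can help to look at the raw rows that ndiff yields. The following is a small sketch with made-up input lists (not the documents above). Each row starts with a two-character prefix, which is why the function reads row[0] and slices from row[2:].

```python
from difflib import ndiff

# Hypothetical inputs, shorter than the documents above.
before = ["line1", "line2", "line3"]
after = ["line2", "line3", "line4"]

rows = list(ndiff(before, after))
for row in rows:
    # row[0] is the change marker, row[1] is a space, row[2:] is the line.
    print(repr(row))
```

This prints '- line1', '  line2', '  line3' and '+ line4'. When two lines are similar but not identical, ndiff can also yield hint rows that start with '?'. get_updates() ignores those, since it only collects '+' and '-' rows.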

If, on the other hand, you do actually need a full set of patch instructions, use SequenceMatcher:

from difflib import SequenceMatcher

def get_transforms(original, updated):
    """Get a list of patch instructions to get from ORIGINAL to UPDATED."""

    # Strings are compared character by character, so the offsets in the
    # opcodes below are character positions, not line numbers.
    s = SequenceMatcher(None, original, updated)

    transforms = []
    for tag, i1, i2, j1, j2 in s.get_opcodes():
        if tag == 'delete':
            transform = ('-', (i1, i2))
        elif tag == 'insert':
            transform = ('+', (i1, i2), updated[j1:j2])
        elif tag == 'replace':
            transform = ('>', (i1, i2), updated[j1:j2])
        else:
            transform = ('=', (i1, i2), (j1, j2))

        transforms.append(transform)

    return transforms

def apply_transforms(original, transforms):
    """Execute the transform instructions returned from get_transforms() to
    derive UPDATED from ORIGINAL.
    """

    updated = []
    for transform in transforms:
        if transform[0] == '-':
            pass
        elif transform[0] == '+':
            updated.append(transform[2])
        elif transform[0] == '>':
            updated.append(transform[2])
        else: # Equals.
            updated.append(original[transform[1][0]:transform[1][1]])

    return ''.join(updated)

Run:

transforms = get_transforms(original, updated)

transforms contains:

[('-', (0, 12)), ('=', (12, 31), (0, 19)), ('+', (31, 31), 'line6\nline7\n')]

To derive updated from original:

updated_derived = apply_transforms(original, transforms)
print(updated == updated_derived)

Which displays:

True
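
Since the point of a patch is usually to send it somewhere, the instructions need to be serialized at some point. The following is a minimal sketch, not part of the functions above: make_patch and apply_patch are hypothetical names, and the instruction format is simplified so that it survives a JSON round-trip (JSON has no tuples, so each instruction is a plain list, and deletes are simply omitted).

```python
import json
from difflib import SequenceMatcher


def make_patch(original, updated):
    """Build a JSON-friendly list of instructions (hypothetical format)."""
    patch = []
    matcher = SequenceMatcher(None, original, updated)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            # Reuse a slice that the receiver already has.
            patch.append(['=', i1, i2])
        elif tag in ('insert', 'replace'):
            # Ship only the new text.
            patch.append(['+', updated[j1:j2]])
        # 'delete' needs no instruction: the text is simply not copied.
    return patch


def apply_patch(original, patch):
    """Rebuild the updated text from the original plus the patch."""
    parts = []
    for instruction in patch:
        if instruction[0] == '=':
            parts.append(original[instruction[1]:instruction[2]])
        else:
            parts.append(instruction[1])
    return ''.join(parts)


old_text = "line1\nline2\nline3\n"
new_text = "line2\nline3\nline4\n"

wire = json.dumps(make_patch(old_text, new_text))  # what would be sent
rebuilt = apply_patch(old_text, json.loads(wire))
print(rebuilt == new_text)
```

Only the inserted text and a few offsets travel over the wire, which is where the savings over sending the whole document come from.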

Writing Your Own Timezone Implementation for Python

Python has the concept of “naive” and “aware” times. The former refers to a timezone-capable date/time object that hasn’t been assigned a timezone, and the latter refers to one that has.

However, in Python 2 the standard library only provides an interface for “tzinfo” implementations: classes that define a particular timezone. It does not provide the implementations themselves (Python 3.2 later added a fixed-offset datetime.timezone class, and Python 3.9 added the zoneinfo module). So, you either have to write your own implementations, or use something like the widely used “pytz” or “pytzpure” (a pure-Python version).
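
To make the naive/aware distinction concrete, here is a minimal sketch. FixedUtc and is_aware are hypothetical names used only for illustration, and the class is about the smallest tzinfo subclass that will do (zero offset, no DST).

```python
from datetime import datetime, timedelta, tzinfo


class FixedUtc(tzinfo):
    """Hypothetical, minimal tzinfo: a zero offset and no DST."""

    def utcoffset(self, dt):
        return timedelta(0)

    def dst(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return 'UTC'


def is_aware(dt):
    """A datetime is aware when its tzinfo yields a UTC offset."""
    return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None


naive = datetime(2013, 7, 1, 12, 0)
aware = naive.replace(tzinfo=FixedUtc())

print(is_aware(naive))  # False
print(is_aware(aware))  # True
```

Ordering comparisons such as naive < aware raise a TypeError, which is usually the first sign that the two kinds have been mixed.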

This is a quick example of how to write your own, courtesy of Google:

from datetime import tzinfo, timedelta, datetime


class _TzBase(tzinfo):
    def utcoffset(self, dt):
        return timedelta(hours=self.get_offset()) + self.dst(dt)

    def _FirstSunday(self, dt):
        """First Sunday on or after dt."""
        return dt + timedelta(days=(6 - dt.weekday()))

    def dst(self, dt):
        # 2 am on the second Sunday in March
        dst_start = self._FirstSunday(datetime(dt.year, 3, 8, 2))
        # 1 am on the first Sunday in November
        dst_end = self._FirstSunday(datetime(dt.year, 11, 1, 1))

        if dst_start <= dt.replace(tzinfo=None) < dst_end:
            return timedelta(hours=1)
        else:
            return timedelta(hours=0)

    def tzname(self, dt):
        if self.dst(dt) == timedelta(hours=0):
            return self.get_tz_name()
        else:
            return self.get_tz_with_dst_name()

    def get_offset(self):
        """Returns the offset in hours (-5)."""
        
        raise NotImplementedError()

    def get_tz_name(self):
        """Returns the standard acronym (EST)."""
        
        raise NotImplementedError()
    
    def get_tz_with_dst_name(self):
        """Returns the DST version of the acronym ('EDT')."""        
        
        raise NotImplementedError()


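class TzGmt(_TzBase):
    """Implementation of the GMT timezone."""

    def dst(self, dt):
        # GMT never observes DST, so don't inherit the US rules from _TzBase
        # (they would shift the UTC offset to +1 hour in summer).
        return timedelta(0)

    def get_offset(self):
        return 0

    def get_tz_name(self):
        return 'GMT'
    
    def get_tz_with_dst_name(self):
        return 'GMT'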


class TzEst(_TzBase):
    """Implementation of the EST timezone."""

    def get_offset(self):
        return -5

    def get_tz_name(self):
        return 'EST'
    
    def get_tz_with_dst_name(self):
        return 'EDT'

Use it, like so:

from datetime import datetime

now_est = datetime.now().replace(tzinfo=TzEst())
now_gmt = now_est.astimezone(TzGmt())

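This produces a datetime object with an EST timezone, and then uses it to produce a GMT time. Note that replace() only attaches the timezone without converting anything, and datetime.now() returns the machine's local time, so the first line is only accurate when the machine's clock is already on US Eastern time.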
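
The date arithmetic behind dst() can be checked on its own. The following sketch copies the "first Sunday on or after" calculation into a standalone function (first_sunday_on_or_after is a hypothetical name) and evaluates the two boundaries for 2013.

```python
from datetime import datetime, timedelta


def first_sunday_on_or_after(dt):
    # weekday() is 0 for Monday and 6 for Sunday, so (6 - weekday) is the
    # number of days to add to reach Sunday (0 if dt is already a Sunday).
    return dt + timedelta(days=(6 - dt.weekday()))


# March 8 is the earliest possible date for the second Sunday in March.
dst_start = first_sunday_on_or_after(datetime(2013, 3, 8, 2))
# November 1 is the earliest possible date for the first Sunday in November.
dst_end = first_sunday_on_or_after(datetime(2013, 11, 1, 1))

print(dst_start)  # 2013-03-10 02:00:00
print(dst_end)    # 2013-11-03 01:00:00
```

Both dates fall on a Sunday and match the US daylight-saving dates for that year.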

AppEngine Development Environment Module Restrictions

AppEngine has some very tight but obvious restrictions on what types of Python modules can be invoked from application code. The general rule of thumb is that modules that need filesystem access or C code can’t be used. So, which modules are allowed or disallowed? Which modules are partially implemented, or defined and completely empty (yes, there are/were some)?

Unfortunately, the only official list of such modules is very dated.

Until not long ago, the prevailing perception was that AppEngine’s development environment imposed no such restrictions, leaving a dangerous gap between what will definitely run on your system and what you can be sure will run in production.

It turns out that there is some protection in the development environment, and maybe even complete protection.

The google/appengine/tools/devappserver2/python/sandbox.py module appears to be wholly responsible for the loading of modules. At the top, there’s a sys.meta_path assignment. This is what appears as of version 1.8.4:

  sys.meta_path = [
      StubModuleImportHook(),
      ModuleOverrideImportHook(_MODULE_OVERRIDE_POLICIES),
      BuiltinImportHook(),
      CModuleImportHook(enabled_library_regexes),
      path_override_hook,
      PyCryptoRandomImportHook,
      PathRestrictingImportHook(enabled_library_regexes)
      ]

This defines a series of module “finders” responsible for resolving imported modules. This is where restrictions are imposed. The following are descriptions/insights about each one.

StubModuleImportHook: Replaces complete modules with different ones.
ModuleOverrideImportHook: Adjusts partially white-listed modules (symbols may be added, removed, or updated).
BuiltinImportHook: Imposes a white-list on builtin modules. This raises an ImportError on everything else.
CModuleImportHook: Imposes a white-list on C modules.
path_override_hook: Has an instance of PathOverrideImportHook. It looks like this module looks for modules in special paths (the kind scattered in the.
PyCryptoRandomImportHook: Fixes the loading of Crypto.Random.OSRNG.new .
PathRestrictingImportHook: Makes sure any remaining imports come out of an accessible path.

If you want to know which specific modules are involved, look in the sandbox.py module mentioned above. The first four finders are relatively concrete: most of the modules they handle are expressed in lists.
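
For readers who haven't worked with sys.meta_path, the mechanism can be sketched in a few lines. This is not code from sandbox.py: BlockListFinder and the module name are hypothetical, and the sketch uses the find_spec protocol of Python 3, whereas a Python 2.7-era SDK would presumably rely on the older find_module protocol.

```python
import importlib
import importlib.abc
import sys


class BlockListFinder(importlib.abc.MetaPathFinder):
    """Hypothetical finder that refuses to import the listed modules."""

    def __init__(self, blocked_names):
        self._blocked = frozenset(blocked_names)

    def find_spec(self, fullname, path, target=None):
        if fullname in self._blocked:
            raise ImportError('%s is not allowed in this sandbox' % fullname)
        # Returning None hands the request to the next finder in the list.
        return None


finder = BlockListFinder(['hypothetical_blocked_module'])
# Finders are consulted in order, so the restriction goes first.
sys.meta_path.insert(0, finder)
try:
    importlib.import_module('hypothetical_blocked_module')
    outcome = 'imported'
except ImportError as error:
    outcome = str(error)
finally:
    sys.meta_path.remove(finder)

print(outcome)
```

Modules that are already in sys.modules are returned from that cache without consulting any finder, so a restriction like this only works if it is installed before the application code starts importing.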

A Pure-Python Implementation of “pytz”

There is a problem with the standard “pytz” package: It’s awesome, but it can’t be used on systems that don’t allow direct file access. I created “pytzpure” to account for this. It lets you export the timezone data files as Python modules. As long as these modules are put into the path, the “pytzpure” module will provide the same exports as the original “pytz” package.

For export:

PYTHONPATH=. python pytzpure/tools/tz_export.py /tmp/tzppdata

Output:

Verifying export path exists: /tmp/tzppdata
Verifying __init__.py .
Writing zone tree.
(578) timezones written.
Writing country timezones.
Writing country names.

To use:

from datetime import datetime
from pytzpure import timezone

utc = timezone('UTC')
detroit = timezone('America/Detroit')
now_detroit = datetime.utcnow().replace(tzinfo=utc).astimezone(detroit)
print(now_detroit.strftime('%H:%M:%S %z'))

Output (the time will vary):

16:34:37 -0400

Dumping Raw Python from Dictionary

I wrote a simple tool that renders a dictionary as Python source, with one assignment per top-level key. The output is very similar to JSON, except that nulls are rendered as Python's None.

Example usage:

get_as_python({ 'data1': { 'data22': { 'data33': 44 }},
                'data2': ['aa','bb','cc'],
                'data3': ('dd','ee','ff',None) })

Output (notice that the keys are not in their original order, because dicts are unordered before Python 3.7, and that the tuple is rendered as a list):

data1 = {"data22":{"data33":44}}
data3 = ["dd","ee","ff",None]
data2 = ["aa","bb","cc"]

https://raw.github.com/dsoprea/RandomUtility/master/get_as_python.py
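
For a rough idea of how such a renderer can work, here is a minimal sketch. It is not the code behind the link above: render_value and render_assignments are hypothetical names, and the string handling only escapes backslashes and double quotes.

```python
def render_value(value):
    """Render one value as a compact Python literal (hypothetical helper)."""
    if value is None:
        return 'None'
    if isinstance(value, bool):
        # Checked before int, because bool is a subclass of int.
        return 'True' if value else 'False'
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        escaped = value.replace('\\', '\\\\').replace('"', '\\"')
        return '"' + escaped + '"'
    if isinstance(value, (list, tuple)):
        # Tuples are rendered as lists, like in the output above.
        return '[' + ','.join(render_value(item) for item in value) + ']'
    if isinstance(value, dict):
        pairs = (render_value(k) + ':' + render_value(v)
                 for k, v in value.items())
        return '{' + ','.join(pairs) + '}'
    raise TypeError('unsupported type: %r' % type(value))


def render_assignments(data):
    """Render each top-level key as a 'name = value' line."""
    return '\n'.join('%s = %s' % (name, render_value(value))
                     for name, value in data.items())


print(render_assignments({'data1': {'data22': {'data33': 44}},
                          'data2': ['aa', 'bb', 'cc'],
                          'data3': ('dd', 'ee', 'ff', None)}))
```

On Python 3.7 or later, the lines come out in the order in which the keys were inserted.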

PySecure is now Python 3 Compatible

Changes to PySecure for Python 3 compatibility have now been checked in and pushed to PyPI.

A large amount of the labor went into refactoring nearly every occurrence of strings for string/bytes correctness. I also did an internal refactor of all of the tests (which largely just invoke a bunch of the functionalities and rely on the right exceptions to fail out when they should).

Unfortunately, I discovered that libssh’s reverse port-forwarding appears to be broken in 0.6.0 (which is incompatible with 0.5.5, for its authentication calls). This has been registered as bug #126 in their tracker.

Using “dialog” for Nice, Easy, C-Based Console Dialogs

dialog is a great command-line-based dialog tool that lets you construct twenty-three types of dialog screens that rival the best of the available dialog utilities.

It’s as simple as running the following from the command-line:

dialog --yesno "Yes or no, please." 6 30

Probably very few dialog users know that it can also be statically linked into a C application to provide the same functionality. It doesn’t help that there is almost no documentation on the subject.

This is an example of how to create a “yesno” dialog:

#include <curses.h>
#include <dialog.h>

int main()
{
    int rc;
    init_dialog(stdin, stderr);
    rc = dialog_yesno("title", "message", 0, 0);
    end_dialog();

    return rc;
}

I explicitly pre-include curses.h so dialog.h won’t go looking in the wrong place. It might be different in your situation.

To build:

gcc -o example example.c -L dialogpath -I dialogpath -ldialog -lncurses -lm

Just configure and build your dialog sources, and then use that path in place of dialogpath in the build command above.

This program will return an integer representing which button was pressed (0 for Yes, 1 for No), or 255 if the dialog was cancelled with ESC.
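
The same exit codes apply when the dialog program is run from a script instead of being linked in. The following is a minimal Python sketch, assuming that dialog is on the PATH and that its default exit codes have not been overridden. describe_exit and ask_yes_no are hypothetical names.

```python
import subprocess

# Default exit codes, as described above.
EXIT_LABELS = {0: 'yes', 1: 'no', 255: 'escape'}


def describe_exit(return_code):
    """Translate an exit code into a readable label."""
    return EXIT_LABELS.get(return_code, 'unknown')


def ask_yes_no(message, height=6, width=30, program='dialog'):
    """Show a yes/no dialog and return 'yes', 'no', 'escape' or 'unknown'.

    Needs an interactive terminal, since dialog draws with curses.
    """
    return_code = subprocess.call(
        [program, '--yesno', message, str(height), str(width)])
    return describe_exit(return_code)
```

Calling ask_yes_no("Yes or no, please.") from a terminal session shows the same dialog as the command-line example at the top of this post.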