How to use Python’s “diff” functionality

Python's standard library (the difflib module) can compare two documents and produce a set of patch instructions to get from one to the other. This is useful for 1) getting quick insight into the differences between the two (if any), or 2) producing a list of changes from which a later version of a document can be derived. Sending that list is typically much lighter than sending the whole updated document, provided the original is already available on the receiving end.

This is an example of the few lines of code required to generate a list of those instructions, and how to apply them to derive the updated document.

Our documents, for the purpose of this example:

original = """
line1
line2
line3
line4
line5
"""

updated = """
line3
line4
line5
line6
line7
"""

If all you want is to get a list of adds and removes, use ndiff:

from difflib import ndiff

def get_updates(original, updated):
    """Return a 2-tuple of (adds, removes) describing the changes to get from
    ORIGINAL to UPDATED.
    """

    diff = ndiff(original.split("\n"), updated.split("\n"))

    adds = set()
    deletes = set()
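    for row in diff:
        # Each row starts with a two-character code: '  ' (unchanged),
        # '+ ' (added), '- ' (removed) or '? ' (hint row, ignored here).
        diff_type = row[0]
        if diff_type == ' ':
            continue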

        entry = row[2:]

        if diff_type == '+':
            adds.add(entry)
        elif diff_type == '-':
            deletes.add(entry)

    return (list(adds), list(deletes))

Run:

updates = get_updates(original, updated)

updates contains a 2-tuple of adds and removes, respectively (in no particular order within each list, since they are built from sets):

(['line7', 'line6'], ['line2', 'line1'])
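
If the order of the lines matters, appending to lists instead of sets keeps the entries in the order the diff reports them (and keeps duplicates). A minimal sketch, assuming the same sample documents; get_ordered_updates is a hypothetical helper name, not part of difflib:

```python
from difflib import ndiff

def get_ordered_updates(original, updated):
    """Like get_updates(), but keep entries in diff order (hypothetical helper)."""
    adds = []
    deletes = []
    for row in ndiff(original.split("\n"), updated.split("\n")):
        if row.startswith('+ '):
            adds.append(row[2:])
        elif row.startswith('- '):
            deletes.append(row[2:])
        # Rows starting with '  ' (unchanged) or '? ' (hints) are skipped.
    return (adds, deletes)

# Same content as the sample documents above, written as plain literals.
original = "\nline1\nline2\nline3\nline4\nline5\n"
updated = "\nline3\nline4\nline5\nline6\nline7\n"

ordered_updates = get_ordered_updates(original, updated)
print(ordered_updates)
```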

If, on the other hand, you do actually need a full set of patch instructions, use SequenceMatcher:

from difflib import SequenceMatcher

def get_transforms(original, updated):
    """Get a list of patch instructions to get from ORIGINAL to UPDATED."""

    # Two strings are compared character by character, so the offsets in the
    # opcodes are character positions.
    s = SequenceMatcher(None, original, updated)

    transforms = []
    for tag, i1, i2, j1, j2 in s.get_opcodes():
        if tag == 'delete':
            transform = ('-', (i1, i2))
        elif tag == 'insert':
            transform = ('+', (i1, i2), updated[j1:j2])
        elif tag == 'replace':
            transform = ('>', (i1, i2), updated[j1:j2])
        else:
            transform = ('=', (i1, i2), (j1, j2))

        transforms.append(transform)

    return transforms

def apply_transforms(original, transforms):
    """Execute the transform instructions returned from get_transforms() to
    derive UPDATED from ORIGINAL.
    """

    updated = []
    for transform in transforms:
        if transform[0] == '-':
            pass
        elif transform[0] == '+':
            updated.append(transform[2])
        elif transform[0] == '>':
            updated.append(transform[2])
        else: # Equals.
            updated.append(original[transform[1][0]:transform[1][1]])

    return ''.join(updated)

Run:

transforms = get_transforms(original, updated)

transforms contains:

[('-', (0, 12)), ('=', (12, 31), (0, 19)), ('+', (31, 31), 'line6\nline7\n')]
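
The offsets in these tuples are character positions, because SequenceMatcher was handed two strings. If line-based instructions are preferred, one option is to pass lists of lines instead. The following is only a sketch under that assumption; it uses str.splitlines(keepends=True) so that joining the lines reproduces the document exactly:

```python
from difflib import SequenceMatcher

# Same content as the sample documents above, written as plain literals.
original = "\nline1\nline2\nline3\nline4\nline5\n"
updated = "\nline3\nline4\nline5\nline6\nline7\n"

# keepends=True keeps the trailing "\n" on every line.
original_lines = original.splitlines(keepends=True)
updated_lines = updated.splitlines(keepends=True)

matcher = SequenceMatcher(None, original_lines, updated_lines)
line_opcodes = matcher.get_opcodes()

# The offsets now index into the lists of lines, not into the strings.
for tag, i1, i2, j1, j2 in line_opcodes:
    print(tag, original_lines[i1:i2], updated_lines[j1:j2])
```

The same get_transforms() and apply_transforms() logic should carry over to lists of lines, since both only slice their inputs and join the resulting pieces.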

To derive updated from original:

updated_derived = apply_transforms(original, transforms)
print(updated == updated_derived)

Which displays:

True
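
The sample documents above only produce delete, equal and insert instructions, so the 'replace' branch is never exercised. A small sketch with two made-up strings (before and after are illustrative names only) shows when SequenceMatcher reports a replace:

```python
from difflib import SequenceMatcher

# Illustrative inputs: one character in the middle differs.
before = "abcd"
after = "abXd"

replace_opcodes = SequenceMatcher(None, before, after).get_opcodes()

# A 'replace' opcode means before[i1:i2] should be swapped for after[j1:j2].
for tag, i1, i2, j1, j2 in replace_opcodes:
    print(tag, repr(before[i1:i2]), repr(after[j1:j2]))
```

In get_transforms() this case becomes a '>' tuple carrying the replacement text, which apply_transforms() appends in the same way as an insert.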