Python: Parsing XML and Retaining the Comments

By default, Python’s built-in ElementTree module discards comments as it parses. The fix is to hand the parser a TreeBuilder subclass that records each comment as an ordinary element, a solution just obscure enough to be hard to find.

import xml.etree.ElementTree as ET

class _CommentedTreeBuilder(ET.TreeBuilder):
    def comment(self, data):
        self.start('!comment', {})
        self.data(data)
        self.end('!comment')

def parse(filepath):
    ctb = _CommentedTreeBuilder()
    xp = ET.XMLParser(target=ctb)
    tree = ET.parse(filepath, parser=xp)

    root = tree.getroot()
    # ...

When enumerating the parsed nodes, the comments will have a tag-name of “!comment”.
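
To make the tag convention concrete, here is a minimal, self-contained sketch that parses an in-memory document and lists its comments. The sample XML and the names CommentedTreeBuilder, parse_with_comments, and SAMPLE_XML are illustrative assumptions and not part of the recipe above.

```python
import io
import xml.etree.ElementTree as ET


class CommentedTreeBuilder(ET.TreeBuilder):
    """Record each XML comment as an element tagged '!comment'."""

    def comment(self, data):
        self.start('!comment', {})
        self.data(data)
        self.end('!comment')


def parse_with_comments(source):
    """Parse a path or file-like object and return the root element."""
    parser = ET.XMLParser(target=CommentedTreeBuilder())
    return ET.parse(source, parser=parser).getroot()


# Hypothetical sample document, used only for illustration.
SAMPLE_XML = b"<root><!-- first note --><item>1</item><!-- second note --></root>"

root = parse_with_comments(io.BytesIO(SAMPLE_XML))

# Comments appear in document order alongside the regular elements.
child_tags = [child.tag for child in root]
comment_texts = [node.text for node in root.iter('!comment')]

print(child_tags)
print(comment_texts)
```

The comment text is kept verbatim, including any leading and trailing whitespace. On Python 3.8 and later, constructing the builder as ET.TreeBuilder(insert_comments=True) is a built-in alternative; comment nodes then carry ET.Comment as their tag rather than a string.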

Repo: How to Parse and Use a Manifest Directly From Python

Repo is a tool from AOSP (Android) for managing a vast hierarchy of individual Git repositories. It’s basically a small Python tool that adds some abstraction around Git commands. The manifest that controls the project tree is written in XML. It can include submanifests, assign projects to different groups (so you do not have to clone all of them every time), include additional command primitives to do file copies and tweak how the manifests are loaded, etc. The manifest is written against a basic specification but, still, it is a lot easier to reuse an existing parser than to write one yourself.

Repo’s built-in manifest-parsing functionality can be imported directly from the tool’s own Python modules. Conveniently, a copy of the tool is embedded in the Repo tree itself (under .repo/repo), so we can use that one.

For example, to load a manifest:

/tree/.repo/repo$ python
>>> import manifest_xml
>>> xm = manifest_xml.XmlManifest('/tree/.repo')

To load this module from your own integration, you’ll need to add that directory to sys.path, at least temporarily.
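
One way to keep the sys.path change temporary is a small context manager. The sketch below is only an illustration: temporary_sys_path is a hypothetical helper, and a throwaway directory containing a module named fake_manifest_xml stands in for /tree/.repo/repo so that the example runs without a Repo tree.

```python
import contextlib
import importlib
import os
import sys
import tempfile


@contextlib.contextmanager
def temporary_sys_path(directory):
    """Put `directory` at the front of sys.path for the duration of the block."""
    sys.path.insert(0, directory)
    try:
        yield
    finally:
        # Remove the entry we added, even if the import failed.
        try:
            sys.path.remove(directory)
        except ValueError:
            pass


# Stand-in for /tree/.repo/repo: a throwaway directory holding one module.
with tempfile.TemporaryDirectory() as fake_tool_dir:
    module_path = os.path.join(fake_tool_dir, 'fake_manifest_xml.py')
    with open(module_path, 'w') as f:
        f.write("GREETING = 'loaded'\n")

    with temporary_sys_path(fake_tool_dir):
        module = importlib.import_module('fake_manifest_xml')

    # The module stays usable after the path entry is gone,
    # because it is already cached in sys.modules.
    loaded_value = module.GREETING
    path_restored = fake_tool_dir not in sys.path

print(loaded_value, path_restored)
```

A module that imports its siblings lazily, after the block has exited, would no longer find them, so in that case the path entry has to stay in place for longer.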

To explore, you can play with the “projects” (list of project objects) and “paths” properties (a dictionary of paths to project objects).

Number of projects:

>>> print(len(xm.projects))
878
>>> print(len(xm.paths))
878

paths is a dictionary keyed by each project’s relative path.

A project object looks like:

>>> p = xm.projects[0]
>>> p


>>> dir(p)
['AbandonBranch', 'AddAnnotation', 'AddCopyFile', 'AddLinkFile', 'CheckoutBranch', 'CleanPublishedCache', 'CurrentBranch', 'Derived', 'DownloadPatchSet', 'Exists', 'GetBranch', 'GetBranches', 'GetCommitRevisionId', 'GetDerivedSubprojects', 'GetRegisteredSubprojects', 'GetRemote', 'GetRevisionId', 'GetUploadableBranch', 'GetUploadableBranches', 'HasChanges', 'IsDirty', 'IsRebaseInProgress', 'MatchesGroups', 'PostRepoUpgrade', 'PrintWorkTreeDiff', 'PrintWorkTreeStatus', 'PruneHeads', 'StartBranch', 'Sync_LocalHalf', 'Sync_NetworkHalf', 'UncommitedFiles', 'UploadForReview', 'UserEmail', 'UserName', 'WasPublished', '_ApplyCloneBundle', '_CheckDirReference', '_CheckForSha1', '_Checkout', '_CherryPick', '_CopyAndLinkFiles', '_ExtractArchive', '_FastForward', '_FetchArchive', '_FetchBundle', '_GetSubmodules', '_GitGetByExec', '_InitAnyMRef', '_InitGitDir', '_InitHooks', '_InitMRef', '_InitMirrorHead', '_InitRemote', '_InitWorkTree', '_IsValidBundle', '_LoadUserIdentity', '_Rebase', '_ReferenceGitDir', '_RemoteFetch', '_ResetHard', '_Revert', '_UpdateHooks', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_allrefs', '_getLogs', '_gitdir_path', '_revlist', '_userident_email', '_userident_name', 'annotations', 'bare_git', 'bare_objdir', 'bare_ref', 'clone_depth', 'config', 'copyfiles', 'dest_branch', 'enabled_repo_hooks', 'getAddedAndRemovedLogs', 'gitdir', 'groups', 'is_derived', 'linkfiles', 'manifest', 'name', 'objdir', 'old_revision', 'optimized_fetch', 'parent', 'rebase', 'relpath', 'remote', 'revisionExpr', 'revisionId', 'shareable_dirs', 'shareable_files', 'snapshots', 'subprojects', 'sync_c', 'sync_s', 'upstream', 'work_git', 'working_tree_dirs', 'working_tree_files', 'worktree']

The relative path for the project:

>>> path = p.relpath
>>> xm.paths[path]

The revision for the project:

>>> p.revisionExpr
u'master'

The remote for the project:

>>> p.GetRemote('origin').url
u'ssh://gerrit.company.com:2537/android/platform/external/lzma'

You can also get a config object representing the Git config for the bare archive of the project:

>>> p.config


>>> dir(p.config)
['ForRepository', 'ForUser', 'GetBoolean', 'GetBranch', 'GetRemote', 'GetString', 'GetSubSections', 'Global', 'Has', 'HasSection', 'SetString', 'UrlInsteadOf', '_ForUser', '_Global', '_Read', '_ReadGit', '_ReadJson', '_SaveJson', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_branches', '_cache', '_cache_dict', '_do', '_json', '_remotes', '_section_dict', '_sections', 'defaults', 'file']

>>> p.config.file
u'/tree/.repo/projects/external/lzma.git/config'

An example of how to efficiently build a mapping of project names to paths, caching the result for each tree:

import os
import sys

# Cache of tree path -> {project name: relative path}.
_MAPPING_CACHE = {}

def get_repo_project_to_path_mapping(path):
    try:
        return _MAPPING_CACHE[path]
    except KeyError:
        pass

    repo_meta_path = os.path.join(path, '.repo')
    repo_tool_path = os.path.join(repo_meta_path, 'repo')

    # Make the Repo tool embedded in the tree importable.
    if repo_tool_path not in sys.path:
        sys.path.insert(0, repo_tool_path)

    import manifest_xml

    xm = manifest_xml.XmlManifest(repo_meta_path)
    project_to_path_mapping = {}
    # Use a distinct loop variable so that the `path` argument, which is
    # the cache key, is not overwritten.
    for project_path, p in xm.paths.items():
        project_to_path_mapping[str(p.name)] = str(project_path)

    _MAPPING_CACHE[path] = project_to_path_mapping
    return project_to_path_mapping

C#: Parsing a CSPROJ (Project) File Using XPath

Using XPath in C# can be done several different ways through several built-in libraries, and none of them work unless you know more about the file (its XML namespace, in particular) than would be required in many other languages. To make matters worse, you might also be required to do some unintuitive shenanigans. As an example, this is how to retrieve the assembly name:

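// Namespaces required by this snippet:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// The MSBuild namespace that qualifies every element in the project file.
XNamespace xmlns = "http://schemas.microsoft.com/developer/msbuild/2003";
XDocument projDefinition = XDocument.Load(projectFilepath);

// Collect the child nodes of every Project/PropertyGroup/AssemblyName element.
IEnumerable<XNode> assemblyResultsEnumerable = projDefinition
	.Element(xmlns + "Project")
	.Elements(xmlns + "PropertyGroup")
	.Elements(xmlns + "AssemblyName").Nodes<XContainer>();

IList<XNode> assemblyResults = new List<XNode>(assemblyResultsEnumerable);
if(assemblyResults.Count == 0)
{
	throw new Exception(String.Format("The project file isn't correctly structured: [{0}]", projectFilepath));
}

// The first node is the text content of the AssemblyName element.
string assemblyName = assemblyResults[0].ToString();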

Notice that we have to mash the namespace URL with the node-name in order to find the node.
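
The same kind of namespace qualification shows up outside C#. Purely as a point of comparison, this Python sketch performs an equivalent lookup with ElementTree, where the namespace URL is wrapped in braces and prefixed to the node name. The trimmed-down project XML and the name get_assembly_name are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

MSBUILD_NS = 'http://schemas.microsoft.com/developer/msbuild/2003'

# Hypothetical, heavily trimmed project file used only for illustration.
SAMPLE_CSPROJ = """\
<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup>
    <AssemblyName>ExampleApp</AssemblyName>
  </PropertyGroup>
</Project>
"""


def get_assembly_name(csproj_text):
    """Return the text of the first Project/PropertyGroup/AssemblyName node."""
    root = ET.fromstring(csproj_text)

    # '{namespace-url}Name' is how ElementTree spells a qualified name.
    query = '{%s}PropertyGroup/{%s}AssemblyName' % (MSBUILD_NS, MSBUILD_NS)
    node = root.find(query)
    if node is None:
        raise ValueError("The project file isn't correctly structured.")

    return node.text


assembly_name = get_assembly_name(SAMPLE_CSPROJ)

# Without the namespace, the same query matches nothing.
unqualified = ET.fromstring(SAMPLE_CSPROJ).find('PropertyGroup/AssemblyName')

print(assembly_name, unqualified)
```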

Efficiently Processing GPX Files in Go

Use gpxreader to process a GPX file of any size without reading the whole thing into memory. It also works around a limitation of Go’s XML decoder: you can decode one node at a time, but doing so implicitly consumes all of that node’s children, because the decoder automatically seeks to the matching close tag for validation and there is no way to disable this behavior.

An excerpt of the test-script from the project:

//...

func (gv *gpxVisitor) GpxOpen(gpx *gpxreader.Gpx) error {
    fmt.Printf("GPX: %s\n", gpx)

    return nil
}

func (gv *gpxVisitor) GpxClose(gpx *gpxreader.Gpx) error {
    return nil
}

func (gv *gpxVisitor) TrackOpen(track *gpxreader.Track) error {
    fmt.Printf("Track: %s\n", track)

    return nil
}

func (gv *gpxVisitor) TrackClose(track *gpxreader.Track) error {
    return nil
}

func (gv *gpxVisitor) TrackSegmentOpen(trackSegment *gpxreader.TrackSegment) error {
    fmt.Printf("Track segment: %s\n", trackSegment)

    return nil
}

func (gv *gpxVisitor) TrackSegmentClose(trackSegment *gpxreader.TrackSegment) error {
    return nil
}

func (gv *gpxVisitor) TrackPointOpen(trackPoint *gpxreader.TrackPoint) error {
    return nil
}

func (gv *gpxVisitor) TrackPointClose(trackPoint *gpxreader.TrackPoint) error {
    fmt.Printf("Point: %s\n", trackPoint)

    return nil
}

//...

func main() {
    var gpxFilepath string

    o := readOptions()

    gpxFilepath = o.GpxFilepath

    f, err := os.Open(gpxFilepath)
    if err != nil {
        panic(err)
    }

    defer f.Close()

    gv := newGpxVisitor()
    gp := gpxreader.NewGpxParser(f, gv)

    err = gp.Parse()
    if err != nil {
        fmt.Printf("Error: %s\n", err.Error())
        os.Exit(1)
    }
}

Output:

$ gpxreadertest -f 20140909.gpx
GPX: GPX
Track: Track
Track segment: TrackSegment
Point: TrackPoint
Point: TrackPoint
Point: TrackPoint
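
For readers working in Python rather than Go, the same streaming idea can be sketched with the standard library’s ET.iterparse. This is not gpxreader or a port of it, just an analogous technique; the sample GPX data and the name iter_track_points are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Namespace of GPX 1.1 documents, in ElementTree's '{url}' notation.
GPX_NS = '{http://www.topografix.com/GPX/1/1}'

# Hypothetical two-point track used only for illustration.
SAMPLE_GPX = b"""\
<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="example">
  <trk>
    <trkseg>
      <trkpt lat="26.1" lon="-80.1"><ele>3.0</ele></trkpt>
      <trkpt lat="26.2" lon="-80.2"><ele>4.0</ele></trkpt>
    </trkseg>
  </trk>
</gpx>
"""


def iter_track_points(source):
    """Yield (lat, lon) for each track point as its close tag is reached."""
    for _event, element in ET.iterparse(source, events=('end',)):
        if element.tag == GPX_NS + 'trkpt':
            yield float(element.get('lat')), float(element.get('lon'))

            # Drop the point's children and attributes once it has been
            # consumed. The emptied element stays attached to its parent,
            # so this limits the per-point cost rather than removing it.
            element.clear()


points = list(iter_track_points(io.BytesIO(SAMPLE_GPX)))
print(points)
```

Because the generator reacts to "end" events, each point is handed over with its children already parsed, which mirrors printing the point from TrackPointClose in the Go excerpt.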