Space-Filling Curves to Map Multidimensional Data to a Single Dimension

A space-filling curve (SFC) is, essentially, a line that bends through a space and hits every point. It was made possible by a proof created by Georg Cantor in the late 1800s. You can then map a coordinate in N-dimensions on to it to determine a single, identifying, scalar index. Google’s S2 library uses a specific SFC called a Hilbert curve (which is a fractal) and some other tricks to map geographic coordinates to a single 64-bit integer. This has the characteristic of assigning any coordinate within the same locality a very similar index; the closer the points are, the more of the prefix will be the same. Indeed, the difference between one SFC and another is, physically, how the curve bends and, ultimately, how it preserves locality. Hilbert is most popular for this though it is not the simplest SFC. For a list of various curves, look in the “See also” section of the Wikipedia page. Each of those references shows illustrations of what the curves look like.

As far as the index produced by S2, you can compare smaller, finer areas and larger, rougher areas by comparing a larger prefix or a smaller prefix, respectively. This not only allows you to easily search an index for places in a specific area but to also implicitly determine if one area contains another.

If you are interested in the mechanics of S2 (the Go version), the mapping magic happens in the following progression.

First, convert the LatLng to a Point (which just encapsulates a vector type provided by the R3 sibling project for three-dimensional coordinate systems):

func PointFromLatLng(ll LatLng) Point {
    phi := ll.Lat.Radians()
    theta := ll.Lng.Radians()
    cosphi := math.Cos(phi)
    return Point{r3.Vector{math.Cos(theta) * cosphi, math.Sin(theta) * cosphi, math.Sin(phi)}}
}

Second, convert the Point to face coordinates:

func cellIDFromPoint(p Point) CellID {
    f, u, v := xyzToFaceUV(r3.Vector{p.X, p.Y, p.Z})
    i := stToIJ(uvToST(u))
    j := stToIJ(uvToST(v))
    return cellIDFromFaceIJ(f, i, j)
}

Third, convert the face coordinates to the one-dimensional index (a 64-bit integer aliased as the CellID type):

func cellIDFromFaceIJ(f, i, j int) CellID {
    // Note that this value gets shifted one bit to the left at the end
    // of the function.
    n := uint64(f) <= 0; k-- {
        mask := (1 <>uint(k*lookupBits))&mask) <>uint(k*lookupBits))&mask) <>2) << (uint(k) * 2 * lookupBits)
        bits &= (swapMask | invertMask)
    }
    return CellID(n*2 + 1)
}

There is also the function that allows you to step along the curve, which is very inexpensive:

func (ci CellID) AdvanceWrap(steps int64) CellID {
    if steps == 0 {
        return ci
    }

    // We clamp the number of steps if necessary to ensure that we do not
    // advance past the End() or before the Begin() of this level.
    shift := uint(2*(maxLevel-ci.Level()) + 1)
    if steps > shift); steps > shift)
            steps %= wrap
            if steps > shift); steps > max {
            wrap := int64(wrapOffset >> shift)
            steps %= wrap
            if steps > max {
                steps -= wrap
            }
        }
    }

    // If steps is negative, then shifting it left has undefined behavior.
    // Cast to uint64 for a 2's complement answer.
    return CellID(uint64(ci) + (uint64(steps) << shift))
}

The S2 project has its own Hilbert implementation. To see an implementation that excludes all of the sphere mechanics (the presence of which makes things considerably more complicated), see Google’s hilbert (Go implementation) project.

Space-filling functions are a fantastically-useful method for clustering anything that you might have a requirement for, though you might need a mathematician to come up with a process to adjust your coordinates if they are not initially acceptable as inputs (the qualification of which might also require said mathematician). Also, SFCs are non-injective: The index that you get is not necessarily unique for any two coordinates, though, in such a situation, you are guaranteed that two coordinates resulting in an identical index would basically have to be very similar themselves (the accuracy of this depends on which SFC you use).

For reference, there also exists a reversible geographic-to-string conversion library called Geohash (also here). However, this does not use SFCs and, I believe, suffers from distortion either over the equator or for great distances though I am having trouble finding where I saw that, just now. The Hilbert/S2 approach does not have this problem.

Google’s S2, geometry on the sphere, cells and Hilbert curve (great article)

Space-filling curve

Peano curve (the first known space-filling curve)

Analysis of the Clustering Properties of Hilbert Space-filling Curve (paper)

Mapping N-dimensional value to a point on Hilbert curve (SO)

S2 (Go package)

Resolving Go Import URLs

The package path that you import may not directly correlate to a repository URL. If Go can not find the package in your GOPATH, it will load the URL, redirecting as necessary, and will either look for a “meta” tag with a “name” attribute having value “go-import” or take whatever URL we are at when no more HTTP redirects are necessary (if any).

To speed things up, I wrote a short Python tool to do this and stashed it in a gist.

Make sure to install the dependencies:

  • requests
  • beautifulsoup4

Source:

import sys
import argparse
import urllib.parse

import requests
import bs4

def _get_cycle(url):
    while 1:
        print("Reading: {}".format(url))

        r = requests.get(url, allow_redirects=False)
        r.raise_for_status()

        bs = bs4.BeautifulSoup(r.text, 'html.parser')

        def meta_filter(tag): 
            # We're looking for meta-tags like this:
            #
            # <meta name="go-import" content="googlemaps.github.io/maps git https://github.com/googlemaps/google-maps-services-go">
            
            return \
                tag.name == 'meta' and \
                tag.attrs.get('name') == 'go-import'

        for m in bs.find_all(meta_filter):
            phrase = m.attrs['content']
            _, vcs, repo_url_root = phrase.split(' ')
            if vcs != 'git':
                continue

            return repo_url_root

        next_url = r.headers.get('Location')
        if next_url is None:
            break

        p = urllib.parse.urlparse(next_url)
        if p.netloc == '':
            # Take the schema, hostname, and port from the last URL.
            p2 = urllib.parse.urlparse(url)
            updated_url = '{}://{}{}'.format(p2.scheme, p2.netloc, next_url)
            print("  [{}] => [{}]".format(next_url, updated_url))

            url = updated_url
        else:
            url = next_url

    return url

def _main():
    description = "Determine the import URL for the given Go import path"
    parser = argparse.ArgumentParser(description=description)
    
    parser.add_argument(
        'import_path',
        help='Go import path')

    args = parser.parse_args()

    initial_url = "https://{}".format(args.import_path)
    final_url = _get_cycle(initial_url)

    print("Final URL: [{}]".format(final_url))

if __name__ == '__main__':
    _main()

Geometric Hashing

Like clustering (but in the backwards Bizarro World):

This principle is widely used in computer graphics, computational geometry and many other disciplines, to solve many proximity problems in the plane or in three-dimensional space, such as finding closest pairs in a set of points, similar shapes in a list of shapes, similar images in an image database, and so on. In these applications, the set of all inputs is some sort of metric space, and the hashing function can be interpreted as a partition of that space into a grid of cells. The table is often an array with two or more indices (called a grid file, grid index, bucket grid, and similar names), and the hash function returns an index tuple. This special case of hashing is known as geometric hashing or the grid method. Geometric hashing is also used in telecommunications (usually under the name vector quantization) to encode and compress multi-dimensional signals.

I could not find a sample/pseudocode implementation quickly.

 

 

Hash function

 

Ridiculously Simple Google Hash Function

A Go implementation based on the paper “A Fast, Minimal Memory, Consistent Hash Algorithm”, describing an evenly distributed hashing function called the “Jump” algorithm:

func Hash(key uint64, numBuckets int) int32 {
    var b int64 = -1
    var j int64

    for j < int64(numBuckets) {
        b = j
        key = key*2862933555777941757 + 1
        j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
    }

    return int32(b)
}

If you’re curious about that constant, it is “known to produce a good random number list” for 64-bit generators:

MSBuild/C#: How to Manage the Application Version Using a Text-File

Overview

C# applications have an “AssemblyInfo.cs” file that describe the assembly and executable versions of a project. Unfortunately, sometimes it is not possible to access this from the code. Other times, you need to drive this version from external sources (like a build system) and then use it for the build.

The approach this by keeping the version in a text-file:

  1. Manually set/update the version in a text-file.
  2. Install a package that helps us with string-replacements.
  3. Inject this to AssemblyInfo.cs during the build.
  4. Embed this file into the executing assembly.
  5. Extract this file file the executing assembly when you need to know it during execution.

The title of this post is a simplification for lack of an easy way to succinctly describe five steps in a couple of words.

Do It

Feel free to modify/customize these steps as suits your needs.

1. Create the Version File

Create a file called “executable.version” in the “Properties\” folder of your executable project. Make sure to include this in your project. In the “Properties” window, set “Build Action” to “Embedded Resource”.

2. Install the “MSBuild Community Tasks” NuGet Package

This is the “MSBuildTasks” package. This provides us a regular-expression string-replacement MSBuild task.

3. Create a template “AssemblyVersion.cs” File

Copy “Properties\AssemblyInfo.cs” to “Properties\AssemblyInfo.cs.use_this” and update the two version attributes as the bottom to be the following:

[assembly: AssemblyVersion("__EXECUTABLE_VERSION__")]
[assembly: AssemblyFileVersion("__EXECUTABLE_VERSION__")]

Make sure to include this new file in the project. Note that we name this so as to not have the “.cs” extension because, otherwise, Visual Studio will try to parse it and complain about the attributes being duplicated from the “AssemblyInfo.cs” file.

4. Add the Build Step

We are going to add a custom build target to inject the version. We personally chose to put this into a separate rules file in order to make it clear which of the build-logic was ours, but this is up to you. It would just as easily work if it were included at the bottom of your project-file. Create “Properties\build.targets” with the following:

(build.targets)

<?xml version="1.0" encoding="utf-8" ?>
<Project ToolsVersion="12.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">

<!-- Inject a version from a text-file into AssemblyVersion.cs . We do this 
 so that it's easier for the application to know its own version [by 
 reading the text file].
 -->
 <Import Project="$(ProjectDir)..\packages\MSBuildTasks.1.5.0.196\tools\MSBuild.Community.Tasks.Targets" /> 
 <Target Name="InjectVersion" BeforeTargets="BeforeBuild">
 <!-- Read the version from our text file. This appears to automatically 
 trim (probably per line). This is located in the project root so 
 that we copy the file to the output-path rather than establishing 
 a whole Properties/ directory in the output path.
 -->
 <ReadLinesFromFile File="$(ProjectDir)Properties\executable.version">
 <Output TaskParameter="Lines" PropertyName="ExecutableVersion" />
 </ReadLinesFromFile>

<!-- Print it to the build output whether we're in debug-mode or not. -->
 <Message Importance="High" Text="Executable version is [$(ExecutableVersion)]"/>

<!-- Copy our template file to the output file. -->
 <Copy SourceFiles="$(ProjectDir)Properties/AssemblyInfo.cs.use_this" DestinationFiles="$(ProjectDir)Properties/AssemblyInfo.cs"/>

<!-- Do an RX replace of the version on to the token. -->

<ItemGroup>
 <WriteFiles Include='$(ProjectDir)Properties/AssemblyInfo.cs' />
 </ItemGroup>

<FileUpdate 
 Files="@(WriteFiles)"
 Regex="__EXECUTABLE_VERSION__"
 ReplacementText="$(ExecutableVersion)"
 />

<!-- Replace the cautionary note about how to use the file with one 
 saying that any changes will be lost (if made to the output file). 
 -->
 <FileUpdate 
 Files="@(WriteFiles)"
 Regex="// TEMPLATE:.+"
 ReplacementText="// THIS FILE IS GENERATED! Apply any changes to 'AssemblyInfo.cs.use_this', instead."
 />
 </Target>
</Project>

IMPORTANT: Notice that we have to import the build targets provided by the “MSBuildTasks” package:

$(ProjectDir)..\packages\MSBuildTasks.1.5.0.196\tools\MSBuild.Community.Tasks.Targets

For us, NuGet packages go into the “packages” directory that is in the parent directory of our project directory. Also notice that we have to embed the version for this NuGet package. If your package is a different version or is located in a different place, you will have to update the example to be accurate.

NOTE: One way to get around having to embed the version is to bypass putting this package in your “packages.config file” and, instead, do a manual NuGet install of this package from a build-task to your packages directory (whereever it is) while also passing the “-ExcludeVersion” argument so as to not put the version in the package’s directory name.

Now, import the “build.targets” file from your project file. Put it somewhere near the bottom. Since it will run before the “BeforeBuild” target, we put it before that (which will be commented-out unless you use it):

 <Import Project="$(MSBuildToolsPath)\Microsoft.CSharp.targets" />
 <Import Project="Properties\build.targets" />

5. Reading the Version From the Application

At this point, you should be able to build your project. The only thing that might be considered a disadvantage to this method is that, every time you build your project from inside Visual Studio, you will be prompted to reload the “AssemblyInfo.cs” file because it has been updated from outside of VS even if it has not changed (which is no stupider than the amount of work that we are required to do in order to find our own version). It would be easiest to check the box in this popup that says to only tell you if you happen to have unsaved changes to a file that has been changed from outside VS.

In our case, we are using the CLAP command-line parser. So, we added a private “ExecutableVersion” getter on the class that we are using to handle our subcommands. Then, we added a “version” subcommand that reads and prints the new property. Code for the property:

(Program.cs)

private string executableVersion = null;

private string ExecutableVersion
{
    get
    {
        if (executableVersion == null)
        {
            Assembly assembly = Assembly.GetExecutingAssembly();
            string assemblyName = assembly.GetName().Name;

            // "Properties" is required since it is located in the 
            // Properties folder of the project and was thusly embedded 
            // as such.
            string filepath = assemblyName + @".Properties.executable.version";

            string[] names = assembly.GetManifestResourceNames();
            var stream = assembly.GetManifestResourceStream(filepath);
            if (stream == null)
            {
                throw new Exception(String.Format("Could not get resource-stream with name [{0}] for version content from assembly [{1}]. Available: {2}", filepath, assembly.FullName, String.Join(",", names)));
            }

            TextReader tr = new StreamReader(stream);
            executableVersion = tr.ReadToEnd().Trim();
        }

        return executableVersion;
    }
}