Recursively Scanning a Path with Filters under Python

Python has a great function for walking a tree: os.walk(). It’s a simple generator (you just enumerate it), and at each node (a specific child path) it gives you 1) the current path, 2) a list of child directories, and 3) a list of child files. You can even prune the list of child directories in place, on the fly, to control which of them it descends into. However, it doesn’t take any filters. What if you just want to give it inclusion/exclusion rules and then see the matching results?
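
For comparison, this is roughly what the manual approach looks like with os.walk() alone: prune the directory list in place to control descent, and match the filenames yourself. This is just a sketch, and the 'init' and 'net*' patterns are only illustrative:

import fnmatch
import os

root_path = '/etc'

for (current_path, child_dirs, child_files) in os.walk(root_path):
    # os.walk() honors in-place edits to this list, so this prunes the walk.
    child_dirs[:] = [d for d in child_dirs if d == 'init']

    for filename in child_files:
        if fnmatch.fnmatch(filename, 'net*'):
            print("File: [%s]" % (os.path.join(current_path, filename),))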

Enter pathscan. This library silently starts a background worker (a separate process) that scans the directory structure in parallel while forwarding results to the foreground. To install, just install the pathscan package. It requires Python 3.4.

The library is consumed as a generator:

import fss.constants
import fss.config.log
import fss.orchestrator

root_path = '/etc'

filter_rules = [
    (fss.constants.FT_DIR, fss.constants.FILTER_INCLUDE, 'init'),
    (fss.constants.FT_FILE, fss.constants.FILTER_INCLUDE, 'net*'),
    (fss.constants.FT_FILE, fss.constants.FILTER_EXCLUDE, 'networking.conf'),
]

o = fss.orchestrator.Orchestrator(root_path, filter_rules)
for (entry_type, entry_filepath) in o.recurse():
    if entry_type == fss.constants.FT_DIR:
        print("Directory: [%s]" % (entry_filepath,))
    else: # entry_type == fss.constants.FT_FILE:
        print("File: [%s]" % (entry_filepath,))

# Directory: [/etc/init]
# File: [/etc/networks]
# File: [/etc/netconfig]
# File: [/etc/init/network-interface-container.conf]
# File: [/etc/init/network-interface-security.conf]
# File: [/etc/init/network-interface.conf]

A command-line tool is also included:

$ pathscan -i "i*.h" -id php /usr/include 
F /usr/include/iconv.h
F /usr/include/ifaddrs.h
F /usr/include/inttypes.h
F /usr/include/iso646.h
D /usr/include/php

Very Easy, Pleasant, Secure, and Python-Accessible Distributed Storage With Tahoe LAFS

Tahoe is a file-level distributed filesystem, and it’s a joy to use. “LAFS” stands for “Least Authority Filesystem”. According to the homepage:

Even if some of the servers fail or are taken over by an attacker, the 
entire filesystem continues to function correctly, preserving your privacy 
and security.

Tahoe comes with a beautiful built-in web UI, and can be accessed via its CLI (using a syntax similar to SCP), via REST (that’s right), or from Python using pyFilesystem (an abstraction layer that also works with SFTP, S3, FTP, and many others). It gives you very direct control over how files are sharded/replicated. The shards are referred to as shares.
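
As a taste of the REST interface, the node’s web API (the same one that backs the WUI) can be queried directly; a GET of /uri/<cap>?t=json, for example, describes a directory as JSON. Below is a minimal sketch, in Python 2 to match the rest of this section, assuming a node on the default port and a directory cap that you already hold (such as the one printed by “tahoe list-aliases”, further down):

import json
import urllib
import urllib2

# Assumed values: the default web.port, and a placeholder directory cap.
webapi_url = 'http://127.0.0.1:3456'
dir_uri = 'URI:DIR2:xyzxyzxyzxyzxyzxyzxyzxyz:abcabcabcabcabcabcabcabcabcabcabc'

response = urllib2.urlopen('%s/uri/%s?t=json' % (webapi_url, urllib.quote(dir_uri)))

# The directory is described as a (node-type, node-info) pair, with the
# children keyed by name.
(node_type, node_info) = json.loads(response.read())
for child_name in node_info['children']:
    print(child_name)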

Tahoe requires an “introducer” node that announces the storage nodes to each other. You can easily run a one-node cluster by installing the node in the default ~/.tahoe directory, the introducer in another directory, and dropping the “shares” configurables down to 1.

Installing

Just install the package:

$ sudo apt-get install tahoe-lafs

You might also be able to install it directly using pip (the Apt package distributes the same project):

$ sudo pip install allmydata-tahoe

Configuring as Client

  1. Provision the client:
    $ tahoe create-client
    
  2. Update ~/.tahoe/tahoe.cfg:
    # Identify the local node.
    nickname = 
    
    # This is the furl for the public TestGrid.
    introducer.furl = pb://hckqqn4vq5ggzuukfztpuu4wykwefa6d@publictestgrid.e271.net:50213,198.186.193.74:50213/introducer
    
  3. Start node:
    $ tahoe start
    

Web Interface (WUI):

The UI is available at http://127.0.0.1:3456.

To change the UI to bind on all interfaces, update web.port:

web.port = tcp:3456:interface=0.0.0.0

Command-Line Interface (CLI):

To start manipulating files with tahoe, we need an alias. Aliases are similar to anonymous buckets: when you create an alias, you create a bucket. If you misplace the alias (or the directory URI that it represents), you’re up the creek. It’s standard operating procedure to copy the private/aliases file (in your main Tahoe directory) between the various nodes of your cluster.

  1. Create an alias (bucket):
    $ tahoe create-alias tahoe
    

    We use “tahoe” since that’s the conventional default.

  2. Manipulate it:

    $ tahoe ls tahoe:
    $
    

The tahoe command is similar to scp, in that you pass it the standard file-management subcommands and use the standard “colon” syntax (alias:path) to refer to the remote resource.

If you’d like to view this alias/directory/bucket in the WUI, run “tahoe list-aliases” to dump your aliases:

# tahoe list-aliases
  tahoe: URI:DIR2:xyzxyzxyzxyzxyzxyzxyzxyz:abcabcabcabcabcabcabcabcabcabcabc

Then, take the whole URI string (“URI:DIR2:xyzxyzxyzxyzxyzxyzxyzxyz:abcabcabcabcabcabcabcabcabcabcabc”), plug it into the input field beneath “OPEN TAHOE-URI:”, and click “View file or Directory”.

Configuring as Peer (Client and Server)

First, an introducer has to be created to announce the nodes.

Creating the Introducer

$ mkdir tahoe_introducer
$ cd tahoe_introducer/
~/tahoe_introducer$ tahoe create-introducer .

Introducer created in '/home/dustin/tahoe_introducer'

$ ls -l
total 8
-rw-rw-r-- 1 dustin dustin 520 Sep 16 13:35 tahoe.cfg
-rw-rw-r-- 1 dustin dustin 311 Sep 16 13:35 tahoe-introducer.tac

# This is an introducer-specific tahoe.cfg. Set the nickname.
~/tahoe_introducer$ vim tahoe.cfg 

~/tahoe_introducer$ tahoe start .
STARTING '/home/dustin/tahoe_introducer'

~/tahoe_introducer$ cat private/introducer.furl 
pb://wa3mb3l72aj52zveokz3slunvmbjeyjl@192.168.10.108:58294,192.168.24.170:58294,127.0.0.1:58294/5orxjlz6e5x3rtzptselaovfs3c5rx4f

Configuring Client/Server Peer

  1. Create the node:
    $ tahoe create-node
    
  2. Update configuration (~/.tahoe/tahoe.cfg).
    • Set nickname, and set introducer.furl to the furl of the introducer, captured just above.
    • Set the shares config. We’ll only have one node for this example. needed represents the number of pieces required to rebuild a file, happy represents the number of nodes that must accept pieces for a write to succeed, and total represents the number of pieces that get created:
      shares.needed = 1
      shares.happy = 1
      shares.total = 1
      

      You may also wish to set the web.port item as we did in the client section, above.

  3. Start the node:

    $ tahoe start
    STARTING '/home/dustin/.tahoe'
    
  4. Test a file-operation:
    $ tahoe create-alias tahoe
    Alias 'tahoe' created
    
    $ tahoe ls
    
    $ tahoe cp /etc/fstab tahoe:
    Success: files copied
    
    $ tahoe ls
    fstab
    

Accessing From Python

  1. Install the Python package:
    $ sudo pip install fs
    
  2. List the files:
    import fs.contrib.tahoelafs
    
    dir_uri = 'URI:DIR2:um3z3xblctnajmaskpxeqvf3my:fevj3z54toroth5eeh4koh5axktuplca6gfqvht26lb2232szjoq'
    webapi_url = 'http://yourserver:3456'
    
    t = fs.contrib.tahoelafs.TahoeLAFS(dir_uri, webapi=webapi_url)
    files = t.listdir()
    

    This will return a list of strings (filenames). If you don’t provide webapi, the local system and default port are assumed.
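
Reading and writing go through the same object, using the standard pyFilesystem file methods. Below is a minimal sketch, continuing from the t object created above; the filename hello.txt is just an example:

f = t.open('hello.txt', 'wb')
f.write('hello from pyFilesystem')
f.close()

print(t.listdir())
# e.g. ['hello.txt']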

Troubleshooting

If the logo in the upper left-hand corner of the UI doesn’t load, try the following, making whatever path adjustments are necessary in your environment:

$ cd /usr/lib/python2.7/dist-packages/allmydata/web/static
$ sudo mkdir img && cd img
$ sudo wget https://raw.githubusercontent.com/tahoe-lafs/tahoe-lafs/master/src/allmydata/web/static/img/logo.png
$ tahoe restart

This is a bug, where the image isn’t being included in the Python package:

logo.png is not found in allmydata-tahoe as installed via easy_install and pip

If you’re trying to do a copy and you get an AssertionError, this is likely a known bug in 1.10.0:

# tahoe cp tahoe:fake_data .
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/runner.py", line 156, in run
    rc = runner(sys.argv[1:], install_node_control=install_node_control)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/runner.py", line 141, in runner
    rc = cli.dispatch[command](so)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/cli.py", line 551, in cp
    rc = tahoe_cp.copy(options)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 770, in copy
    return Copier().do_copy(options)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 451, in do_copy
    status = self.try_copy()
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 512, in try_copy
    return self.copy_to_directory(sources, target)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 672, in copy_to_directory
    self.copy_files_to_target(self.targetmap[target], target)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 703, in copy_files_to_target
    self.copy_file_into(source, name, target)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 748, in copy_file_into
    target.put_file(name, f)
  File "/usr/lib/python2.7/dist-packages/allmydata/scripts/tahoe_cp.py", line 156, in put_file
    precondition(isinstance(name, unicode), name)
  File "/usr/lib/python2.7/dist-packages/allmydata/util/assertutil.py", line 39, in precondition
    raise AssertionError, "".join(msgbuf)
AssertionError: precondition: 'fake_data' <type 'str'>

Try using an explicit destination filename/filepath (e.g. “tahoe cp tahoe:fake_data ./fake_data”) rather than just a dot.

See Inconsistent ‘tahoe cp’ behavior for more information.

ZFS for Volume Management and RAID

ZFS is an awesome filesystem, developed by Sun and ported to Linux. Although it is not a distributed filesystem, it emphasizes durability and simplicity. It’s essentially an alternative to the common combination of md and LVM.

I’m not going to actually go into a RAID configuration here, but the following should be intuitive enough to send you on your way. I’m using Ubuntu 13.10.

$ sudo apt-get install zfs-fuse 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Suggested packages:
  nfs-kernel-server kpartx
The following NEW packages will be installed:
  zfs-fuse
0 upgraded, 1 newly installed, 0 to remove and 34 not upgraded.
Need to get 1,258 kB of archives.
After this operation, 3,302 kB of additional disk space will be used.
Get:1 http://us.archive.ubuntu.com/ubuntu/ saucy/universe zfs-fuse amd64 0.7.0-10.1 [1,258 kB]
Fetched 1,258 kB in 1s (750 kB/s)   
Selecting previously unselected package zfs-fuse.
(Reading database ... 248708 files and directories currently installed.)
Unpacking zfs-fuse (from .../zfs-fuse_0.7.0-10.1_amd64.deb) ...
Processing triggers for ureadahead ...
Processing triggers for man-db ...
Setting up zfs-fuse (0.7.0-10.1) ...
 * Starting zfs-fuse zfs-fuse                                                                                               [ OK ] 
 * Immunizing zfs-fuse against OOM kills and sendsigs signals...                                                            [ OK ] 
 * Mounting ZFS filesystems...                                                                                              [ OK ] 
Processing triggers for ureadahead ...

$ sudo zpool list
no pools available

$ dd if=/dev/zero of=/home/dustin/zfs1.part bs=1M count=64
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.0588473 s, 1.1 GB/s

$ sudo zpool create zfs_test /home/dustin/zfs1.part 

$ sudo zpool list
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zfs_test  59.5M    94K  59.4M     0%  1.00x  ONLINE  -

$ sudo dd if=/dev/zero of=/zfs_test/dummy_file bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 1.3918 s, 7.5 MB/s

$ ls -l /zfs_test/
total 9988
-rw-r--r-- 1 root root 10485760 Mar  7 21:51 dummy_file

$ sudo zpool list
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zfs_test  59.5M  10.2M  49.3M    17%  1.00x  ONLINE  -

$ sudo zpool status zfs_test
  pool: zfs_test
 state: ONLINE
 scrub: none requested
config:

	NAME                      STATE     READ WRITE CKSUM
	zfs_test                  ONLINE       0     0     0
	  /home/dustin/zfs1.part  ONLINE       0     0     0

errors: No known data errors

So, now we have one pool with one disk. However, ZFS also allows hot reconfiguration. Add (stripe) another disk to the pool:

$ dd if=/dev/zero of=/home/dustin/zfs2.part bs=1M count=64
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.0571095 s, 1.2 GB/s

$ sudo zpool add zfs_test /home/dustin/zfs2.part 
$ sudo zpool status zfs_test
  pool: zfs_test
 state: ONLINE
 scrub: none requested
config:

	NAME                      STATE     READ WRITE CKSUM
	zfs_test                  ONLINE       0     0     0
	  /home/dustin/zfs1.part  ONLINE       0     0     0
	  /home/dustin/zfs2.part  ONLINE       0     0     0

errors: No known data errors

$ sudo dd if=/dev/zero of=/zfs_test/dummy_file2 bs=1M count=70
70+0 records in
70+0 records out
73400320 bytes (73 MB) copied, 12.4728 s, 5.9 MB/s

$ sudo zpool list
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zfs_test   119M  80.3M  38.7M    67%  1.00x  ONLINE  -

I should mention that there is some disk-space overhead, or, at least, some need to explicitly optimize the disk (if possible). Though I assigned two 64M “disks” to the pool, I received “out of space” errors when I first wrote a 10M file and then attempted to write an 80M file. It was successful when I chose to write a 70M file, instead.

You can also view IO stats:

$ sudo zpool iostat -v zfs_test
                             capacity     operations    bandwidth
pool                      alloc   free   read  write   read  write
------------------------  -----  -----  -----  -----  -----  -----
zfs_test                  80.5M  38.5M      0     11    127   110K
  /home/dustin/zfs1.part  40.4M  19.1M      0      6    100  56.3K
  /home/dustin/zfs2.part  40.1M  19.4M      0      5     32  63.0K
------------------------  -----  -----  -----  -----  -----  -----

For further usage examples, see the ZFS documentation and the many tutorials available online.

Using AUFS to Combine Directories (With AWESOME Benefits)

A stackable or unification filesystem (referred to as a “union” filesystem) is one that combines the contents of many directories into a single directory. Junjiro Okajima’s AUFS (“aufs-tools”, under Ubuntu) is such an FS. However, it has some neat attributes that tools such as Docker take advantage of, and this is where it really gets interesting. I’ll discuss only one such feature, here.

AUFS has the concept of “branches”, where each branch is one directory to be combined. In addition, each branch has permissions imposed upon it: essentially “read-only” or “read-write”. By default, the first branch is read-write, and all others are read-only. With this as the foundation, AUFS presents a single, stacked filesystem that behaves like a traditional filesystem externally, while internally imposing special handling on what can actually be modified.

When a delete is performed against a read-only branch, AUFS performs a “whiteout”: the read-only directories are untouched, but hidden marker files are created in the writable directory to record the change. Similar tracking, along with any actual new files, occurs when any other change is applied to a read-only branch. This also provides “copy-on-write” functionality, where copies of files are made only when necessary, and on demand.

$ mkdir /tmp/dir_a
$ mkdir /tmp/dir_b
$ mkdir /tmp/dir_c
$ touch /tmp/dir_a/file_a
$ touch /tmp/dir_b/file_b
$ touch /tmp/dir_c/file_c

$ sudo mkdir -p /mnt/aufs_test
$ sudo mount -t aufs -o br=/tmp/dir_a:/tmp/dir_b:/tmp/dir_c none /mnt/aufs_test/

$ ls -l /mnt/aufs_test/
total 0
-rw-r--r-- 1 dustin dustin 0 Feb 11 23:31 file_a
-rw-r--r-- 1 dustin dustin 0 Feb 11 23:31 file_b
-rw-r--r-- 1 dustin dustin 0 Feb 11 23:31 file_c

$ ls -l /tmp/dir_c
total 0
-rw-r--r-- 1 dustin dustin 0 Feb 11 23:31 file_c

$ touch /mnt/aufs_test/new_file_in_unwritable

$ ls -l /tmp/dir_c
total 0
-rw-r--r-- 1 dustin dustin 0 Feb 11 23:31 file_c

$ ls -l /tmp/dir_a
total 0
-rw-r--r-- 1 dustin dustin 0 Feb 11 23:31 file_a
-rw-r--r-- 1 dustin dustin 0 Feb 11 23:33 new_file_in_unwritable

$ rm /mnt/aufs_test/file_c

$ ls -l /tmp/dir_c
total 0
-rw-r--r-- 1 dustin dustin 0 Feb 11 23:31 file_c

$ ls -l /tmp/dir_a
total 0
-rw-r--r-- 1 dustin dustin 0 Feb 11 23:31 file_a
-rw-r--r-- 1 dustin dustin 0 Feb 11 23:33 new_file_in_unwritable

$ ls -la /tmp/dir_a
total 16
drwxr-xr-x  4 dustin dustin 4096 Feb 11 23:35 .
drwxrwxrwt 17 root   root   4096 Feb 11 23:35 ..
-rw-r--r--  1 dustin dustin    0 Feb 11 23:31 file_a
-rw-r--r--  1 dustin dustin    0 Feb 11 23:33 new_file_in_unwritable
-r--r--r--  2 root   root      0 Feb 11 23:31 .wh.file_c
-r--r--r--  2 root   root      0 Feb 11 23:31 .wh..wh.aufs
drwx------  2 root   root   4096 Feb 11 23:31 .wh..wh.orph
drwx------  2 root   root   4096 Feb 11 23:31 .wh..wh.plnk

Notice that we use mount rather than mount.aufs. It should also be mentioned that some filesystems hosting the subordinate directories can be problematic: cramfs, for example, is specifically mentioned in the manpage as having cases in which its behavior may be undefined.

AUFS seems to lend itself especially well to process containers.

For more information, visit the homepage and Okajima’s original announcement (2008).