Python GFF parser update -- parallel parsing and GFF2

By Brad Chapman

Parallel parsing

Last week we discussed refactoring the Python GFF parser to use a MapReduce framework. This was designed with the idea of being able to scale GFF parsing as file size increases. In addition to large files describing genome annotations, GFF is spreading to next-generation sequencing; SOLiD provides a tool to convert their mapping files to GFF.

Parallel processing introduces overhead from software intermediates and networking costs. In the Disco implementation of GFF parsing, parsed lines pass through Erlang and are translated to and from JSON strings. Taking on this overhead is only worthwhile if enough processors are used to overcome the slowdown. To estimate when parallelization starts to pay off, I looked at parsing a 1.5GB GFF file on a small multi-core machine and on a remote cluster. Based on rough testing and a non-scientific linear extrapolation of the results, I estimate that 8 processors are needed to start seeing a speed-up over local processing.
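
To make the shape of the MapReduce approach concrete, here is a minimal, Disco-independent sketch: a map function parses one GFF3 line into a (seqid, JSON string) pair, mirroring the JSON round trip described above, and a reduce step collects features by reference sequence. The file name example.gff3 and the function names are illustrative placeholders, not the parser's actual API.

  import json
  from collections import defaultdict

  def map_gff_line(line):
      """Map step: turn one GFF3 line into a (seqid, JSON-encoded feature) pair."""
      if not line.strip() or line.startswith("#"):
          return []
      parts = line.rstrip("\n").split("\t")
      if len(parts) != 9:
          return []
      seqid, source, ftype, start, end, score, strand, phase, attributes = parts
      feature = dict(type=ftype, start=int(start), end=int(end),
                     strand=strand, attributes=attributes)
      # The JSON round trip mimics the string translation through Erlang in Disco.
      return [(seqid, json.dumps(feature))]

  def reduce_by_seqid(pairs):
      """Reduce step: collect features under their reference sequence."""
      records = defaultdict(list)
      for seqid, feature_json in pairs:
          records[seqid].append(json.loads(feature_json))
      return records

  # Local stand-in for a distributed run: map over every line, then reduce.
  with open("example.gff3") as in_handle:
      pairs = (kv for line in in_handle for kv in map_gff_line(line))
      records = reduce_by_seqid(pairs)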

The baseline for parsing our 1.5GB file is one and a half minutes using a single processor on my commodity Dell desktop. The desktop has 4 cores; running Disco with all 4 CPUs, the time increases to 3 minutes. Once Disco itself has been set up, switching between the two is seamless since the file is parsed in shared memory.

The advantage of utilizing Disco is that it can scale from this local implementation to very large clusters. Amazon's Elastic Compute Cloud (EC2) is an amazing resource where you can quickly set up and run jobs on powerful hardware; it is essentially an instant on-demand cluster for running applications. Using the ElasticFox Firefox plugin and the setup directions for Disco on EC2, I was able to quickly test GFF parsing on a test cluster of three small instances (AMI ami-cfbc58a6, a Debian 5.0 Lenny image). For distributed jobs, the main challenges are setting up each of the cluster nodes with the software and distributing the files across the nodes. Disco provides scripts to install itself across the cluster and to distribute the file being parsed. When you are attacking a GFF parsing job that is prohibitively slow or memory intensive on your local hardware, a small cluster of extra-large high-CPU instances on EC2 will help you overcome these limitations. Hopefully in the future Disco will become available on some standard Amazon machine images, lowering the threshold to getting a job running.

In practical terms, local GFF parsing will be fine for most standard files. When you are limited by parsing time with large files, attack the problem using either a local cluster or EC2 with 8 or more processors. To better utilize a small number of local CPUs, it makes sense to explore a lightweight solution such as the new Python multiprocessing module.
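
As a rough sketch of that lighter-weight route, the standard multiprocessing module can split a GFF file into chunks of lines and parse them in a worker pool. The chunk size, the parse_chunk helper, and example.gff3 are placeholders for illustration rather than part of the parser.

  import multiprocessing

  def parse_chunk(lines):
      """Parse a chunk of GFF lines into simple (seqid, type, start, end) tuples."""
      features = []
      for line in lines:
          if line.strip() and not line.startswith("#"):
              parts = line.rstrip("\n").split("\t")
              if len(parts) == 9:
                  features.append((parts[0], parts[2], int(parts[3]), int(parts[4])))
      return features

  def chunked(handle, size=100000):
      """Yield lists of lines so each worker gets a reasonably sized unit of work."""
      chunk = []
      for line in handle:
          chunk.append(line)
          if len(chunk) >= size:
              yield chunk
              chunk = []
      if chunk:
          yield chunk

  if __name__ == "__main__":
      pool = multiprocessing.Pool(processes=4)
      with open("example.gff3") as in_handle:
          results = pool.map(parse_chunk, chunked(in_handle))
      pool.close()
      pool.join()
      features = [f for chunk in results for f in chunk]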

GFF2 support

The initial target for GFF parsing was the GFF3 standard. However, many genome centers still use the older GFF2 or GTF formats. The main parsing difference between these formats is the attributes. In GFF3, they look like:

  ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2

while in GFF2 they are less standardized, and look like:

  Transcript "B0019.1" ; WormPep "WP:CE40797" ; Note "amx-2"

The parser has been updated to handle GFF2 attributes correctly, with test cases from several genome centers. In practice, there are several tricky implementations of the GFF2 specification; if you find examples of attributes the current parser handles incorrectly, please pass them along.

GFF2 and GFF3 also differ in how nested features are handled. A standard example of nesting is specifying the coding regions of a transcript. Since GFF2 didn't provide a default way to do this, several different methods are used in practice. Currently, the parser leaves these GFF2 features flat, and you would need to write custom code on top of the parser to nest them if desired.
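
One possible approach, sketched below with hypothetical dictionary-based features rather than the parser's real objects, is to group flat features by a shared attribute such as the Transcript identifier from the GFF2 example above:

  from collections import defaultdict

  def nest_by_transcript(features):
      """Group flat GFF2 features under a shared 'Transcript' attribute.

      Each feature is assumed to be a dictionary with 'type' and 'attributes'
      keys; the real parser's feature objects will differ.
      """
      nested = defaultdict(lambda: {"transcript": None, "parts": []})
      for feature in features:
          transcript_id = feature["attributes"].get("Transcript")
          if transcript_id is None:
              continue
          if feature["type"] == "Transcript":
              nested[transcript_id]["transcript"] = feature
          else:
              nested[transcript_id]["parts"].append(feature)
      return dict(nested)

  features = [
      {"type": "Transcript", "attributes": {"Transcript": "B0019.1"}},
      {"type": "CDS", "attributes": {"Transcript": "B0019.1"}},
      {"type": "CDS", "attributes": {"Transcript": "B0019.1"}},
  ]
  nested = nest_by_transcript(features)
  # nested["B0019.1"]["parts"] now holds the two CDS features.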

The latest version of the GFF parsing code is available from GitHub. To install it, click the download link on that page; you will get the whole directory along with a setup.py file for installation. It installs outside of Biopython since it is still under development. As always, I am happy to accept any contributions or suggestions.
