MapReduce implementation of GFF parsing for Biopython

By Brad Chapman

I previously wrote up details about starting a GFF parser for Biopython. In addition to incorporating suggestions received on the Biopython mailing list, it has been redesigned for parallel computation using Disco. Disco is an implementation of the distributed MapReduce framework in Erlang and Python. The code is available from the git repository; this post describes the impetus and design behind the MapReduce revision.

The scale of biological analyses is growing quickly thanks to new sequencing technologies. Bioinformatics programmers will need to learn techniques to rapidly analyze extremely large data sets. My coding toolbox has expanded to tackle these problems in two ways. The first is exploring programming languages designed for speed and parallelism, like Haskell. Additionally, I have been learning general techniques for parallelizing programs. Both require re-thinking code design to take advantage of increasingly available multi-processor and clustered architectures.

The MapReduce framework, originally proposed by Google, exemplifies the idea of redesigning code to analyze large data sets in parallel. In short, the programmer writes two functions: map and reduce. The map function handles the raw parsing work; for instance, it parses a line of GFF text and structures the details of interest. The reduce function combines the results from the map parsing, making them available for additional processing outside of the parallel part of the job. Several MapReduce implementations are in wide use. Apache's Hadoop is a mature Java implementation with an underlying distributed file system. Here we utilize Disco, an implementation in Erlang and Python from Nokia Research Center.

The MapReduce GFF parser consists of two standalone functions. The map function takes a line of GFF text and first determines whether it should be parsed, based on a set of limits. This lets the user pull only the items of interest from the GFF file, saving memory and time:

[sourcecode language="python"]
import collections
import simplejson

def _gff_line_map(line, params):
    strand_map = {'+' : 1, '-' : -1, '?' : None, None: None}
    line = line.strip()
    # skip blank lines and comment lines
    if line and line[0] != "#":
        parts = line.split('\t')
        should_do = True
        if params.limit_info:
            for limit_name, limit_values in params.limit_info.items():
                cur_id = tuple([parts[i] for i in
                                params.filter_info[limit_name]])
                if cur_id not in limit_values:
                    should_do = False
                    break
[/sourcecode]
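As a concrete illustration of these limits, here is a hedged sketch (not from the original post) of a params object that keeps only gene and mRNA lines. The class name and the limit key are hypothetical, but the shapes match what the map function indexes into: filter_info maps each limit name to GFF column indices, and limit_info lists the allowed tuples of values for those columns.

[sourcecode language="python"]
class ExampleParams:
    # column 2 of a GFF line is the feature type
    filter_info = {"gff_type": [2]}
    # only lines whose type column matches one of these tuples are parsed
    limit_info = {"gff_type": [("gene",), ("mRNA",)]}
    jsonify = False
[/sourcecode]

Because the limits are applied in the map step, lines that fail the check never leave the worker that read them.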

If the GFF line is to be parsed, we use it to build a dictionary with all the details. Additionally, the line is classified as a top-level annotation, a standard flat feature with a location, or part of a parent/child nested feature. The results are returned as a dictionary. For the Disco parallel implementation, we use JSON to flatten the dictionary into a string:

[sourcecode language="python"]
        if should_do:
            assert len(parts) == 9, line
            gff_parts = [(None if p == '.' else p) for p in parts]
            gff_info = dict()
            # collect all of the base qualifiers for this item
            quals = collections.defaultdict(list)
            if gff_parts[1]:
                quals["source"].append(gff_parts[1])
            if gff_parts[5]:
                quals["score"].append(gff_parts[5])
            if gff_parts[7]:
                quals["phase"].append(gff_parts[7])
            for key, val in [a.split('=') for a in gff_parts[8].split(';')]:
                quals[key].extend(val.split(','))
            gff_info['quals'] = dict(quals)
            gff_info['rec_id'] = gff_parts[0]
            # if we are describing a location, then we are a feature
            if gff_parts[3] and gff_parts[4]:
                gff_info['location'] = [int(gff_parts[3]) - 1,
                                        int(gff_parts[4])]
                gff_info['type'] = gff_parts[2]
                gff_info['id'] = quals.get('ID', [''])[0]
                gff_info['strand'] = strand_map[gff_parts[6]]
                # handle flat features
                if not gff_info['id']:
                    final_key = 'feature'
                # features that have parents need to link so we can pick up
                # the relationship
                elif 'Parent' in gff_info['quals']:
                    final_key = 'child'
                # top level features
                else:
                    final_key = 'parent'
            # otherwise, associate these annotations with the full record
            else:
                final_key = 'annotation'
            return [(final_key, (simplejson.dumps(gff_info) if params.jsonify
                                 else gff_info))]
[/sourcecode]
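To make the classification concrete, here is a minimal local run of the map function on a single GFF3 line; the line, the params stand-in, and the printed result are illustrative rather than taken from the repository:

[sourcecode language="python"]
class LocalParams:
    # hypothetical stand-in for the object Disco passes as params
    limit_info = None
    filter_info = {}
    jsonify = False

line = "chr1\tmaker\tgene\t1300\t9000\t.\t+\t.\tID=gene0001;Name=EDEN"
print(_gff_line_map(line, LocalParams()))
# Roughly: [('parent', {'quals': {'source': ['maker'], 'ID': ['gene0001'],
#                                 'Name': ['EDEN']},
#                       'rec_id': 'chr1', 'location': [1299, 9000],
#                       'type': 'gene', 'id': 'gene0001', 'strand': 1})]
[/sourcecode]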

The advantage of this distinct map function is that it can be run in parallel for any line in the file. To condense the results back into a synchronous world, the reduce function takes the results of the map function and combines them into a dictionary of results:

[sourcecode language="python"]
def _gff_line_reduce(map_results, out, params):
    final_items = dict()
    for gff_type, final_val in map_results:
        send_val = (simplejson.loads(final_val) if params.jsonify else
                    final_val)
        try:
            final_items[gff_type].append(send_val)
        except KeyError:
            final_items[gff_type] = [send_val]
    for key, vals in final_items.items():
        out.add(key, (simplejson.dumps(vals) if params.jsonify else vals))
[/sourcecode]
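The out object is normally supplied by Disco. To see the reduce step in isolation, here is a hedged sketch with a simple list-backed stand-in; the ListOutput class and the sample map output are invented for illustration, and it reuses the LocalParams stand-in from the map sketch above:

[sourcecode language="python"]
class ListOutput:
    """Minimal stand-in for Disco's output writer: just collects pairs."""
    def __init__(self):
        self.items = []
    def add(self, key, val):
        self.items.append((key, val))

map_results = [("parent", {"rec_id": "chr1", "type": "gene",
                           "location": [1299, 9000], "strand": 1,
                           "id": "gene0001", "quals": {"ID": ["gene0001"]}})]
out = ListOutput()
_gff_line_reduce(map_results, out, LocalParams())
print(out.items)  # [('parent', [{... the parsed feature dictionary ...}])]
[/sourcecode]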

Finally, the dictionaries of GFF information are converted into Biopython SeqFeatures and attached to SeqRecord objects; the standard object interface is identical to that used for GenBank feature files.
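The conversion code itself lives in the repository; as a hedged sketch of the mapping, and not the post's actual implementation, one parsed dictionary translates naturally onto Biopython's SeqFeature:

[sourcecode language="python"]
from Bio.SeqFeature import SeqFeature, FeatureLocation

gff_info = {"location": [1299, 9000], "type": "gene", "strand": 1,
            "id": "gene0001", "quals": {"ID": ["gene0001"]}}
# recent Biopython releases put the strand on the location; older releases
# accepted strand= on the SeqFeature itself
location = FeatureLocation(gff_info["location"][0], gff_info["location"][1],
                           strand=gff_info["strand"])
feature = SeqFeature(location, type=gff_info["type"], id=gff_info["id"],
                     qualifiers=gff_info["quals"])
[/sourcecode]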

Rewriting the code sped up single-processor parsing by roughly 10%. Splitting the parsing from the object creation made it possible to apply some simple speed-ups, which contributed to this improvement. The hidden advantage of learning new programming frameworks is that it encourages you to think about familiar problems in different ways.

This implementation is designed to work both in parallel using Disco and locally on a standard machine. Practically, this means that Disco is not required unless you want to parallelize the code. Parallel processing may also not be the right approach for a particular problem: for small files, it is more efficient to run the code locally and avoid the overhead of distributing the work.
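Because the map and reduce functions are plain Python, the local path is just a loop. A minimal sketch, assuming the hypothetical LocalParams and ListOutput stand-ins from the earlier sketches and an illustrative file name:

[sourcecode language="python"]
out = ListOutput()
params = LocalParams()
with open("example.gff3") as in_handle:
    for line in in_handle:
        results = _gff_line_map(line, params)
        if results:
            _gff_line_reduce(results, out, params)
# out.items now holds (key, values) pairs ready for SeqFeature construction
[/sourcecode]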

When you suddenly need to apply your small GFF parsing script to many gigabytes of data, the same code can scale by plugging in Disco. To put practical numbers on that scaling, I plan to follow up on this post with tests that run Disco on Amazon's Elastic Compute Cloud.
