Recent, I was asked to sort a MAF (Multiple Alignment Format) file containing genomic alignments. This was a large vertebrate alignment file and the researchers were interested in manually examining the longest matches. The tricky part of doing this is that the entire file cannot comfortably fit in memory on a standard machine.
The common solution to dealing with large memory intensive files is to pre-parse the file and provide a way to look up individual records based on an identifier. For instance, items could be pushed into a relational database table and retrieved based on primary keys. While moving to disk access is less efficient, it allows your code to scale to these type of real life problems without resorting to buying large memory machines.
We will use the bx-python library to read the MAF alignment files, prepare an index for accessing the alignments individually, and ultimately perform the sorting.
The first step is to define the index retrieval method. bx-python has index retrival functionality that allows querying and retrieving records based on genomic location. We don't need that advanced query capability here, and can retrieve records based on position in the file as well. We build the index by looping through each record and recording the file position and genomic locations in the index.
[sourcecode language="python"]
from bx.align import maf
from bx import interval_index_file
def build_index(in_file, index_file):
indexes = interval_index_file.Indexes()
with open(in_file) as in_handle:
reader = maf.Reader(in_handle)
while 1:
pos = reader.file.tell()
rec = reader.next()
if rec is None:
break
for c in rec.components:
indexes.add(c.src, c.forward_strand_start,
c.forward_strand_end, pos, max=c.src_size )
with open(index_file, "w") as index_handle:
indexes.write(index_handle)
[/sourcecode]
Now we have a unique index to retrieve a record -- the position in the file. The second step is to loop through the file again and build a list of alignment sizes and positions. Alignment size is the first value in each list item, since this is the criteria to sort by. This list easily fits in memory, and we sort it with the largest alignments first.
[sourcecode language="python"]
rec_info = []
with open(in_file) as in_handle:
reader = maf.Reader(in_handle)
while 1:
pos = reader.file.tell()
rec = reader.next()
if rec is None:
break
rec_info.append((rec.text_size, pos))
rec_info.sort(reverse=True)
[/sourcecode]
Finally, we loop though the sorted list of sizes and use the associated positions to get records from our index. Each record is then written to an output file.
[sourcecode language="python"]
index = maf.Indexed(in_file, index_file)
with open(out_file, "w") as out_handle:
writer = maf.Writer(out_handle)
for size, pos in rec_info:
rec = index.get_at_offset(pos)
writer.write(rec)
[/sourcecode]
The full script puts this together into a usable program. This could be adjusted to deal with other alignment formats supported by bx-python, like axt files.
comments powered by Disqus