Initial GFF parser for Biopython

Brad Chapman bio photo By Brad Chapman Comment

Generic feature format (GFF) is a nice plain text file format for storing annotations on biological sequences, and would be very useful tied in with the BioSQL relational database. Two weeks ago, I detailed the Bioperl GenBank to GFF mapping, which provided some introductory background to the problem. Here I'm continuing to explore GFF and BioSQL together, but from the opposite direction.

I implemented an initial pass at a python GFF parser that will hopefully eventually be included in Biopython. You can find the current code in the git repository. I'd be very happy to have others help with development, provide usage feedback and pass along difficult GFF files for testing.

Implementation

GFF parsing is a little tricker to fit into the Biopython SeqIO system than other sequence file formats. Formats like GenBank or UniProt contain a sequence and its features combined into a single record. This allows parsers to iterate over files one at a time, returning generic sequence objects from each record. This scales with large files so your memory and processor worries are bounded by the most complicated record in the file.

In contrast, GFF files are separate from the primary sequences and do not have any guarantees about annotations for a record being grouped together. To be sure you've picked up all features for a record, you need to parse the entire GFF file. For large real-life files this becomes a problem as all of the features being added will rapidly fill up available memory.

To solve these problems, GFF parsing is implemented here as a feature addition module with filtering. This means that you first use standard Biopython SeqIO to parse the base sequence records, and then use the GFF class to add features to these initial records. The addition function has an option argument allowing added features to be limited to a subset of interest. You can limit based on record names and add all features related to a specific sequence, or you can limit based on types of GFF features and add these features to all records in the file.

This example demonstrates the use of the GFF parser to parse out all of the coding sequence features for chromosome one ('I'), and add them to the initial record:

[sourcecode language="python"]
from Bio import SeqIO
from BCBio.SeqIO.GFFIO import GFFFeatureAdder

with open(seq_file) as seq_handle:
seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
feature_adder = GFFFeatureAdder(seq_dict)
cds_limit_info = dict(
gff_types = [('Coding_transcript', 'gene'),
('Coding_transcript', 'mRNA'),
('Coding_transcript', 'CDS')],
gff_id = ['I']
)
with open(gff_file) as gff_handle:
feature_adder.add_features(gff_handle, cds_limit_info)
final_rec = feature_adder.base['I']
[/sourcecode]

This example shows the other unique aspect of GFF parsing: nested features. In the example above we pull out coding genes, mRNA transcripts, and coding sequences (CDS). These are nested as a gene can have multiple mRNAs, and CDSs are mapped to one or more mRNA transcripts. In Biopython this is handled naturally using the sub_feature attribute of SeqFeature. So when handling the record, you will dig into a gene feature to find its transcripts and coding sequences. For a more detailed description of how GFF can be mapped to complex transcripts, see the GFF3 documentation, which has diagrams and examples of different biological cases and how they are represented.

Testing

The test code features several other usage examples which should help provide familiarity with the interface. For real life testing, this was run against the latest C elegans WormBase release, WS199: GFF; sequences. On a standard single processor workstation, the code took about 2 and a half minutes to parse all PCR products and coding sequences from the 1.3G GFF file and 100M genome fasta file.

BioSQL

To go full circle back to my initial inspiration, the parsed GFF was pushed into a BioSQL database using this script. To test on your own machine, you will have to adjust the database details at the start of the script to match your local configuration instead of my test database.

Standard flattened features are well supported going into BioSQL. Nested features, like the coding sequence representation mentioned above, will need additional work. The current loader only utilizes sub_features to get location information and support the join(1..3,5..8) syntax of GenBank. The seqfeature_relationship table in BioSQL seems like the right place to start to support this.

Summary

This provides an initial implementation of GFF3 parsing support for Biopython. The interface is a proposal and I welcome suggestions on making it more intuitive. Code and test example contributions are also much appreciated. As we find an interface and implementation that works for the python community and the code stabilizes, we can work to integrate this into the Biopython project.

comments powered by Disqus