Examining and adjusting your GFF file

Brad Chapman bio photo By Brad Chapman Comment

Generic Feature Format (GFF) defines a standard template for representing biological features. Within this template, however, is room for flexibility. Different GFF producers may decide to format their data in slightly different ways. While parsing these files is not a problem, correctly interpreting and utilizing the data can be. The in-development Biopython GFF parser provides utilities to get a high level summary of the elements of a GFF file, and to adjust line items in the file during parsing.

Examining a GFF file

When downloading a new GFF file, the first step is getting an overview of the file contents. The GFF parser provides a GFFExaminer class to help with this. The first function of interest is available_limits:

[sourcecode language="python"]
import pprint
from BCBio.GFF.GFFParser import GFFExaminer
gff_examiner = GFFExaminer()
possible_limits = gff_examiner.available_limits(gff_file)
pprint.pprint(possible_limits)
[/sourcecode]

It returns a dictionary defining various ways to limit your parsing of the GFF file, along with the count of each item. As an example, here is a trimmed dump from one of the test files. This has features on two different record ids -- chromosomes 'I' and 'X', with 159 and 6 items, respectively. Also listed are the combination of the 2nd source column and 3rd type column, and the type column by itself:

{'gff_id': {('I',): 159,
            ('X',): 6},
 'gff_source_type': {('Coding_transcript', 'CDS'): 27,
                     ('Coding_transcript', 'exon'): 33,
                     ('Coding_transcript', 'five_prime_UTR'): 4,
                     ('Coding_transcript', 'gene'): 2,
                     ('Coding_transcript', 'intron'): 29,
                     ('Coding_transcript', 'mRNA'): 4,
                     ('Coding_transcript', 'three_prime_UTR'): 3,
                     ('mass_spec_genome', 'translated_nucleotide_match'): 7},
 'gff_type': {('CDS',): 57,
              ('exon',): 33,
              ('five_prime_UTR',): 4,
              ('gene',): 2,
              ('intron',): 29,
              ('mRNA',): 4,
              ('three_prime_UTR',): 3,
              ('translated_nucleotide_match',): 7}}

In addition to the overview of file contents, nested relationships are another important component of the file to understand. A summary of these is available through the parent_child_map function:

[sourcecode language="python"]
import pprint
from BCBio.GFF.GFFParser import GFFExaminer
gff_examiner = GFFExaminer()
pc_map = gff_examiner.parent_child_map(gff_file)
pprint.pprint(pc_map)
[/sourcecode]

Again, a dictionary is returned. The keys in the dictionary are parent source and type elements in the file, while the values are children of those elements. For instance, here is the dictionary for a three tiered relationship where genes have mRNAs, and each mRNA can have coding regions, exons, introns, and 5' and 3' untranslated regions:

{('Coding_transcript', 'gene'): [('Coding_transcript', 'mRNA')],
 ('Coding_transcript', 'mRNA'): [('Coding_transcript', 'CDS'),
                                 ('Coding_transcript', 'exon'),
                                 ('Coding_transcript', 'five_prime_UTR'),
                                 ('Coding_transcript', 'intron'),
                                 ('Coding_transcript', 'three_prime_UTR')]}

Adjusting GFF lines during parsing

Occasionally you may run into a file format that you can comprehend, but that does not match your expectations for where items should be. SOLiD GFF files are one example; many thanks are due to David Schruth, who has been patiently working with me on the parsing of these files. They have the read name in the first column, which is normally used for the record ID the feature maps to. The actual record ID is contained as an attribute, i=1, where the 1 corresponds to the index of record it maps to in the original FASTA alignment file:

3_336_815_F3    solid   read    55409   55428   10.4    +       .       i=1

The GFFAddingIterator has an optional attribute, line_adjust_fn, which can be used to solve this problem. The function is called each time a line is read and passed a parsed dictionary representing the line. The dictionary for the above line is:

{'id': '',
 'is_gff2': False,
 'location': [55408, 55428],
 'quals': {
           'i': ['1'],
           'score': ['10.4'],
           'source': ['solid']},
 'rec_id': '3_336_815_F3',
 'strand': 1,
 'type': 'read'}

The function takes this item as an argument and returns the dictionary, but with any adjustments that need to be made. In our example, we look up the name of the record corresponding to the i=1 index. This name is swapped in for the rec_id, while that information moves to an attribute named read_name:

[sourcecode language="python"]
from Bio import SeqIO
from BCBio.GFF.GFFParser import GFFAddingIterator

class SolidFastaRemap:
def __init__(self, initial_fasta):
self._in_map = self._get_index_map(initial_fasta)

def _get_index_map(self, in_file):
in_map = dict()
in_handle = open(in_file)
for index, rec in enumerate(SeqIO.parse(in_handle, "fasta")):
in_map[index] = rec.id
in_handle.close()
return in_map

def adjust_fn(self, results):
# 1-based indexes; convert to 0-based
rec_index = int(results['quals']['i'][0]) - 1
read_name = results['rec_id']
results['quals']['read_name'] = [read_name]
results['rec_id'] = self._in_map[rec_index]
return results

remapper = SolidFastaRemap(fasta_file)
gff_iterator = GFFAddingIterator(line_adjust_fn=remapper.adjust_fn)
rec_dict = gff_iterator.get_all_features(gff_file)
[/sourcecode]

This allows you to fix the file during the parsing, saving multiple passes through the file. This general functionality can be used to cleanly deal with any other inconsistencies that crop up during your GFF parsing adventures.

comments powered by Disqus