Exploring BioPerl GenBank to GFF mapping

By Brad Chapman

A mailing list message from Peter about importing GFF files to BioSQL inspired me to take a look at how BioPerl treats GFF files. Generic Feature Format (GFF) is a plain text file format used to represent annotations and features on biological sequences. It is a nice biological file format:

  • Parsed relatively easily.
  • Human readable and editable in Excel.
  • Quickly understood at a basic level.
  • Flexible and adaptable. GFF3, the current version, resolves a number of incompatibility issues that arose in GFF2.
  • Widely used.

BioSQL is a relational database model that stores annotations and features on sequences. As Peter implies, having a general mapping between the two would facilitate plain text database dumps from BioSQL databases in GFF. Conversely, GFF formatted files could be loaded directly into BioSQL databases.

The BioSQL object model maps very closely to the GenBank file format, so a good way to examine how BioPerl's GFF output would map to BioSQL is to produce GFF from a GenBank file. The BioPerl distribution contains a script to do exactly this:


bp_genbank2gff3.pl -out stdout cbx8.gb > cbx8.gff

Starting with this straightforward GenBank file, the above command produces a GFF file that I explore below. GFF files are structured as nine tab-delimited columns. The first eight identify the sequence and the source of the annotation, give a Sequence Ontology term for the type of feature, and describe its exact location, score, strand and phase. The final column is a set of key-value pairs with the annotation data. For example, here is a line from our output file:

NM_001078975    GenBank gene    1       1847    .       +       .       ID=cbx8;Dbxref=GeneID:779897;Note=chromobox homolog 8;gene=cbx8;gene_synonym=MGC147589

This maps directly to the corresponding feature in the original GenBank table:

     gene            1..1847
                     /gene="cbx8"
                     /gene_synonym="MGC147589"
                     /note="chromobox homolog 8"
                     /db_xref="GeneID:779897"

This is a nice one-to-one mapping of the GenBank feature table. The ontology for mapping feature keys to Sequence Ontology terms was discussed in more detail in an earlier post on BioSQL ontologies. Here, qualifier names map to capitalized standard keys where possible (Note, Dbxref) and keep their all-lowercase names where they do not correspond to a standard term. For BioSQL, these GFF lines would map directly into the seqfeature table, with a dictionary to provide the back-and-forth mapping between standard terms and qualifier names.
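To make the mapping concrete, here is a minimal Python sketch that splits the GFF line above into its columns and renames the standard keys back to GenBank qualifier names. It uses only the standard library. The function names and the `GFF_TO_GENBANK` dictionary are my own illustrations, not part of BioPerl or BioSQL, and dropping the `ID` key on the way back is an assumption on my part, since the GenBank feature has no matching qualifier.

```python
from urllib.parse import unquote

# Names of the eight fixed GFF3 columns that precede the attributes column.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase")

# Hypothetical lookup from standard GFF keys to GenBank qualifier names.
# Only the two standard keys seen in the example line are listed.
GFF_TO_GENBANK = {"Note": "note", "Dbxref": "db_xref"}


def parse_gff3_line(line):
    """Split one GFF3 feature line into fixed columns plus attributes."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 9:
        raise ValueError("expected 9 tab-delimited columns, got %d" % len(parts))
    feature = dict(zip(GFF3_COLUMNS, parts[:8]))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    attributes = {}
    for pair in parts[8].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # A key can carry several comma-separated, URL-escaped values.
        attributes[key] = [unquote(v) for v in value.split(",")]
    feature["attributes"] = attributes
    return feature


def to_genbank_qualifiers(attributes):
    """Rename standard GFF keys to GenBank qualifier names (ID is skipped)."""
    return {GFF_TO_GENBANK.get(key, key): values
            for key, values in attributes.items() if key != "ID"}


# The example line from the output file, rebuilt with real tab characters.
line = "\t".join([
    "NM_001078975", "GenBank", "gene", "1", "1847", ".", "+", ".",
    "ID=cbx8;Dbxref=GeneID:779897;Note=chromobox homolog 8;"
    "gene=cbx8;gene_synonym=MGC147589",
])
feature = parse_gff3_line(line)
qualifiers = to_genbank_qualifiers(feature["attributes"])
```

The resulting `qualifiers` dictionary holds the same four qualifiers as the GenBank feature, which is the kind of lookup a BioSQL loader would need in both directions.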

The less straightforward part of the mapping involves the high-level annotations that describe the entire sequence. These correspond to the header section in the GenBank file and map to several specialized tables in the BioSQL schema. Here is a summary of the current mappings in BioPerl GFF:

GenBank | BioSQL table | Current BioPerl GFF | Proposed GFF key/value
--------|--------------|---------------------|-----------------------
LOCUS; identifier, ACCESSION | bioentry (name) | ID |
LOCUS; Molecule type | biosequence (alphabet) | mol_type |
LOCUS; division | bioentry (division) | | division
LOCUS; date | bioentry_qualifier_value (term date_changed) | date |
DEFINITION | bioentry (description) | Note, but combined with COMMENT | description
VERSION | bioentry (accession and version) | | hasVersion
GI | bioentry (identifier) | | identifier
KEYWORDS | bioentry_qualifier_value (term keywords) | | subject
SOURCE and ORGANISM | taxon | organism and Dbxref to taxon ID | Full lineage needs representation as well
REFERENCE | reference | | Dbxref for PubMed IDs; need to store full reference information as well
COMMENT | comment | comment1 and Note, combined with DEFINITION | comment1 only
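As a rough illustration of how a converter might use this summary, the sketch below stores the current and proposed keys as a Python dictionary and builds the key-value column for a sequence-level GFF line. Everything here is hypothetical: the dictionary name, the field labels and the rule that an empty proposed cell means "keep the current key" are my assumptions, and the header values in the example are invented placeholders, not data from cbx8.gb. The SOURCE/ORGANISM and REFERENCE rows are left out because their proposed cells are open questions, not keys.

```python
from urllib.parse import quote

# GenBank header field -> (current BioPerl GFF key, proposed GFF key).
# None marks an empty cell in the summary above.
HEADER_TO_GFF = {
    "LOCUS identifier": ("ID", None),
    "LOCUS molecule type": ("mol_type", None),
    "LOCUS division": (None, "division"),
    "LOCUS date": ("date", None),
    "DEFINITION": ("Note", "description"),
    "VERSION": (None, "hasVersion"),
    "GI": (None, "identifier"),
    "KEYWORDS": (None, "subject"),
    "COMMENT": ("comment1", "comment1"),
}


def gff_key(header_field):
    """Pick the proposed key, falling back to the current one (assumption)."""
    current, proposed = HEADER_TO_GFF[header_field]
    return proposed or current


def header_attributes(header):
    """Build a GFF attribute string from a dict of header field values."""
    pairs = []
    for field, value in header.items():
        # Escape characters such as ';', '=' and ',' that have special
        # meaning in the attributes column; spaces and colons are left alone.
        pairs.append("%s=%s" % (gff_key(field), quote(value, safe=" :")))
    return ";".join(pairs)


# Placeholder values for illustration only.
example_header = {
    "LOCUS identifier": "NM_001078975",
    "KEYWORDS": "first keyword; second keyword",
}
example_attributes = header_attributes(example_header)
```

Keeping both columns in one structure makes it easy to switch a converter from the current BioPerl names to whatever consensus emerges.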

Most of the major mappings are in place, with some naming refinement needed. The most complicated outstanding aspect would be storing the reference journal information. Someone more familiar with GFF may be able to offer a solution that has been used previously. My guess at this point is that each reference would be a separate GFF line item, with key/value pairs for the authors, title and other information.
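To show what that guess could look like, here is a small Python sketch that writes one reference as its own GFF line. This is not an established convention: the key names `authors` and `title`, the `region` feature type and the function name are placeholders I made up, and the reference values are dummy data. The only piece taken from the summary above is the use of Dbxref for the PubMed ID.

```python
from urllib.parse import quote


def reference_to_gff(seqid, start, end, reference,
                     source="GenBank", feature_type="region"):
    """Format one reference as a nine-column GFF line (hypothetical layout)."""
    attributes = ";".join(
        # Multiple values for one key are comma separated; each value is
        # escaped so that ';', '=' and ',' inside it cannot break parsing.
        "%s=%s" % (key, ",".join(quote(v, safe=" :") for v in values))
        for key, values in reference.items()
    )
    columns = [seqid, source, feature_type, str(start), str(end),
               ".", ".", ".", attributes]
    return "\t".join(columns)


# Dummy reference data for illustration only.
example_reference = {
    "authors": ["Author A", "Author B"],
    "title": ["Placeholder title; with a semicolon"],
    "Dbxref": ["PMID:0000000"],
}
reference_line = reference_to_gff("NM_001078975", 1, 1847, example_reference)
```

A line like this spans the whole sequence, so a loader could attach the parsed key-value pairs to the BioSQL reference table instead of treating them as an ordinary sequence feature.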

Overall, GFF offers a nice flat file output format for BioSQL databases. Much of the mapping from GFF to BioSQL is in place currently in BioPerl, with consensus needed for the missing parts. With that established, the other languages that support BioSQL can follow the BioPerl mapping. In my view, being able to round-trip between GFF flat files and the BioSQL relational database would help drive usage of both.

Edit: James Procter put together a BioSQL wiki page to help specify the mapping. Please contribute there and ask questions on the BioSQL mailing list.
