Exploring BioPerl GenBank to GFF mapping

A mailing list message from Peter about importing GFF files to BioSQL inspired me to take a look at how BioPerl treats GFF files. Generic Feature Format (GFF) is a plain text file format used to represent annotations and features on biological sequences. It is a nice biological file format:

Parsed relatively easily.
Human readable and editable in Excel.
Quickly understood at a basic level.
Flexible and adaptable. GFF3, the current version of the format, resolves a number of incompatibility issues that arose with GFF2.
Widely used.

BioSQL is a relational database model that stores annotations and features on sequences. As Peter implies, having a general mapping between the two would facilitate plain text database dumps from BioSQL databases in GFF. Conversely, GFF formatted files could be loaded directly into BioSQL databases.

The BioSQL object model maps very closely to the GenBank file format, so a good way to examine the GFF to BioSQL mapping is to produce GFF from a GenBank file. The BioPerl distribution contains a script to do exactly this:

bp_genbank2gff3.pl -out stdout cbx8.gb > cbx8.gff

Starting with this straightforward GenBank file, the above command produces a GFF file that I will explore in more detail below. GFF files are structured as nine tab-delimited columns. The first eight columns describe the exact sequence location and contain a Sequence Ontology term describing the relationship between the annotation and the sequence region. The ninth and final column is a set of key-value pairs with the annotation data. For example, here is a line from our output file:

NM_001078975	GenBank	gene	1	1847	.	+	.	ID=cbx8;Dbxref=GeneID:779897;Note=chromobox homolog 8;gene=cbx8;gene_synonym=MGC147589

This maps directly to the corresponding feature in the feature table of the original GenBank file:

     gene            1..1847
                     /gene="cbx8"
                     /gene_synonym="MGC147589"
                     /note="chromobox homolog 8"
                     /db_xref="GeneID:779897"

This is a nice one-to-one mapping of the GenBank feature table. The ontology for mapping feature keys to Sequence Ontology terms was discussed in more detail in an earlier post on BioSQL ontologies. Here, the qualifier names map to capitalized standard keys where possible (Note, Dbxref) and to all-lowercase names where they do not correspond to a standard term. For BioSQL, these GFF lines would map directly into the seqfeature table, with a dictionary to provide the back and forth mapping between standard terms and qualifier names.
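
To make the column layout and the qualifier dictionary concrete, here is a minimal Python sketch (Python rather than Perl, standard library only) that splits the gene line shown earlier into its nine columns and translates the attribute keys back to GenBank qualifier names. The names parse_gff_line, to_qualifiers and GFF_TO_QUALIFIER are hypothetical and are not part of BioPerl or BioSQL. The dictionary only covers the two standard keys that appear in this example, so treat it as an illustration of the idea rather than a complete mapping.

```python
from urllib.parse import unquote

# Hypothetical lookup from GFF3 standard keys to GenBank qualifier names.
# Only the two standard keys seen in the example line are covered.
GFF_TO_QUALIFIER = {"Note": "note", "Dbxref": "db_xref"}

COLUMNS = ["seqid", "source", "type", "start", "end",
           "score", "strand", "phase"]


def parse_gff_line(line):
    """Split one GFF3 feature line into its columns and attribute pairs."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 9:
        raise ValueError("expected 9 tab-delimited columns, got %d" % len(parts))
    feature = dict(zip(COLUMNS, parts[:8]))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    attributes = {}
    for pair in parts[8].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # A value may hold several comma-separated, percent-encoded entries.
        attributes[key] = [unquote(v) for v in value.split(",")]
    feature["attributes"] = attributes
    return feature


def to_qualifiers(attributes):
    """Rename standard GFF3 keys to GenBank qualifier names.

    ID is skipped because the feature table shown above has no
    matching qualifier for it.
    """
    return {GFF_TO_QUALIFIER.get(key, key): values
            for key, values in attributes.items() if key != "ID"}


line = "\t".join([
    "NM_001078975", "GenBank", "gene", "1", "1847", ".", "+", ".",
    "ID=cbx8;Dbxref=GeneID:779897;Note=chromobox homolog 8;"
    "gene=cbx8;gene_synonym=MGC147589",
])
feature = parse_gff_line(line)
qualifiers = to_qualifiers(feature["attributes"])
print(feature["type"], feature["start"], feature["end"])
for name, values in qualifiers.items():
    print("/%s=%s" % (name, values))
```

The lowercase keys (gene, gene_synonym) pass through unchanged, which mirrors the convention described above: only the standard terms need an entry in the dictionary.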

The less straightforward part of the mapping involves the high-level annotations which describe the entire sequence. These correspond to the header section in the GenBank file and map to several specialized tables in the BioSQL schema. Here is a summary of the current mappings in BioPerl GFF:

GenBank	BioSQL table	Current BioPerl GFF	Proposed GFF key/value
LOCUS; identifier ACCESSION	bioentry name	ID
LOCUS; Molecule type	biosequence alphabet	mol_type
LOCUS; division	bioentry division		division
LOCUS; date	bioentry_qualifier_value term date_changed	date
DEFINITION	bioentry description	Note, but combined with COMMENT	description
VERSION	bioentry accession and version		hasVersion
GI	bioentry identifier		identifier
KEYWORDS	bioentry_qualifier_value term keywords		subject
SOURCE and ORGANISM	taxon	organism and Dbxref to taxon ID	Full lineage needs representation as well
REFERENCE	reference		Dbxref for PubMed IDs; need to store full reference information as well
COMMENT	comment	comment1 and Note, combined with DEFINITION	comment1 only
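
The table above can be read as a simple lookup. The following Python sketch is one hedged way to express it, taking the proposed key where the table gives one and the current BioPerl key otherwise. The HEADER_TO_GFF dictionary and the header_attributes function are hypothetical names, the field labels on the GenBank side are my own shorthand for the table rows, and the description and keyword values are placeholders rather than data from the cbx8 record. The taxon and reference rows are left out because the table does not yet settle on a key for them.

```python
# Hypothetical lookup built from the table: proposed key where one is
# given, otherwise the key BioPerl currently writes.
HEADER_TO_GFF = {
    "locus_identifier": "ID",
    "molecule_type": "mol_type",
    "division": "division",
    "date": "date",
    "definition": "description",
    "version": "hasVersion",
    "gi": "identifier",
    "keywords": "subject",
    "comment": "comment1",
}


def header_attributes(header):
    """Build a GFF3 attribute string from a dict of GenBank header fields.

    Values are written as given; escaping of reserved characters is
    not handled in this sketch.
    """
    pairs = []
    for field, gff_key in HEADER_TO_GFF.items():
        if field in header:
            pairs.append("%s=%s" % (gff_key, header[field]))
    return ";".join(pairs)


# Placeholder values for illustration only.
example = {
    "locus_identifier": "NM_001078975",
    "definition": "placeholder description",
    "keywords": "placeholder keyword",
}
attribute_string = header_attributes(example)
print(attribute_string)
```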

Most of the major mappings are in place, with some naming refinement needed. The most complicated outstanding aspect would be storing the reference journal information. Someone more familiar with GFF may be able to offer a solution that has been used previously. My guess at this point is that each reference would be a separate GFF line item, with key/value pairs for the authors, title and other information.
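
To picture that guess, here is a small Python sketch that writes one reference as its own GFF line. Everything specific in it is an assumption: the feature type "reference" is a placeholder rather than a Sequence Ontology term, the attribute keys authors and title are made up for illustration, the reference is assumed to span bases 1 to 1847, and the author, title and PubMed values are placeholders. The one part taken from the GFF3 specification is that reserved characters in attribute values (semicolon, equals, ampersand, comma, percent, tab and line breaks) are percent-encoded.

```python
# Characters that must be percent-encoded inside GFF3 attribute values.
RESERVED = set(";=&,%\t\n\r")


def encode_value(value):
    """Percent-encode characters that are reserved in GFF3 attributes."""
    return "".join("%%%02X" % ord(ch) if ch in RESERVED else ch
                   for ch in value)


def reference_line(seqid, start, end, fields):
    """Format one reference as a nine-column GFF line.

    The "reference" type is a hypothetical placeholder, not an
    established Sequence Ontology term.
    """
    attributes = ";".join("%s=%s" % (key, encode_value(value))
                          for key, value in fields.items())
    columns = [seqid, "GenBank", "reference", str(start), str(end),
               ".", ".", ".", attributes]
    return "\t".join(columns)


# Placeholder reference details for illustration only.
line = reference_line("NM_001078975", 1, 1847, {
    "authors": "Author A; Author B",
    "title": "Placeholder title, with a comma",
    "Dbxref": "PMID:0000000",
})
print(line)
```

Encoding the values matters here more than for the feature qualifiers, since author lists and titles routinely contain commas and semicolons that would otherwise be read as separators.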

Overall, GFF offers a nice flat file output format for BioSQL databases. Much of the mapping from GFF to BioSQL is in place currently in BioPerl, with consensus needed for the missing parts. With that established, the other languages that support BioSQL can follow the BioPerl mapping. In my view, being able to round-trip between GFF flat files and the BioSQL relational database would help drive usage of both.

Edit: James Procter put together a BioSQL wiki page to help specify the mapping. Please help contribute there and ask questions on the BioSQL mailing list.

Blue Collar Bioinformatics

Community built tools for biological data analysis
