Comparative genomics information retrieval from Ensembl

By Brad Chapman

The Ensembl website provides a powerful front end to genomic data from a wide variety of eukaryotic species. Additionally, the Comparative Genomics initiative provides automated phylogenetic analyses for comprehensive examination of gene families. This post describes retrieving comparative protein details for a human gene of interest using the tried and true method of screen scraping, and returning the data in a form ready to be served through a REST interface.

Several other people have had similar notions for retrieving Ensembl data. Pedro describes an example using openKapow, and Andrew uses Dapper.

Here, we deal with the Ensembl web pages using Beautiful Soup, a Python HTML parser that simplifies extracting information from retrieved web pages. The idea is to generate an interface that could be readily abstracted to a set of REST web pages. This would greatly simplify information retrieval from Ensembl; retrieving a set of orthologs for a gene ID would involve a workflow like:

  • Prepare a URL like:
    http://example.ensembl.org/REST/Compara_Ortholog/Homo_sapiens/ENSG00000173894
  • Parse out a simple text result in CSV:
    Cavia_porcellus,ENSCPOG00000005599
    Danio_rerio,ENSDARG00000044938
    Dasypus_novemcinctus,ENSDNOG00000002668
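
From the client side, the two steps above could be sketched roughly as follows. This assumes the URL layout and CSV format shown in the example; the REST service itself is hypothetical, and the names `build_ortholog_url` and `parse_ortholog_csv` are made up for illustration. The sketch is written for Python 3 and parses a canned response rather than making a network request.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base URL copied from the example above; no such service is implied to exist.
BASE_URL = "http://example.ensembl.org/REST"


def build_ortholog_url(organism, gene_id, base_url=BASE_URL):
    """Build the proposed Compara_Ortholog URL for an organism and gene ID."""
    return "/".join([base_url, "Compara_Ortholog", quote(organism), quote(gene_id)])


def parse_ortholog_csv(text):
    """Turn 'organism,gene_id' lines into a list of (organism, gene_id) tuples."""
    reader = csv.reader(io.StringIO(text))
    # Skip blank or malformed lines that do not have both columns.
    return [(row[0].strip(), row[1].strip()) for row in reader if len(row) >= 2]


# Canned response text using the three example lines from this post.
example_response = (
    "Cavia_porcellus,ENSCPOG00000005599\n"
    "Danio_rerio,ENSDARG00000044938\n"
    "Dasypus_novemcinctus,ENSDNOG00000002668\n"
)

print(build_ortholog_url("Homo_sapiens", "ENSG00000173894"))
for organism, gene_id in parse_ortholog_csv(example_response):
    print(organism, gene_id)
```

In a real client the canned text would be replaced by the body of an HTTP response fetched from the URL.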
    

For queries that can be expressed with a few inputs and readily understandable outputs, this would provide programmatic access to Ensembl data without the overhead of installing the Perl API. Below is a function which retrieves the pages, parses them with Beautiful Soup, and returns the simplified information. Wrapping this into the REST interface described above would require adding a layer on top using a Python web framework like Pylons.

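[sourcecode language="python"]
# Excerpt from a class; relies on re and BeautifulSoup being imported
# at module level.
def orthologs(self, organism, gene_id):
    """Retrieve a list of orthologs for the given gene ID.
    """
    orthologs = []
    # Fetch the Compara_Ortholog page for this gene.
    with self._get_open_handle("Gene", "Compara_Ortholog",
            organism, gene_id) as in_handle:
        soup = BeautifulSoup(in_handle)
        # Orthologs are listed in the table with class "orthologues".
        orth_table = soup.find("table", "orthologues")
        orth_links = orth_table.findAll("a",
                href = re.compile("Gene/Summary"))
        for orth_link in orth_links:
            # The first non-empty path component of the link is used
            # as the organism name; the link text is the gene ID.
            href_parts = [x for x in orth_link['href'].split('/') if x]
            orthologs.append((href_parts[0], orth_link.string))
    return orthologs
[/sourcecode]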
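
As a rough idea of what the REST layer on top might look like, here is a minimal sketch using a plain WSGI callable from the standard library conventions rather than Pylons. The URL layout follows the example earlier in the post; `make_app` and `fake_orthologs` are hypothetical names, and `fake_orthologs` returns canned data in place of a call to the `orthologs` method above. It is written for Python 3.

```python
def fake_orthologs(organism, gene_id):
    """Hypothetical stand-in for the orthologs() method; returns canned data."""
    return [("Cavia_porcellus", "ENSCPOG00000005599"),
            ("Danio_rerio", "ENSDARG00000044938")]


def make_app(lookup):
    """Build a WSGI app serving /REST/Compara_Ortholog/<organism>/<gene_id> as CSV."""
    def app(environ, start_response):
        # Split the request path into its non-empty components.
        parts = [p for p in environ.get("PATH_INFO", "").split("/") if p]
        if len(parts) != 4 or parts[:2] != ["REST", "Compara_Ortholog"]:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"Unknown resource\n"]
        organism, gene_id = parts[2], parts[3]
        # One "organism,gene_id" line per ortholog returned by the lookup.
        lines = ["%s,%s" % (org, gid) for org, gid in lookup(organism, gene_id)]
        body = ("\n".join(lines) + "\n").encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8"),
                                   ("Content-Length", str(len(body)))])
        return [body]
    return app


app = make_app(fake_orthologs)
```

The callable can be served locally with `wsgiref.simple_server.make_server("localhost", 8000, app).serve_forever()`, or mounted in any WSGI-compatible framework. Passing the lookup function in as an argument keeps the web layer separate from the scraping code.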

The full example script takes an organism and gene ID as input and displays homologs in Ensembl along with distance from the initial organism, protein domains, and features of the protein. The script uses Biopython, the Newick Tree Parser to parse phylogenetic trees, and NetworkX to calculate phylogenetic distances from the tree.

> python2.5 ensembl_remote_rest.py Homo_sapiens ENSG00000173894
Homo_sapiens 0 [u'IPR016197', u'IPR000953'] 38.0
Pan_troglodytes 0.009 [u'IPR016197', u'IPR000637', u'IPR000953'] 38.0
Gorilla_gorilla 0.0169 [u'IPR016197', u'IPR000637', u'IPR000953'] 
Macaca_mulatta 0.0538 [u'IPR000637', u'IPR000953'] 36.0
Tarsius_syrichta 0.1622 [u'IPR000637'] 
Microcebus_murinus 0.1848 [u'IPR000637', u'IPR000953'] 33.5

This demonstrates using the framework to look at the change in domains and protein charge across a gene family. The general idea of the EnsemblComparaRest class could be applied to other information of interest available from Ensembl web pages.
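
The distance reported for each organism comes from the phylogenetic tree: it is the sum of branch lengths along the path between the starting organism and each other leaf. The script does this with NetworkX; the sketch below is a dependency-free stand-in for the same idea, not the script's own code. The function name `tree_distances`, the node names, and the branch lengths are all made up for illustration.

```python
from collections import defaultdict


def tree_distances(edges, start):
    """Sum branch lengths along the unique path from start to every other node."""
    # Treat the tree as an undirected weighted graph.
    graph = defaultdict(list)
    for a, b, length in edges:
        graph[a].append((b, length))
        graph[b].append((a, length))
    distances = {start: 0.0}
    stack = [start]
    # Depth-first walk; in a tree each node is reachable by exactly one path,
    # so the first distance recorded for a node is its path distance.
    while stack:
        node = stack.pop()
        for neighbor, length in graph[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + length
                stack.append(neighbor)
    return distances


# Toy tree with invented branch lengths (not Ensembl data).
toy_edges = [
    ("root", "anc1", 0.5),
    ("anc1", "species_A", 0.25),
    ("anc1", "species_B", 0.25),
    ("root", "species_C", 1.0),
]
distances = tree_distances(toy_edges, "species_A")
for name in ("species_A", "species_B", "species_C"):
    print(name, distances[name])
```

With real data, the edge list would be built from the parsed Newick tree, using each branch length as the edge weight.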
