Finding proteins with related function using semantic clustering

By Brad Chapman

An immediate goal when investigating a protein of interest is to identify related proteins. There are several ways you can go about this:

  • Search for proteins by sequence similarity, using BLAST or other tools.
  • Search for proteins with similar characteristics, using features like InterPro domains.
  • Search for proteins determined to have similar functionality, using the literature.

Here we will automate the literature-based approach using Zemanta. Previously, we looked at how well Zemanta extracts biological information from gene descriptions. The same idea can be used to classify UniProt functional descriptions and find proteins with functionality similar to a target of interest.

In this example, we will look for proteins similar to a Polycomb chromatin remodeling protein in mouse. As our base, we build a query for all mouse proteins in UniProt classified as repressors: organism:"Mus musculus [10090]" AND keyword:"Repressor [678]". This returns 350 records, which can be downloaded from the UniProt web interface as a tab-delimited file.
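The same results can also be pulled down programmatically. Here is a minimal sketch using a REST-style tab-delimited download from UniProt; the URL and the download_uniprot_tab helper are illustrative assumptions rather than part of the script described below:

[sourcecode language="python"]
# Sketch of fetching the query results programmatically; the UniProt
# REST URL and this helper are assumptions for illustration only.
import urllib
import urllib2

def download_uniprot_tab(query):
    params = urllib.urlencode({'query': query, 'format': 'tab'})
    url = "http://www.uniprot.org/uniprot/?%s" % params
    return urllib2.urlopen(url).read()

repressor_table = download_uniprot_tab(
    'organism:"Mus musculus [10090]" AND keyword:"Repressor [678]"')
[/sourcecode]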

Next, Zemanta is used to extract keywords linked to either Wikipedia or Freebase. Starting with our UniProt XML retriever, we get the functional descriptions:

[sourcecode language="python"]
def get_description_terms(retriever, cur_id, api_key):
    """Retrieve Zemanta keywords for a UniProt record's functional description."""
    metadata = retriever.get_xml_metadata(cur_id)
    if metadata.has_key("function_descr"):
        keywords = zemanta_link_kws(metadata["function_descr"], api_key)
        if len(keywords) > 0:
            return keywords
    return []
[/sourcecode]
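The retriever comes from the earlier post on parsing UniProt XML. For readers without that code handy, here is a minimal sketch of what such a retriever could look like; the class name and parsing details are illustrative assumptions based on the standard UniProt XML layout, not the original implementation:

[sourcecode language="python"]
# Illustrative stand-in for the UniProt XML retriever from the earlier
# post; only the function description is parsed out here.
import urllib2
from xml.etree import ElementTree

class UniprotXmlRetriever:
    def get_xml_metadata(self, uniprot_id):
        url = "http://www.uniprot.org/uniprot/%s.xml" % uniprot_id
        root = ElementTree.fromstring(urllib2.urlopen(url).read())
        ns = "{http://uniprot.org/uniprot}"
        metadata = {}
        for comment in root.findall("%sentry/%scomment" % (ns, ns)):
            if comment.get("type") == "function":
                text = comment.find("%stext" % ns)
                if text is not None:
                    metadata["function_descr"] = text.text
        return metadata
[/sourcecode]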

The functional descriptions are then fed into Zemanta, which extracts the keywords:

[sourcecode language="python"]
import urllib
import urllib2
import simplejson

def zemanta_link_kws(search_text, api_key):
    """Query the Zemanta API and collect Wikipedia and Freebase link keywords."""
    gateway = 'http://api.zemanta.com/services/rest/0.0/'
    args = {'method': 'zemanta.suggest',
            'api_key': api_key,
            'text': search_text,
            'return_categories': 'dmoz',
            'return_images': 0,
            'return_rdf_links': 1,
            'format': 'json'}
    args_enc = urllib.urlencode(args)
    raw_output = urllib2.urlopen(gateway, args_enc).read()
    output = simplejson.loads(raw_output)
    link_kws = []
    for link in output['markup']['links']:
        for target in link['target']:
            if target['type'] in ['wikipedia', 'rdf']:
                link_kws.append(target['title'])
    return list(set(link_kws))
[/sourcecode]
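Looping these two functions over the downloaded UniProt IDs gives the ID-to-keyword dictionary used in the rest of the post. Here is a sketch of that glue code, assuming uniprot_ids holds the identifiers read from the tab-delimited file; the build_keyword_db name is illustrative:

[sourcecode language="python"]
# Sketch of tying the pieces together into the id -> keywords dictionary
# (cur_db) that is clustered below; uniprot_ids is assumed to come from
# the tab-delimited UniProt download.
def build_keyword_db(retriever, uniprot_ids, api_key):
    cur_db = {}
    for cur_id in uniprot_ids:
        keywords = get_description_terms(retriever, cur_id, api_key)
        if keywords:
            cur_db[cur_id] = keywords
    return cur_db
[/sourcecode]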

With a list of UniProt IDs and keywords organized into a dictionary, we then build a binary matrix of terms for clustering. This approach is described in the excellent book Programming Collective Intelligence. The resulting matrix has UniProt IDs as rows and keyword terms as columns. Each value is 1 if the term is among the Zemanta-extracted keywords for that UniProt ID, and 0 otherwise:

[sourcecode language="python"]
import operator
import collections
import numpy

def organize_term_array(cur_db):
    """Build a binary UniProt ID by keyword term matrix from the dictionary."""
    all_terms = reduce(operator.add, cur_db.values())
    term_counts = collections.defaultdict(lambda: 0)
    for term in all_terms:
        term_counts[term] += 1
    all_terms = list(set(all_terms))
    term_matrix = []
    all_ids = []
    for uniprot_id, cur_terms in cur_db.items():
        cur_row = [(1 if t in cur_terms else 0) for t in all_terms]
        term_matrix.append(cur_row)
        all_ids.append(uniprot_id)
    return numpy.array(term_matrix), all_ids
[/sourcecode]

Finally, this matrix is fed into the Pycluster module in Biopython for k-means clustering. The clusters are examined, and information on the other IDs clustered with our target protein is printed:

[sourcecode language="python"]
import collections
from Bio import Cluster

cluster_ids, error, nfound = Cluster.kcluster(term_matrix,
        nclusters=10, npass=20, method='a', dist='e')
# group the UniProt IDs by their assigned cluster
cluster_dict = collections.defaultdict(lambda: [])
for i, cluster_id in enumerate(cluster_ids):
    cluster_dict[cluster_id].append(uniprot_ids[i])
# print the IDs and keywords found in the same cluster as the target
for cluster_group in cluster_dict.values():
    if target_id in cluster_group:
        for item in cluster_group:
            print item, cur_db[item]
[/sourcecode]

The full script shows all of these parts tied together. For our example, Zemanta pulls out the following keyword list: ['Polycomb-group proteins', 'Homeotic gene', 'Histone', 'Lys (department)', 'Chromatin', 'Histone H2A'] and clustering identifies 19 similar proteins. This could be presented through a web interface into which a scientist enters a protein of interest and gets back the resulting list for manual inspection.

Conceptually, this automated approach is similar to an expert searching through the literature. Here, we are virtually clicking through Wikipedia links and noting similarities. By leveraging general-purpose tools like Zemanta, we avoid having to build a science-specific tool for this purpose. This is an additional argument for adopting general tools like Wikipedia for scientific annotation.
