Extracting protein characteristics for an InterPro domain

The InterPro database provides a common language and organization for characterized protein domains. Here, the goal is to extract all proteins containing a domain of interest, limit these proteins to those in the animal (Metazoa) kingdom, and extract information about the resulting proteins. Protein information will be retrieved from the UniProt database.

The first step is identifying the proteins of interest with a given domain. An example is the chromo domain motif with InterPro identifier IPR000953. InterPro provides a useful REST interface, allowing us to download the full list of related proteins in FASTA format and parse them using the Biopython SeqIO interface. For our example, this provides over 2500 records; here is a snippet of the class function that does this work.

[sourcecode language="python"]
def get_interpro_records(self, ipr_number):
url_base = "%s/interpro/ISpy?ipr=%s&mode=fasta"
full_url = url_base % (self._server, ipr_number)
recs = []
with self._get_open_handle(full_url) as in_handle:
for rec in SeqIO.parse(in_handle, "fasta"):
recs.append(rec)
return recs
[/sourcecode]

UniProt provides an excellent REST interface and XML format which help simplify the retrieval process. Parsing the XML records with the python ElementTree parser allows us to quickly pull out the organism name and evolutionary lineage.

[sourcecode language="python"]
import xml.etree.ElementTree as ET

def get_xml_metadata(self, uniprot_id):
url_base = "%s/uniprot/%s.xml"
full_url = url_base % (self._server, uniprot_id)
metadata = {}
with self._get_open_handle(full_url) as in_handle:
root = ET.parse(in_handle).getroot()
org = root.find("%sentry/%sorganism" % (self._xml_ns, self._xml_ns))
for org_node in org:
if org_node.tag == "%sname" % self._xml_ns:
if org_node.attrib["type"] == "scientific":
metadata["org_scientific_name"] = org_node.text
elif org_node.attrib["type"] == "common":
metadata["org_common_name"] = org_node.text
elif org_node.tag == "%slineage" % self._xml_ns:
metadata["org_lineage"] = [n.text for n in org_node]
return metadata
[/sourcecode]

Putting all the parts together, we loop over the list of Biopython sequence record objects, extract the UniProt ID, retrieve the metadata from UniProt, and store this in a local python shelve database:

[sourcecode language="python"]
cache_dir = os.path.join(os.getcwd(), "cache")
db_dir = os.path.join(os.getcwd(), "db")
interpro_retriever = InterproRestRetrieval(cache_dir)
uniprot_retriever = UniprotRestRetrieval(cache_dir)
cur_db = shelve.open(os.path.join(db_dir, ipr_number))
seq_recs = interpro_retriever.get_interpro_records(ipr_number)
for seq_rec in seq_recs:
uniprot_id = seq_rec.id.split("|")[0]
metadata = uniprot_retriever.get_xml_metadata(uniprot_id)
if (metadata.has_key("org_lineage") and
"Metazoa" in metadata["org_lineage"]):
metadata["seq"] = seq_rec.seq.data
cur_db[uniprot_id] = metadata
cur_db.close()
[/sourcecode]

The data is stored as a dictionary attached to the individual UniProt identifiers. This is modeled after the boto SimpleDB library, which provides a python interface to storage in Amazon's SimpleDB.

All of these code snippets in action can be found in this example script, which helps place the code sections above in context. In future weeks, we will try and pull some interesting information from the protein domain families.

Blue Collar Bioinformatics

Community built tools for biological data analysis

Extracting protein characteristics for an InterPro domain