Automated protein conservation display from BLAST alignments

Pawel at Freelancing Science had an excellent post about making structural predictions from visual analysis of BLAST alignments. This concept inspired me to put together an automated solution to visualize protein conservation. Starting with a protein of interest it will retrieve the relevant identifier from NCBI, do a remote BLAST, examine the alignments to related divergent proteins, calculate conservation and display a plot of the results.

With a protein identifier like a GenBank accession number or UniProt ID, we first need to find the standard NCBI GI number. Using Biopython's Entrez module:

[sourcecode language="python"]
def search_for_gi(self, uniprot_id, db_name):
handle = Entrez.esearch(db=db_name, term=uniprot_id)
record = Entrez.read(handle)
ids = record["IdList"]
if len(ids) == 0:
raise ValueError("Not found in NCBI: %s" % ids)
return ids[0]
[/sourcecode]

Given the GI, a remote BLAST is performed and the XML result is parsed into a record object. This is again done using Biopython libraries, with the BLAST result cached in case of re-runs. If you were using this code for many queries or on a web page presented to scientists, it would make sense to use a local BLAST for speed purposes. This could easily be plugged in, again using Biopython. Here is the remote version:

[sourcecode language="python"]
def remote_blast(self, search_gi, blast_method):
out_file = os.path.join(self._cache_dir, "%s_%s_blo.xml" % (blast_method,
search_gi))
if not os.path.exists(out_file):
blast_handle = NCBIWWW.qblast(blast_method, "nr", search_gi)
with open(out_file, 'w') as out_handle:
for line in blast_handle:
out_handle.write(line)
blast_handle.close()
with open(out_file) as in_handle:
rec_it = NCBIXML.parse(in_handle)
return rec_it.next()
[/sourcecode]

With the parsed record, the next step is to loop over the alignments to calculate conservation across the protein. To provide quantitative results, a protein substitution matrix provides a score for each BLAST alignment character pair. Higher scores indicate a more conserved alignment, with exact matches being the highest scores. We use the BLOSUM62 matrix here, although a wide variety are supported by Biopython. The class below loops through all of the alignments and high scoring pairs (HSP, in BLAST nomenclature), notes the position, and uses the alignment pairs and matrix to assign conservation scores at each position:

[sourcecode language="python"]
class BlastConservationCalculator:
def __init__(self, matrix_name="blosum62"):
self._subs_mat = getattr(MatrixInfo, matrix_name)
self._no_use_thresh = 0.95

def conservation_dict(self, blast_rec):
cons_dict = {}
rec_size = int(blast_rec.query_letters)
for base_index in range(rec_size):
cons_dict[base_index] = []
for align in blast_rec.alignments:
for hsp in align.hsps:
if (float(hsp.identities) / float(rec_size) <=
self._no_use_thresh):
cons_dict = self._add_hsp_conservation(hsp, cons_dict)
return cons_dict

def _add_hsp_conservation(self, hsp, cons_dict):
start_index = int(hsp.query_start) - 1
hsp_index = 0
for q_index in range(len(hsp.query)):
if (hsp.query[q_index] != '-'):
if (hsp.sbjct[q_index] != '-'):
try:
sub_val = self._subs_mat[(hsp.query[q_index],
hsp.sbjct[q_index])]
except KeyError:
sub_val = self._subs_mat[(hsp.sbjct[q_index],
hsp.query[q_index])]
cons_dict[start_index + hsp_index].append(sub_val)
hsp_index += 1
return cons_dict
[/sourcecode]

The result of this work is a dictionary of score conservation at each position. If you plot the average of these scores directly, it results in a very choppy graph which is hard to interpret. Luckily, Andrew Dalke has tackled this problem and presented a detailed writeup of smoothing scores for plotting. Using the Savitzky-Golay technique described there, the smoothed average of the results are plotted using matplotlib:

[sourcecode language="python"]
window_size = 29
data_smoother = SavitzkyGolayDataSmoother(window_size)
pos_data = []
cons_data = []
for pos in indexes:
pos_data.append(pos + 1)
if len(cons_dict[pos]) > 0:
cons_data.append(numpy.median(cons_dict[pos]))
else:
cons_dict.append(0)
smooth_data = data_smoother.smooth_values(cons_data)
smooth_pos_data = pos_data[data_smoother.half_window():
len(pos_data) - data_smoother.half_window()]
pylab.plot(smooth_pos_data, smooth_data)
pylab.axis(xmin=min(pos_data), xmax=max(pos_data))
pylab.xlabel("Amino acid position")
pylab.ylabel("Conservation")
pylab.savefig('%s_conservation.png' % accession.replace(".", "_"))
[/sourcecode]

The resulting plot was prepared for the example from Pawel's post that inspired all this and is shown below. We can see the 4 regions of less conservation noted by Pawel by inspection of the alignment, along with the 3 intervening peaks of conservation:

[caption id="attachment_65" align="alignnone" width="400" caption=""] Example conservation plot [/caption]

The full script puts all of these parts together into a working version that could be used for your own proteins of interest. These plots are useful for getting a quick overview of protein conservation when working with a new protein. They could be used to compare with known regions like domains, to identify likely regions of importance for protein deletion studies, or to make structure evaluations. The intriguing aspect of the plots is the opportunity to quickly evaluate and make predictions for experimental study.

Blue Collar Bioinformatics

Community built tools for biological data analysis

Automated protein conservation display from BLAST alignments