Graphing of variables before classification

Brad Chapman bio photo By Brad Chapman Comment

It is important to be able to graphically evaluate variables that feed into classification algorithms. In addition to assessing the distribution of the data, visual inspection also reveals patterns that can be later tested statistically. This post describes the preparation of a quick histogram for data from positive and negative examples feeding into a classifier. The excellent matplotlib library is used for visualization.

The example involves classifying two sets of proteins based on regional sequence charge. Two FASTA files were prepared, containing positive (active) and negative (non-active) examples. The goal is to look for a difference in charge between the two groups. Given a window size of amino acids to look at, we loop over the records in the file using the Biopython SeqIO interface:

[sourcecode language="python"]
def file_charges(in_file, cur_window):
all_charges = []
with open(in_file) as in_handle:
for rec in SeqIO.parse(in_handle, "fasta"):
cur_charges = calc_region_charges(rec.seq, cur_window)
all_charges.extend(cur_charges)
return all_charges
[/sourcecode]

We also use Biopython to calculate the Isoelectric point of each protein window. This is used as a proxy for charge; higher isoelectric points correspond to higher charges over the region.

[sourcecode language="python"]
def calc_region_charges(seq, cur_window):
# internal small regions, so do not deal with C and N terminal charges
IsoelectricPoint.pKcterminal = {}
IsoelectricPoint.pKnterminal = {}
cur_pos = 0
region_charges = []
while cur_pos < len(seq) - cur_window:
cur_seq = seq[cur_pos:cur_pos + cur_window]
prot_analysis = ProtParam.ProteinAnalysis(str(cur_seq))
ie_calc = IsoelectricPoint.IsoelectricPoint(cur_seq,
prot_analysis.count_amino_acids())
region_charges.append(ie_calc.pi())
cur_pos += 1
return region_charges
[/sourcecode]

With this in place, we get the charges for the example records and plot them side by side as a histogram using matplotlib:

[sourcecode language="python"]
cur_window = 75
pos_charges = file_charges(pos_file, cur_window)
neg_charges = file_charges(neg_file, cur_window)
n, bins, patches = pylab.hist([pos_charges, neg_charges], 30,
normed=True, histtype='bar')
pylab.xlabel('Isoelectric point')
pylab.ylabel('Normalized percent of regions')
pylab.title('Protein charge of %s amino acid windows' % cur_window)
pylab.legend([patches[0][0], patches[1][0]], ['positives', 'negatives'])
pylab.savefig('pos_neg_hist.png')
pylab.show()
[/sourcecode]

The resulting graph shows the different distribution of charge between the positive and negative records. At an isoelectric point just above 10, only sequence windows from the positive examples are found. Looking at the percentage of highly charged sequence regions above this threshold in non-classified sequences could serve as the basis for a classifier.

[caption id="attachment_46" align="alignnone" width="400" caption="Example histogram"]Example histogram[/caption]

The example script puts all of this together, and could be used as a template for classification problems in your own research.

comments powered by Disqus