Date of Thesis

5-7-2015

Thesis Type

Honors Thesis

Degree Type

Bachelor of Science

First Advisor

Brian R. King

Second Advisor

Luiz Felipe Perrone

Abstract

Protein function and structure are intrinsically related. To understand how a protein functions, medical and biological researchers need to determine its three-dimensional structure. The traditional experimental methods, NMR spectroscopy and X-ray crystallography, are inefficient compared to computational methods. Fortunately, substantial progress has been made in bioinformatics on protein structure prediction. Despite these achievements, the computational complexity of protein folding remains a challenge. Many methods therefore aim instead to predict a protein contact map from the protein sequence using machine learning algorithms. In this thesis, we introduce a novel ensemble method for protein contact map prediction based on bagging multiple decision trees. A random sampling method is used to address the large class imbalance in contact maps. To generalize the feature space, we further cluster the amino acid alphabet from twenty letters to ten. Software is also developed to view a protein contact map at a given threshold and separation. The parameters used in the decision trees are determined experimentally, and the overall results for the first L, L/2, and L/5 predictions for a protein of length L are evaluated.
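To make the terms "threshold", "separation", and "first L, L/2 and L/5 predictions" concrete, the following is a minimal sketch and not the thesis's own code. It assumes one representative 3D coordinate per residue, a distance cutoff of 8 Å and a minimum sequence separation of 6 (values commonly used in the contact prediction literature; the abstract does not state which values the thesis uses). The function names `contact_map` and `top_k_precision` are hypothetical.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=6):
    """Return the set of residue pairs (i, j), i < j, that are in contact.

    coords: list of (x, y, z) tuples, one per residue (assumed input format).
    threshold: distance cutoff below which two residues count as in contact.
    min_separation: minimum sequence distance j - i for a pair to be considered.
    """
    contacts = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts.add((i, j))
    return contacts


def top_k_precision(scores, true_contacts, k):
    """Fraction of the k highest-scoring predicted pairs that are true contacts.

    scores: dict mapping a residue pair (i, j) to a predicted contact score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    return sum(pair in true_contacts for pair in ranked) / len(ranked)


# Toy chain: 10 residues on a straight line, 3 units apart (illustrative data only).
coords = [(3.0 * i, 0.0, 0.0) for i in range(10)]
true_contacts = contact_map(coords, threshold=8.0, min_separation=2)

# Hypothetical predictor scores for a handful of residue pairs.
scores = {(0, 2): 0.9, (1, 3): 0.8, (0, 5): 0.7, (2, 4): 0.6, (3, 9): 0.5}

L = len(coords)
precision_L5 = top_k_precision(scores, true_contacts, L // 5)
precision_L2 = top_k_precision(scores, true_contacts, L // 2)
```

In this toy example only pairs two positions apart fall under the cutoff, so the two highest-scoring predictions are both correct while three of the top five are. Evaluating only the highest-ranked L, L/2, or L/5 pairs is useful because true contacts are a small minority of all residue pairs, which is the same class imbalance the abstract mentions.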
