Date of Thesis



Protein function and structure are intrinsically related. To understand how a protein functions, medical and biological researchers need to determine its three-dimensional structure. Traditional experimental methods, such as NMR spectroscopy and X-ray crystallography, are inefficient compared to computational methods. Fortunately, substantial progress has been made in computational protein structure prediction. Despite these achievements, the computational complexity of protein folding remains a challenge. Instead of predicting the full structure directly, many methods aim to predict a protein contact map from the protein sequence using machine learning algorithms. In this thesis, we introduce a novel ensemble method for protein contact map prediction based on bagging multiple decision trees. A random sampling method is used to address the large class imbalance in contact maps. To generalize the feature space, we further cluster the amino acid alphabet from twenty letters to ten. We also developed software for viewing a protein contact map at a given threshold and separation. The parameters used in the decision trees are determined experimentally, and the overall results are evaluated on the first L, L/2, and L/5 predictions for a protein of length L.
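
As an illustration of the contact map and evaluation terms used in the abstract, the following is a minimal Python sketch, not the thesis software. It assumes one representative atom coordinate per residue, a distance cutoff of 8 angstroms, and a separation measured in sequence positions. These values and the function names `contact_map` and `top_k_precision` are hypothetical choices for illustration and are not taken from the thesis.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=6):
    """Return the set of residue pairs (i, j), i < j, treated as contacts.

    coords: list of (x, y, z) tuples, one representative atom per residue.
    threshold: distance cutoff in angstroms (assumed value).
    min_separation: minimum |i - j| in sequence positions (assumed value).
    """
    contacts = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts.add((i, j))
    return contacts


def top_k_precision(scores, true_contacts, length, divisor=1):
    """Fraction of the L/divisor highest-scoring pairs that are true contacts.

    scores: dict mapping a pair (i, j) to a predicted contact score.
    length: protein length L; divisor 1, 2, or 5 gives L, L/2, or L/5.
    """
    k = max(1, length // divisor)
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for pair in ranked if pair in true_contacts)
    return hits / len(ranked)


# Toy example: four residues on a line, 3 angstroms apart.
toy_coords = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (9, 0, 0)]
toy_contacts = contact_map(toy_coords, threshold=8.0, min_separation=2)
toy_scores = {(0, 2): 0.9, (0, 3): 0.8, (1, 3): 0.1}
toy_precision = top_k_precision(toy_scores, toy_contacts, length=4, divisor=2)
```

In the toy example, the pairs (0, 2) and (1, 3) are contacts, and the top L/2 = 2 scored pairs contain one of them, so `toy_precision` is 0.5. Scores in a real setting would come from a trained predictor, such as the bagged decision trees described above.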


Machine learning, Data mining, Protein contact map, Decision trees, Bagging

Access Type

Honors Thesis

Degree Type

Bachelor of Science


Computer Science

Second Major


First Advisor

Brian R. King

Second Advisor

Luiz Felipe Perrone