Ubiquitination Sites Prediction
Dah Mee Ko
Advisor: Dr. Predrag Radivojac
School of Informatics
Indiana University
May 22, 2009
Outline
Ubiquitination
Machine Learning
Decision Tree
Support Vector Machines
Prediction of ubiquitination sites
Influence of sequence
Influence of structure
Influence of evolutionary consideration
Ubiquitin
A small protein that occurs in all eukaryotic cells.
Highly conserved among eukaryotic species.
Consists of 76 amino acids and has a molecular mass of
8.5 kDa.
Key features
its C-terminal tail and Lys residues
Human ubiquitin sequence
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
Image source: http://en.wikipedia.org/wiki/Image:Ubiquitin_cartoon.png
Ubiquitination
Post-translational modification of a protein
Covalent attachment of one or more ubiquitin monomers
to Lys residues
Reversible
Targets proteins for degradation by the proteasome
Functions of Ubiquitination
Monoubiquitination
Histone regulation
DNA repair
Endocytosis
Budding of retroviruses from the plasma membrane
Polyubiquitination
Protein kinase activation
Machine Learning
Machine learning is programming computers to
optimize a performance criterion using data and
past experience.
Learn general models from a data set of particular
examples.
Build a model that is a good and useful approximation
to the data.
Machine Learning
Supervised learning
Learn input/output patterns from given, correct output.
Split data into training and test set.
Train the model on the training data.
Evaluate performance on test data.
Unsupervised learning
Learn input/output patterns without known output.
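The supervised workflow above (split, train, evaluate) can be sketched in a few lines. This is only an illustration: the one-dimensional data, the threshold "model", and every function name are assumptions made for the example, not part of the original study.

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle the examples and split them into training and test sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_threshold(train):
    """Toy model (assumption): threshold halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, data):
    """Fraction of examples whose predicted label matches the known label."""
    correct = sum((x > threshold) == (y == 1) for x, y in data)
    return correct / len(data)

# Hypothetical data: label 0 for small x, label 1 for large x.
data = [(x / 10, 0) for x in range(0, 20)] + [(x / 10, 1) for x in range(30, 50)]
train, test = train_test_split(data)
model = train_threshold(train)          # fitted on training data only
test_accuracy = accuracy(model, test)   # evaluated on held-out data only
```

The point of the split is that `test_accuracy` is measured on examples the model never saw during training.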
Machine Learning – Decision Tree
One of the classification algorithms
X                               Y
Outlook    Humidity   Wind      Play
Sunny      High       Weak      No
Sunny      Normal     Weak      Yes
Overcast   Normal     Weak      Yes
Rain       Normal     Strong    No
Rain       Normal     Weak      Yes
Each internal node tests the value of a feature and branches
according to the results of the test.
Each leaf node assigns a classification.
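As a rough illustration of internal nodes and leaves, the sketch below hard-codes one small tree that is consistent with the five Outlook/Humidity/Wind rows on this slide. The tree drawn on the original slide is not recoverable from the transcript, so this particular tree and the name `predict_play` are assumptions.

```python
def predict_play(outlook, humidity, wind):
    """Walk a small hand-written decision tree and return the Play label."""
    if outlook == "Sunny":
        # Internal node: test Humidity.
        return "No" if humidity == "High" else "Yes"
    if outlook == "Overcast":
        # Leaf node: always "Yes".
        return "Yes"
    # Remaining branch (Rain): internal node testing Wind.
    return "No" if wind == "Strong" else "Yes"

# The five examples from the slide: (Outlook, Humidity, Wind, Play).
rows = [
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "Normal", "Weak", "Yes"),
    ("Overcast", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Strong", "No"),
    ("Rain", "Normal", "Weak", "Yes"),
]
predictions = [predict_play(o, h, w) for o, h, w, _ in rows]
```

A learning algorithm would choose the tests automatically from the data; here they are written by hand only to show how a tree maps features X to a class Y.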
Machine Learning – Random Forest
A machine learning ensemble classifier
Consists of many decision trees
Each tree is constructed using a bootstrap
sample of training data.
After a large number of trees are generated,
each tree casts one vote, and the forest predicts
the most popular class.
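The two ingredients named above, bootstrap sampling and voting, can be sketched as follows. To keep the example short, each "tree" is replaced by a one-level threshold stump on a single feature; that simplification, the toy data, and all names are assumptions, and a real random forest also randomizes the features considered at each split.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Stand-in for a decision tree (assumption): a single threshold test."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:
        # Degenerate bootstrap sample: only one class was drawn.
        only = 1 if pos else 0
        return lambda x: only
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0

def train_forest(data, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample of the training data."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_trees)]

def forest_predict(forest, x):
    """Each tree casts one vote; return the most popular class."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical one-dimensional training data: (feature, label).
data = [(0.1, 0), (0.3, 0), (0.4, 0), (0.6, 0),
        (2.5, 1), (2.7, 1), (3.0, 1), (3.2, 1)]
forest = train_forest(data)
```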
Machine Learning – Support Vector Machines
Viewing input data as two sets of vectors in an
n-dimensional space, a support vector machine
constructs a separating hyperplane in that
space.
The hyperplane maximizes the margin between
the two data sets.
Machine Learning – Support Vector Machines
H3 does not separate the classes.
H1 separates with a small margin.
H2 separates with the maximum
margin.
If a data set is not linearly separable, map into a
higher-dimensional space using kernel approach.
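A small sketch of the idea behind mapping to a higher-dimensional space: the one-dimensional points below cannot be split by a single threshold, but after the explicit map x -> (x, x^2) a straight line separates them. The data, the map, and the hand-picked weights are assumptions for illustration; an SVM would learn the hyperplane from data and would usually apply a kernel implicitly instead of computing the map.

```python
def feature_map(x):
    """Explicit map from 1-D input to a 2-D feature space: x -> (x, x^2)."""
    return (x, x * x)

def classify(x, w=(0.0, 1.0), b=-2.0):
    """Linear decision rule sign(w . phi(x) + b) in the mapped space.

    w and b are chosen by hand here (assumption), not learned.
    """
    phi = feature_map(x)
    score = w[0] * phi[0] + w[1] * phi[1] + b
    return 1 if score > 0 else -1

# Hypothetical data: class +1 lies on both sides of class -1,
# so no single threshold on x separates the classes.
points = [(-3.0, 1), (-2.0, 1), (-0.5, -1), (0.0, -1), (1.0, -1), (2.5, 1)]
predictions = [classify(x) for x, _ in points]
```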
Data Sets for Prediction
334 protein sequences from yeast
Positive and negative sites with 25 amino acid
residues centered at lysine
Y E Y E Y D Q T D P V A K D P Y N P Y Y L D F A S
Remove all positive and negative sites that have
more than 40% identity within the data sets.
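The data preparation above can be sketched as follows. The slides do not say how sequence ends were handled or how identity was computed, so the padding character, the position-by-position identity measure, the greedy filtering order, and all function names are assumptions.

```python
def lysine_windows(sequence, flank=12, pad="-"):
    """Return (position, window) for every lysine; each window has 2*flank+1 residues.

    Padding with '-' near the termini is an assumption.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Position i in the sequence sits at i + flank in the padded string.
            windows.append((i, padded[i:i + 2 * flank + 1]))
    return windows

def identity(a, b):
    """Fraction of positions with the same character in two equal-length windows."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def remove_redundant(windows, max_identity=0.40):
    """Greedy filter: keep a window only if it is at most max_identity
    identical to every window kept so far."""
    kept = []
    for w in windows:
        if all(identity(w, k) <= max_identity for k in kept):
            kept.append(w)
    return kept

# The 25-residue fragment shown on the slide, with lysine at the center.
fragment = "YEYEYDQTDPVAKDPYNPYYLDFAS"
sites = lysine_windows(fragment)
filtered = remove_redundant(["AAAAA", "AAAAB", "CCCCC"])
```

Note that with this simple identity measure, padding characters also count as matches; a real pipeline might prefer an alignment-based identity.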
Features – Sequence Information
Relative amino acid frequencies, Entropy, Net charge, Total charge,
Aromatics, Charge-hydrophobicity ratio, Protein disorder probability,
Vihinen's flexibility, Hydrophobic moments, B-factors
64 × 4 = 256 features
Relative amino acid frequencies
Y E Y E Y D Q T D P V A K D P Y N P Y Y L D F A S
Window size = 11
A = 1/11   G = 0/11   M = 0/11   S = 0/11
C = 0/11   H = 0/11   N = 1/11   T = 1/11
D = 2/11   I = 0/11   P = 3/11   V = 1/11
E = 0/11   K = 1/11   Q = 0/11   W = 0/11
F = 0/11   L = 0/11   R = 0/11   Y = 1/11
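The frequencies for the window of size 11 can be reproduced with a short sketch. The function names are illustrative assumptions; the fragment is the one shown on the slide.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def centered_window(fragment, size):
    """Take `size` residues centered on the middle residue of the fragment."""
    center = len(fragment) // 2
    half = size // 2
    return fragment[center - half:center + half + 1]

def relative_frequencies(window):
    """Relative frequency of each of the 20 amino acids in the window."""
    return {aa: window.count(aa) / len(window) for aa in AMINO_ACIDS}

fragment = "YEYEYDQTDPVAKDPYNPYYLDFAS"
window = centered_window(fragment, 11)   # 5 residues on each side of the lysine
freqs = relative_frequencies(window)
```

For this fragment, `freqs["P"]` is 3/11 and `freqs["D"]` is 2/11, matching the values listed above.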
Features – Evolutionary Information
Position Specific Scoring Matrix
21 × 4 = 84 features
Window size = 11
Y E Y E Y D Q T D P V A K D P Y N P Y Y L D F A S
256(Seq) + 84(Evol) = 340 features.
Features – Structure Information
BLAST each sequence against PDB database.
Select alignments with greater than 30% identity.
For each mapped site, five shells with 1.5, 3, 4.5, 6, and 7.5 Å radial boundaries
are constructed around the residue’s alpha-carbon atom using X, Y, Z
coordinates from PDB.
Amino acid at the center site: 20 features
e.g. K
A C D E F G H I K L M N P Q R S T V W Y
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
Each shell contains 24 features.
4 for counts of C, N, O, S and 20 for counts of amino acids
20 + 24 x 5 = 140 features
Sites mapped to a structure:
60 of 245 positive sites (~24%)
3239 of 12906 negative sites (~25%)
A 1 × 140 zero vector is used for the other sites.
256(Seq) + 84(Evol) + 140(Str) = 480 features
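The layout of the 140 structure features (20 for the center residue plus 24 for each of 5 shells) can be sketched as below. The slides do not specify how atoms are assigned to shells at a boundary or how the amino acid counts are accumulated, so the inclusive outer boundary, the per-atom counting of parent residues, the input format, and all names are assumptions.

```python
from bisect import bisect_left

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ELEMENTS = "CNOS"
SHELL_RADII = [1.5, 3.0, 4.5, 6.0, 7.5]          # outer boundary of each shell, in angstroms
PER_SHELL = len(ELEMENTS) + len(AMINO_ACIDS)     # 4 + 20 = 24 counts per shell

def one_hot(residue):
    """20-dimensional indicator vector for the amino acid at the center site."""
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]

def structure_features(center_residue, neighbors):
    """Build the 140-dimensional vector.

    neighbors: iterable of (distance, element, residue) for atoms near the
    alpha-carbon; only C, N, O, S atoms are expected (assumed input format).
    """
    shells = [[0] * PER_SHELL for _ in SHELL_RADII]
    for distance, element, residue in neighbors:
        k = bisect_left(SHELL_RADII, distance)   # first boundary >= distance
        if k == len(SHELL_RADII):
            continue                             # beyond the outermost shell
        shells[k][ELEMENTS.index(element)] += 1
        # Assumption: each atom also counts once toward its parent residue type.
        shells[k][len(ELEMENTS) + AMINO_ACIDS.index(residue)] += 1
    return one_hot(center_residue) + [c for shell in shells for c in shell]

def unmapped_site():
    """Zero vector used for sites without a mapped structure."""
    return [0] * (len(AMINO_ACIDS) + PER_SHELL * len(SHELL_RADII))

# Hypothetical neighbors: two atoms inside the shells, one outside.
example = structure_features("K", [(1.2, "C", "K"), (2.9, "N", "A"), (9.0, "O", "D")])
```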
Prediction Results – Random Forest
Features            Accuracy         AUC
Seq + Evol + Str    65.2 +/- 22.8    71.5 +/- 25.3
Seq + Evol          63.9 +/- 23.4    69.8 +/- 24.9
Seq + Str           66.2 +/- 22.3    70.6 +/- 24.3
Evol + Str          56.7 +/- 23.1    59.2 +/- 27.7
Seq                 64.6 +/- 22.4    70.1 +/- 24.3
Prediction Results – Random Forest
Figure: ROC curves (True Positive Rate vs. False Positive Rate)
Seq + Evol + Str    AUC = 71.6
Seq + Evol          AUC = 71.0
Seq + Str           AUC = 71.1
Evol + Str          AUC = 60.4
Seq                 AUC = 70.5
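For readers unfamiliar with AUC, the area under a ROC curve equals the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative site. The sketch below computes it that way on made-up scores; the data and the function name are assumptions, and the slides do not state how the reported values were computed. The slides report AUC as a percentage, i.e. this value times 100.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    example has the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores and true labels (1 = positive site).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
area = auc(scores, labels)
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.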
Prediction Results – SVM
Features            Accuracy         AUC
Seq + Evol + Str    63.8 +/- 23.2    71.2 +/- 25.3
Seq + Evol          63.7 +/- 23.4    71.0 +/- 25.4
Seq + Str           65.1 +/- 23.5    71.3 +/- 25.0
Evol + Str          56.6 +/- 22.2    59.9 +/- 28.0
Seq                 65.8 +/- 23.0    71.2 +/- 25.0
Prediction Results – SVM
Figure: ROC curves (True Positive Rate vs. False Positive Rate)
Seq + Evol + Str    AUC = 70.0
Seq + Evol          AUC = 69.9
Seq + Str           AUC = 69.7
Evol + Str          AUC = 59.8
Seq                 AUC = 69.6
Feature Selection
Rank features using correlation coefficients (r).
r          Feature
-0.0790    Net charge
-0.0585    K frequency
 0.0530    E frequency
 0.0513    D frequency
 0.0481    Predicted B-factor
 0.0471    Protein disorder
 0.0448    Vihinen's flexibility
-0.0387    Hydrophobic moment
-0.0383    L frequency
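Ranking features by correlation with the class label can be sketched as follows. The slides say only "correlation coefficients", so the use of the Pearson coefficient, the ordering by absolute value, the toy feature values, and the names are assumptions.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(columns, labels):
    """Sort (name, r) pairs by |r| with the class label, strongest first."""
    scored = [(name, pearson_r(values, labels)) for name, values in columns.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical feature values for six sites (1 = ubiquitinated, 0 = not).
labels = [1, 1, 0, 0, 1, 0]
columns = {
    "net_charge": [-2, -1, 1, 2, -1, 1],
    "k_frequency": [0.1, 0.2, 0.1, 0.3, 0.2, 0.2],
}
ranking = rank_features(columns, labels)
```

A negative r, as for net charge in the table above, means that higher feature values are associated with the negative class.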
Conclusions
Ubiquitination sites are predictable.
The accuracy is modest; possible limiting factors:
Long range interactions
Flexibility of structure
Noise in positive sites
Small data set
The sequence features are the most important.
Acknowledgements
Prof. Predrag Radivojac
Wyatt Clark
Arunima Ram
Nils Schimmelmann
Prof. Sun Kim
Linda Hostetter
School of Informatics
Thank you!