A procedure for the automatic determination of hydrophobic


Ubiquitination Sites Prediction
Dah Mee Ko
Advisor: Dr. Predrag Radivojac
School of Informatics
Indiana University
May 22, 2009
Outline

Ubiquitination
Machine Learning
  Decision Tree
  Support Vector Machines
Prediction of ubiquitination sites
  Influence of sequence
  Influence of structure
  Influence of evolutionary consideration
Ubiquitin

A small protein that occurs in all eukaryotic cells.

Highly conserved among eukaryotic species.

Consists of 76 amino acids and has a molecular mass of
8.5 kDa.

Key features: its C-terminal tail and Lys residues

Human ubiquitin sequence
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIF
AGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
Image source: http://en.wikipedia.org/wiki/Image:Ubiquitin_cartoon.png
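As a quick check on the figures above, a short Python sketch can count the residues and list the lysine positions in the sequence shown. The variable names are illustrative only.

```python
# Human ubiquitin sequence as given on the slide.
UBIQUITIN = (
    "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIF"
    "AGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
)

# 1-based positions of the lysine (K) residues.
lysine_positions = [i + 1 for i, aa in enumerate(UBIQUITIN) if aa == "K"]

print(len(UBIQUITIN))      # number of residues
print(lysine_positions)    # positions of Lys residues
print(UBIQUITIN[-2:])      # last two residues of the C-terminal tail
```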
Ubiquitination

Post-translational modification of a protein
Covalent attachment of one or more ubiquitin monomers
to Lys residues
Reversible

Targets proteins for degradation by the proteasome


Functions of Ubiquitination

Monoubiquitination
  Histone regulation
  DNA repair
  Endocytosis
  Budding of retroviruses from the plasma membrane
Polyubiquitination
  Protein kinase activation
Machine Learning

Machine learning is programming computers to
optimize a performance criterion using data and
past experience.


Learn general models from a data set of particular
examples.
Build a model that is a good and useful approximation
to the data.
Machine Learning

Supervised learning
  Learn input/output patterns from given, correct outputs.
  Split the data into a training set and a test set.
    Train the model on the training data.
    Evaluate performance on the test data.
Unsupervised learning
  Learn patterns from the data without known outputs.
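The train/test protocol for supervised learning can be sketched as follows. This is a minimal illustration using only the standard library; the function name, the toy data, and the 30% test fraction are assumptions and not the setup used in this work.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: (feature vector, label) pairs.
data = [([i, i % 3], i % 2) for i in range(10)]
train, test = train_test_split(data, test_fraction=0.3)
print(len(train), len(test))
```

A model would be fitted on `train` only and scored on `test`, so the reported performance reflects examples the model has not seen.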
Machine Learning – Decision Tree

One of the classification algorithms

X = (Outlook, Humidity, Wind), Y = Play

Outlook    Humidity   Wind     Play
Sunny      High       Weak     No
Sunny      Normal     Weak     Yes
Overcast   Normal     Weak     Yes
Rain       Normal     Strong   No
Rain       Normal     Weak     Yes


Each internal node tests the value of a feature and branches
according to the results of the test.
Each leaf node assigns a classification.
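A hand-written tree shows how internal nodes test features and leaves assign a class. The tree below is only one of several that are consistent with the five weather examples; it is written by hand rather than learned, and is not necessarily the tree drawn on the original slide.

```python
def predict_play(outlook, humidity, wind):
    """Hand-written decision tree for the toy weather data."""
    if outlook == "Sunny":
        # Internal node: test Humidity.
        return "No" if humidity == "High" else "Yes"
    if outlook == "Overcast":
        # Leaf node: always Yes.
        return "Yes"
    # Remaining branch (Rain): internal node testing Wind.
    return "No" if wind == "Strong" else "Yes"

ROWS = [
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "Normal", "Weak", "Yes"),
    ("Overcast", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Strong", "No"),
    ("Rain", "Normal", "Weak", "Yes"),
]

for outlook, humidity, wind, play in ROWS:
    print(outlook, humidity, wind, "->", predict_play(outlook, humidity, wind))
```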
Machine Learning – Random Forest
A machine learning ensemble classifier
Consists of many decision trees
Each tree is constructed using a bootstrap sample of the training data.
After a large number of trees are generated, each tree casts a unit vote for the most popular class.
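The two ingredients named here, bootstrap sampling and unit voting, can be sketched in a few lines. The helper names and the toy data are hypothetical; tree construction itself is omitted.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(votes):
    """Return the class that received the most unit votes."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
training_data = list(range(10))   # hypothetical stand-in for training examples
sample = bootstrap_sample(training_data, rng)
print(sample)                     # drawn with replacement, so repeats are possible

# One vote per tree; the forest predicts the most common class.
print(majority_vote(["Yes", "No", "Yes"]))
```

Because sampling is with replacement, each tree typically sees some examples more than once and misses others, which is what makes the trees differ from one another.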

Machine Learning – Support Vector
Machines

Viewing the input data as two sets of vectors in an
n-dimensional space, a support vector machine
constructs a separating hyperplane in that
space.

The hyperplane maximizes the margin between
the two data sets.
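To make "margin" concrete, the sketch below measures the smallest distance from any point to a given hyperplane w.x + b = 0. It only evaluates candidate hyperplanes; it does not solve the SVM optimization problem. The data and the function name are hypothetical.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    Labels are +1 or -1. A negative result means at least one point
    lies on the wrong side of the hyperplane.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical 2-D toy data: negatives at x = 0, positives at x = 3.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labels = [-1, -1, +1, +1]

print(geometric_margin((1.0, 0.0), -1.5, points, labels))  # line x = 1.5
print(geometric_margin((1.0, 0.0), -0.5, points, labels))  # line x = 0.5
print(geometric_margin((0.0, 1.0), -0.5, points, labels))  # line y = 0.5
```

Both vertical lines separate the classes, but the one halfway between them has the larger margin, which is the hyperplane an SVM prefers. The horizontal line does not separate the classes at all.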
Machine Learning – Support Vector
Machines




H3 does not separate the classes.
H1 separates with a small margin.
H2 separates with the maximum
margin.
If a data set is not linearly separable, map it into a
higher-dimensional space using a kernel.
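A one-dimensional toy case illustrates the idea. Here the mapping is written out explicitly for clarity, whereas a kernel function computes inner products in the mapped space without constructing it. The data, the map, and the threshold are all hypothetical.

```python
def feature_map(x):
    """Explicit map from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)

# Hypothetical 1-D data: the positive class lies on both sides of the
# negative class, so no single threshold on x separates them.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [+1, -1, -1, +1]

def classify(x, threshold=2.5):
    """In the mapped space, the line x^2 = threshold separates the classes."""
    return +1 if feature_map(x)[1] > threshold else -1

print([classify(x) for x in xs])
```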
Data Sets for Prediction

334 protein sequences from yeast

Positive and negative sites with 25 amino acid
residues centered at lysine
Y E Y E Y D Q T D P V A K D P Y N P Y Y L D F A S

Remove all positive and negative sites that have
more than 40% identity inside the data sets.
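Extracting 25-residue fragments centered at each lysine might look like the sketch below. Padding lysines near the termini with "-" is an assumption made here for illustration; the slides do not say how such sites were handled. The function name is hypothetical.

```python
def lysine_windows(sequence, flank=12, pad="-"):
    """Return (1-based position, window) for every lysine in the sequence.

    Each window has 2 * flank + 1 residues with the lysine in the middle.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, aa in enumerate(sequence):
        if aa == "K":
            # Index i in `sequence` is index i + flank in `padded`.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows

example = "YEYEYDQTDPVAKDPYNPYYLDFAS"
for position, window in lysine_windows(example):
    print(position, window)
```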
Features – Sequence Information
Relative amino acid frequencies, Entropy, Net charge, Total charge,
Aromatics, Charge-hydrophobicity ratio, Protein disorder probability,
Vihinen's flexibility, Hydrophobic moments, B-factors
64 x 4 = 256 features

Relative amino acid frequencies

Y E Y E Y D Q T D P V A K D P Y N P Y Y L D F A S
Window size = 11

A = 1/11   G = 0/11   M = 0/11   S = 0/11
C = 0/11   H = 0/11   N = 1/11   T = 1/11
D = 2/11   I = 0/11   P = 3/11   V = 1/11
E = 0/11   K = 1/11   Q = 0/11   W = 0/11
F = 0/11   L = 0/11   R = 0/11   Y = 1/11
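The frequencies above can be reproduced with a short sketch that takes the 11 residues centered on the lysine of the example fragment. The function name and defaults are illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def relative_frequencies(fragment, window=11):
    """Relative amino acid frequencies in a window centered on the middle residue."""
    center = len(fragment) // 2
    half = window // 2
    sub = fragment[center - half:center + half + 1]
    return {aa: sub.count(aa) / len(sub) for aa in AMINO_ACIDS}

fragment = "YEYEYDQTDPVAKDPYNPYYLDFAS"
freqs = relative_frequencies(fragment)

# Show only the amino acids that occur in the window.
print({aa: round(f, 3) for aa, f in freqs.items() if f > 0})
```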
Features – Evolutionary Information
Position Specific Scoring Matrix
21 x 4 = 84 features

Window size = 11
Y E Y E Y D Q T D P V A K D P Y N P Y Y L D F A S

256(Seq) + 84(Evol) = 340 features.
Features – Structure Information






BLAST each sequence against PDB database.
Select alignments with greater than 30% identity.
For each mapped site, five shells with 1.5, 3, 4.5, 6, and 7.5 Å radial boundaries
are constructed around the residue's alpha-carbon atom using X, Y, Z
coordinates from PDB.

Amino acid at the center site: 20 features
e.g. K:
A C D E F G H I K L M N P Q R S T V W Y
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0

Each shell contains 24 features.
4 for counts of C, N, O, S and 20 for counts of amino acids

20 + 24 x 5 = 140 features
60 sites among 245 positive sites: ~24%
3239 sites among 12906 negative sites: ~25%
1 x 140 zero vector for the other sites
256(Seq) + 84(Evol) + 140(Str) = 480 features
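The 140-dimensional structure vector can be assembled roughly as sketched below. Several details are assumptions for illustration: distances from the alpha-carbon are taken as precomputed, shell boundaries are treated as inclusive upper limits, and the ordering of features within the vector is arbitrary. The function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ATOM_TYPES = "CNOS"
SHELL_BOUNDARIES = [1.5, 3.0, 4.5, 6.0, 7.5]   # outer radii in angstroms

def one_hot(residue):
    """20-element indicator vector for the center amino acid."""
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]

def shell_index(distance):
    """Index of the shell containing `distance`, or None beyond the last shell."""
    for i, outer in enumerate(SHELL_BOUNDARIES):
        if distance <= outer:
            return i
    return None

def structure_features(center_residue, atoms, residues):
    """atoms: (element, distance) pairs; residues: (amino acid, distance) pairs."""
    width = len(ATOM_TYPES) + len(AMINO_ACIDS)          # 24 counts per shell
    shells = [[0] * width for _ in SHELL_BOUNDARIES]
    for element, d in atoms:
        s = shell_index(d)
        if s is not None and element in ATOM_TYPES:
            shells[s][ATOM_TYPES.index(element)] += 1
    for aa, d in residues:
        s = shell_index(d)
        if s is not None and aa in AMINO_ACIDS:
            shells[s][len(ATOM_TYPES) + AMINO_ACIDS.index(aa)] += 1
    return one_hot(center_residue) + [v for shell in shells for v in shell]

# Hypothetical neighborhood of a lysine site.
features = structure_features("K", [("N", 1.4), ("C", 2.0)], [("D", 5.2)])
print(len(features))
```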
Prediction Results – Random Forest
Features            Accuracy (%)     AUC (%)
Seq + Evol + Str    65.2 +/- 22.8    71.5 +/- 25.3
Seq + Evol          63.9 +/- 23.4    69.8 +/- 24.9
Seq + Str           66.2 +/- 22.3    70.6 +/- 24.3
Evol + Str          56.7 +/- 23.1    59.2 +/- 27.7
Seq                 64.6 +/- 22.4    70.1 +/- 24.3
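For reference, the two reported metrics can be computed from classifier scores along these lines. This is a generic sketch with hypothetical scores and a 0.5 decision threshold; it is not the evaluation code behind the numbers reported here.

```python
def auc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive example scores higher; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples classified correctly at the given threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical scores for two positive and two negative sites.
scores = [0.9, 0.4, 0.6, 0.2]
labels = [1, 1, 0, 0]
print(auc(scores, labels), accuracy(scores, labels))
```

Unlike accuracy, AUC does not depend on a single threshold, which makes it the more informative summary when negatives greatly outnumber positives.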
Prediction Results – Random Forest
[Figure: ROC curves, True Positive Rate vs. False Positive Rate]
Seq + Evol + Str: AUC = 71.6
Seq + Evol: AUC = 71.0
Seq + Str: AUC = 71.1
Evol + Str: AUC = 60.4
Seq: AUC = 70.5
Prediction Results – SVM
Features            Accuracy (%)     AUC (%)
Seq + Evol + Str    63.8 +/- 23.2    71.2 +/- 25.3
Seq + Evol          63.7 +/- 23.4    71.0 +/- 25.4
Seq + Str           65.1 +/- 23.5    71.3 +/- 25.0
Evol + Str          56.6 +/- 22.2    59.9 +/- 28.0
Seq                 65.8 +/- 23.0    71.2 +/- 25.0
Prediction Results – SVM
[Figure: ROC curves, True Positive Rate vs. False Positive Rate]
Seq + Evol + Str: AUC = 70.0
Seq + Evol: AUC = 69.9
Seq + Str: AUC = 69.7
Evol + Str: AUC = 59.8
Seq: AUC = 69.6
Feature Selection

Rank features using correlation coefficients (r).
r          Feature
-0.0790    Net charge
-0.0585    K frequency
 0.0530    E frequency
 0.0513    D frequency
 0.0481    Predicted B-factor
 0.0471    Protein disorder
 0.0448    Vihinen's flexibility
-0.0387    Hydrophobic moment
-0.0383    L frequency
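Ranking features by correlation with the class label can be sketched as follows. Sorting by the absolute value of r is an assumption suggested by the ordering of the values above; the feature values below are hypothetical toy numbers.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(columns, labels):
    """Sort (feature name, r) pairs by |r|, strongest correlation first."""
    scored = [(name, pearson_r(values, labels)) for name, values in columns.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy data: 1 = ubiquitinated site, 0 = non-ubiquitinated site.
labels = [1, 1, 0, 0]
columns = {
    "k_frequency": [0.1, 0.3, 0.2, 0.2],
    "net_charge": [-2.0, -1.0, 1.0, 2.0],
}
for name, r in rank_features(columns, labels):
    print(f"{r:+.4f}  {name}")
```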
Conclusions

Ubiquitination sites are predictable.

The accuracy is modest.
  Long-range interactions
  Flexibility of structure
  Noise in positive sites
  Small data set

The sequence features are the most important.
Acknowledgements

Prof. Predrag Radivojac
Wyatt Clark
Arunima Ram
Nils Schimmelmann

Prof. Sun Kim
Linda Hostetter
School of Informatics

Thank you!