A Novel Method of Protein Secondary Structure Prediction


A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: SVM Approach, by S. Hua & Z. Sun

Presenter: Sancar Adali

Outline  Problem Statement  SVM primer  Methods and Parameters  Results  Conclusion  Discussion

Problem Statement  Predicting Secondary Protein Structure from amino acid Sequences Secondary Protein Structure: The local shape of polypeptide chain dependent on the hydrogen bonds between different amino acids In the big picture, the goal is to solve tertiary/quaternary structure in 3D. By solving the more easily tackled secondary structure problem in 2D, we’re solving an easier problem.

Some important terminology
 segment: a group of consecutive residues that form a secondary structure such as a helix, coil, or sheet
 residue: when two or more amino acids combine to form a peptide, the elements of water are removed, and what remains of each amino acid is called an amino acid residue
 C-terminus: the residue in a peptide that has a free carboxyl group
 N-terminus: the residue in a peptide that has a free amino group
[International Union of Pure and Applied Chemistry definitions]

Problem Statement  Our main goal is to find Sequence to Structure relationship. We usually know the amino acid sequence of proteins. The problem is complicated, because of the many interactions/bonds between amino acid residues. We start at a lower level of complexity and try to predict secondary structure.

Amino acids to Proteins  [Diagram: two amino acids, each with an amino (NH3) group, a carboxyl (COO) group and a side chain R1 or R2, joined by a peptide bond; the end with the free amino group is the N-terminus and the end with the free carboxyl group is the C-terminus]

Linear Decision Boundary  # of classes: 2  y: output ∈ {-1, 1}; x ∈ R^n; w ∈ R^n  Training set L = ((x_1, y_1), (x_2, y_2), …, (x_n, y_n))  w·x = Σ_{j=1}^{N} w_j x_j  Hyperplane: H = {x : w·x + b = 0}
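As a minimal illustration (not part of the original slides), the linear decision rule can be written directly in Python; the weight vector w and bias b below are made-up values for a 2-D example.

```python
import numpy as np

# Hypothetical 2-D example: hyperplane x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0

def linear_decision(x):
    """Classify x by which side of the hyperplane w·x + b = 0 it falls on."""
    return 1 if np.dot(w, x) + b > 0 else -1

print(linear_decision(np.array([2.0, 2.0])))  # +1, above the hyperplane
print(linear_decision(np.array([0.0, 0.0])))  # -1, below the hyperplane
```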

SVM(Support Vector Machine)  Linearly Separable Training Samples

SVM(Support Vector Machine)  Original Idea  Choose H by  Maximizing min{||

x

-

x

i ||:

w ∙x

+b = 0, i=1,…n} with respect to

w

and b  Maximizing the margin(the distance from hyperplane to the closest positive and negative training points)

Solving Linearly Separable Case  Hyperplane: H = {

x

:

w

x

+b = 0}  M=d + + d  Choose

w

,b to maximize margin. Then f SVM =sign(

w

x

+b) Since data points are linearly separable, there exist such

w

and b y i (

w

x

i +b)>0 for all i =1, …k

Solving Linearly Separable Case  Rescale

w

, b to make the points closest to HP satisfy |

w

·

x

i +b|=1(standardizing distance)  y i (

w

x

i + b)≥1 for all i = 1,2,…k  Now we have two parallel planes to H H 1 :

w

x

 b  1 H 2 :

w

x

 b   1

Linearly Separable Case Soln.  The distance between H₁ and H₂ is 2/||w||.
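A one-line justification (standard, not spelled out on the slide): take x₁ on H₁ and x₂ on H₂ and project their difference onto the unit normal w/||w||.

```latex
\mathbf{w}\cdot\mathbf{x}_1 + b = 1, \qquad \mathbf{w}\cdot\mathbf{x}_2 + b = -1
\;\Rightarrow\; \mathbf{w}\cdot(\mathbf{x}_1 - \mathbf{x}_2) = 2
\;\Rightarrow\; \frac{\mathbf{w}}{\|\mathbf{w}\|}\cdot(\mathbf{x}_1 - \mathbf{x}_2) = \frac{2}{\|\mathbf{w}\|}.
```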

Linearly Separable Case Soln.  We want to maximize the distance between the hyperplanes H₁ and H₂, which means minimizing ||w||²  The problem has become an optimization problem: min_{w,b} ||w||² such that y_i(x_i·w + b) ≥ 1, i = 1, …, n  The decision boundary depends only on a subset of the training samples. These are called "Support Vectors".

Introduce Lagrange (nonnegative) multipliers α_i ≥ 0 to solve the problem  The solution and the decision function are w = Σ_{i=1}^{n} α_i y_i x_i = Σ_{i: x_i ∈ S} α_i y_i x_i (S is the set of support vectors) and f_SVM(x) = sign( Σ_{i=1}^{n} α_i y_i x_i·x + b )
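For reference, the multipliers α_i are obtained from the standard Wolfe dual (a textbook form; the slide does not spell it out). Note that the training samples enter only through dot products x_i·x_j, which is what makes the kernel substitution introduced later possible.

```latex
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \,\mathbf{x}_i\cdot\mathbf{x}_j
\quad \text{subject to} \quad \alpha_i \ge 0,\;\; \sum_{i=1}^{n} \alpha_i y_i = 0.
```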

Error bound on SVM using LOOCV  Leave-one-out cross-validation gives the bound E[P(error)] ≤ E[# support vectors] / (# training samples)

Not Linearly Separable Case  Introduce slack variables ξ_i ≥ 0 to take classification errors into account  If ξ_i > 1, we make an error  The constraints become y_i(w·x_i + b) ≥ 1 − ξ_i  Add an error term to the objective function: min_{w,b} ||w||² + C Σ_{i=1}^{n} ξ_i such that y_i(x_i·w + b) ≥ 1 − ξ_i, i = 1, …, n

Not Linearly Separable Case  The solution is again w = Σ_{i=1}^{n} α_i y_i x_i = Σ_{i: x_i ∈ S} α_i y_i x_i and f_SVM(x) = sign( Σ_{i=1}^{n} α_i y_i x_i·x + b ), but now with 0 ≤ α_i ≤ C

Nonlinear SVM  Important observation  Training samples only appear in dot products:

x

i 

x

or

x

i 

x

j  Map the training samples to a higher-dimensional space where they become linearly separable and solve the decision problem there

Nonlinear SVM  Φ: R^N → H maps the training set L to {(Φ(x_i), y_i)}_{i=1}^{k}, where dim(H) may be very large (possibly infinite)  We don't need to know Φ explicitly; we only need the inner products Φ(x_i)·Φ(x_j) = ⟨Φ(x_i), Φ(x_j)⟩_H  These can be represented by a kernel function K: R^N × R^N → R with Φ(x_i)·Φ(x_j) = K(x_i, x_j)

Nonlinear SVM  f(x) = sign( Σ_{i=1}^{N_S} α_i y_i Φ(s_i)·Φ(x) + b ) = sign( Σ_{i=1}^{N_S} α_i y_i K(s_i, x) + b )  Φ appears neither in the decision function nor in the training step

Typical Kernels  Polynomial: K(x, x') = (x·x' + 1)^d  Gaussian RBF: K(x, x') = exp(−||x − x'||² / (2σ²))  Sigmoid: K(x, x') = tanh(κ x·x' − δ)
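A small sketch of these kernels as Python functions (the parameter names d, sigma, kappa and delta are generic placeholders, not values from the paper):

```python
import numpy as np

def polynomial_kernel(x, xp, d=2):
    """K(x, x') = (x·x' + 1)^d"""
    return (np.dot(x, xp) + 1.0) ** d

def gaussian_rbf_kernel(x, xp, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, xp, kappa=1.0, delta=0.0):
    """K(x, x') = tanh(kappa * x·x' - delta)"""
    return np.tanh(kappa * np.dot(x, xp) - delta)
```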

Methods  Use of SVMs with a Radial Basis Function kernel  Three classes: H (α-helix), E (β-sheet), C (coil and the remaining part)  Construction of tertiary (three-class) classifiers from multiple binary classifiers (SVMs are binary classifiers)

SVM Implementation  Gaussian RBF kernel with r = 0.10

 The authors claim this parameter is not critical and that they have tried other values  C: the penalty for classification errors. It must be chosen carefully to balance overfitting (penalty too high) against poor generalization (penalty too low). The optimal value of C was determined using 7-fold cross-validation in this paper.
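A hedged sketch of a comparable setup using scikit-learn (a library the paper predates): an RBF-kernel SVM with a fixed kernel width and C selected by 7-fold cross-validation. The data here are random stand-ins for the encoded windows, the C grid is a placeholder, and whether the slide's r = 0.10 maps one-to-one onto scikit-learn's gamma depends on how the kernel width is parameterized.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Toy stand-in data: 100 binary feature vectors of dimension 21 * 11 with binary labels (e.g. H vs ~H)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 21 * 11)).astype(float)
y = rng.choice([-1, 1], size=100)

# Fix the RBF kernel width and pick the error penalty C by 7-fold cross-validation
search = GridSearchCV(SVC(kernel="rbf", gamma=0.10),
                      param_grid={"C": [0.5, 1.0, 1.5, 2.0]},  # placeholder grid
                      cv=7)
search.fit(X, y)
print("best C:", search.best_params_["C"])
```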

Encoding Amino Acid Sequence Information  20 amino acids + 1 unit for the N- or C-terminus: 21 binary unit vectors (1,0,…,0), (0,1,…,0), …  Concatenate these vectors over each sliding window  For window length ℓ, the dimension of the feature vector is 21·ℓ

Encoding Example  Assume ℓ = 5 and that there are only 3 amino acids (for our purposes), encoded as Arg = (1,0,0,0), Lys = (0,1,0,0), Met = (0,0,1,0); window positions that fall beyond either end of the chain are encoded separately as (0,0,0,1)  The feature vector for the Arg* residue in the sequence Arg, Lys, Arg*, Met, Arg, Met, Met is the concatenation over the window Arg, Lys, Arg*, Met, Arg: x_i = (1,0,0,0; 0,1,0,0; 1,0,0,0; 0,0,1,0; 1,0,0,0)  This is our input to the SVM.
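A minimal sketch of this sliding-window encoding, using the toy 3-amino-acid alphabet from the example (the function and variable names are mine, not the paper's):

```python
import numpy as np

ALPHABET = ["Arg", "Lys", "Met"]      # toy alphabet from the example above
TERMINUS = len(ALPHABET)              # extra unit for positions beyond either end of the chain

def one_hot(index, size=len(ALPHABET) + 1):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def encode_window(sequence, center, window=5):
    """Concatenate one-hot vectors for the residues in a window centered on `center`.
    Positions before the N-terminus or after the C-terminus get the terminus unit."""
    half = window // 2
    parts = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(sequence):
            parts.append(one_hot(ALPHABET.index(sequence[pos])))
        else:
            parts.append(one_hot(TERMINUS))
    return np.concatenate(parts)      # dimension = (len(ALPHABET) + 1) * window

seq = ["Arg", "Lys", "Arg", "Met", "Arg", "Met", "Met"]
print(encode_window(seq, center=2))   # feature vector for the third residue (Arg*)
```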

Determination of Optimal Window Length  Determines dimension of feature vector  Determined using 7-fold cross-validation

Data Sets  RS126 set: 126 nonhomologous proteins  non-homologous: no two proteins in the set share more than 25% sequence identity over a length of more than 80 residues  CB513 set: 513 nonhomologous proteins; almost all the sequences in the RS126 set are included in the CB513 set  non-homologous here is defined via the SD score: an SD score of ≥ 5 is regarded as homology. The SD score (the distance of the alignment score from the mean score of randomized sequences, in standard deviations) is a more stringent measure than percentage identity; in fact, 11 pairs of proteins in the RS126 set are sequence-similar when the SD score is used instead of percentage identity.  The CB513 set contains 16 chains of fewer than 30 residues, and short chains usually decrease accuracy.

Protein Secondary Structure Definition  The DSSP states are H (α-helix), G (3₁₀-helix), I (π-helix), E (β-strand), B (isolated β-bridge), T (turn), S (bend), and - (rest)  These eight states are reduced to the three classes H, E, C using two reduction schemes (Reduction 1 and Reduction 2)

Reliability Index (of single classifications)  Defined using a discretized function of the sample's distance to the separating hyperplane  Comparing RI against the accuracy measure shows that accuracy increases monotonically with increasing RI

Accuracy Measures  Estimate of p(Ŷ = I | Y = I): p(Ŷ = I, Y = I) / p(Y = I)  Estimate of p(Y = I | Ŷ = I): p(Y = I, Ŷ = I) / p(Ŷ = I)  Y is the class/state variable and Ŷ is the output of the classifier
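A short sketch of these two per-state estimates computed from predicted and true label strings (the function name and example labels are mine):

```python
import numpy as np

def q_measures(y_true, y_pred, state):
    """Estimate p(Yhat = I | Y = I) and p(Y = I | Yhat = I) for a given state I."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    observed = y_true == state
    predicted = y_pred == state
    both = observed & predicted
    q_obs = float(both.sum() / observed.sum())    # fraction of true state-I residues predicted as I
    q_pred = float(both.sum() / predicted.sum())  # fraction of predicted state-I residues that are truly I
    return q_obs, q_pred

print(q_measures(list("HHHECCCH"), list("HHEECCCH"), "H"))  # (0.75, 1.0)
```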

Segment Overlap Measure  The segment overlap measure (SOV) is a more realistic way of measuring accuracy than per-residue accuracy, in terms of relevance to 3D structure  Problems with per-residue accuracy:  Proteins with the same 3D fold differ by about 12% in secondary structure, so the maximum achievable Q₃ should be about 88%  The ends of segments can vary between proteins with the same 3D structure, so their classification is less relevant to determining the protein structure  SOV accounts for:  The type and position of secondary structure segments rather than per-residue agreement  Variation of the residues at the ends of the segments

Comparison of SOV and per-residue

Accuracy Measures  Correlation Measure: C_α = (p_α n_α − u_α o_α) / √((p_α + u_α)(p_α + o_α)(n_α + u_α)(n_α + o_α)), where p_α: # of true positives, n_α: # of true negatives, o_α: # of false positives, u_α: # of false negatives, and α stands for H, E, or C
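A sketch of this correlation measure computed directly from the four counts (the counts in the example call are made up for illustration):

```python
import math

def correlation(p, n, o, u):
    """C = (p*n - u*o) / sqrt((p+u)(p+o)(n+u)(n+o))
    p: true positives, n: true negatives, o: false positives, u: false negatives."""
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom > 0 else 0.0

print(correlation(p=50, n=80, o=10, u=15))
```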

Tertiary Classifier Design  Binary classifiers H/~H, E/~E, C/~C  Choose the class whose classifier gives the largest distance from its hyperplane: SVM_MAX_D

Tertiary Classifier Design  SVM_VOTE: the six binary classifiers H/~H, E/~E, C/~C, H/C, E/H, E/C vote, and the class with the most votes wins  SVM_NN: the outputs of the binary classifiers are combined with a neural network  SVM_JURY: combine all of the tertiary classifiers designed so far to form a jury
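A rough sketch (with my own names, toy data and scikit-learn classifiers standing in for the paper's SVMs) of two of these combination schemes: SVM_MAX_D picks the class whose one-against-rest classifier gives the largest decision value, and SVM_VOTE lets all six binary classifiers cast votes.

```python
import numpy as np
from collections import Counter
from sklearn.svm import SVC

CLASSES = ["H", "E", "C"]

def predict_max_d(one_vs_rest, x):
    """SVM_MAX_D: pick the class whose one-against-rest classifier is most confident."""
    distances = {c: clf.decision_function([x])[0] for c, clf in one_vs_rest.items()}
    return max(distances, key=distances.get)

def predict_vote(one_vs_rest, one_vs_one, x):
    """SVM_VOTE: each of the six binary classifiers casts one vote; the majority class wins."""
    votes = Counter()
    for c, clf in one_vs_rest.items():
        if clf.decision_function([x])[0] > 0:
            votes[c] += 1
    for (a, b), clf in one_vs_one.items():
        votes[a if clf.decision_function([x])[0] > 0 else b] += 1
    return votes.most_common(1)[0][0]

# Toy demo: random vectors stand in for the encoded windows.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = rng.choice(CLASSES, size=120)

one_vs_rest = {c: SVC(kernel="rbf", gamma=0.10).fit(X, (y == c).astype(int)) for c in CLASSES}
one_vs_one = {}
for a, b in [("H", "C"), ("E", "H"), ("E", "C")]:
    mask = (y == a) | (y == b)
    one_vs_one[(a, b)] = SVC(kernel="rbf", gamma=0.10).fit(X[mask], (y[mask] == a).astype(int))

x = X[0]
print(predict_max_d(one_vs_rest, x), predict_vote(one_vs_rest, one_vs_one, x))
```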

Results

Results(Comparison with PHD)

Results, Details – Support Vectors  In typical pattern recognition problems the ratio # SVs / # training samples is much lower.

Interpretation: recall that the error bound found by LOOCV was E[# SVs] / # training samples. This means the error bound here will be very high compared to typical pattern recognition problems.

Conclusions  Application of the SVM method to secondary structure prediction, with good results (an improvement over PHD).

 Importance of the accuracy measure used to evaluate the performance of algorithms, and of the class reduction method  Use of a machine learning method to attack the protein structure problem at a lower level of complexity  The general protein folding problem in three dimensions requires too much computational power, so this is a cheap alternative.

Critique of article  The number of SVs is too large, which puts the performance of the SVM into question.

 The article does not present the machine learning aspects in a clear form; it is too focused on results.