Support vector machine approach
for protein subcellular localization prediction
(SubLoc)
Kim Hye Jin
Intelligent Multimedia Lab.
2001.09.07.
Contents
• Introduction
• Materials and Methods
– Support vector machine
– Design and implementation of the
prediction system
– Prediction system assessment
• Results
• Discussion and Conclusion
Introduction (1)
• Motivation
– Subcellular localization is a key functional characteristic of
potential gene products such as proteins
• Traditional methods
– Protein N-terminal sorting signals
• Nielsen et al. (1999), von Heijne et al. (1997)
– Amino acid composition
• Nakashima and Nishikawa (1994), Nakai (2000),
Andrade et al. (1998), Cedano et al. (1997), Reinhardt and
Hubbard (1998)
Materials and Methods(1)
• Dataset: SWISS-PROT release 33.0
– Only sequences with complete and reliable localization annotations
– No transmembrane proteins
(see Rost et al., 1996; Hirokawa et al., 1998; Lio and Vannucci, 2000)
– Redundancy reduction
– Effectiveness test
• by Reinhardt and Hubbard (1998)
Support vector machine(1)
• A quadratic optimization problem with bound
constraints and one linear equality constraint
• Basically for two-class classification problems
input vector x = (x1, ..., x20) (xi: composition of amino acid i)
output label y ∈ {-1, +1}
• Idea
– Map input vectors into a high-dimensional feature space H
– Construct the optimal separating hyperplane (OSH)
– Maximize the margin, i.e. the distance between the hyperplane
and the nearest data points of each class in the space H
– Mapping by a kernel function K(xi, xj)
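A minimal sketch of how the 20-dimensional input vector could be built from a sequence, assuming each xi is the fraction of one of the 20 standard amino acids. The function name, the residue ordering, and the handling of non-standard symbols are illustrative assumptions, not details taken from the slides.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def amino_acid_composition(sequence):
    """Return the 20-dimensional vector of residue fractions for a protein sequence."""
    counts = Counter(sequence.upper())
    # Assumption: only the 20 standard residues are counted; symbols such as X or B are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

x = amino_acid_composition("MKKLLA")  # toy sequence, not from the dataset
```

The resulting vector sums to 1, so sequences of different lengths become directly comparable.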
Support vector machine(2)
• Decision function
f(x) = sgn( Σi yi αi K(xi, x) + b )
• where the coefficients αi are obtained by solving a convex
quadratic programming problem
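For reference, the quadratic program in its standard soft-margin dual form, which is assumed here to be what the slides call Eq. (2). The symbols N (number of training proteins) and α_i are the usual textbook notation rather than something confirmed by the slides.

```latex
\max_{\alpha}\; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0
```

The box constraints on each α_i are the bound constraints, and the sum over α_i y_i is the single linear equality constraint.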
Support vector machine(3)
• Constraints
– In Eq. (2), C is the regularization parameter; it controls the
trade-off between margin and misclassification error
• Typical kernel functions
– Eq. (3): polynomial kernel with parameter d
– Eq. (4): radial basis function (RBF) kernel with parameter γ
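The two kernels and the raw SVM output can be sketched as follows, assuming the common textbook forms (xi·xj + 1)^d for the polynomial kernel and exp(-γ‖xi - xj‖²) for the RBF kernel. The function names and the toy support vectors are illustrative, not trained values.

```python
import math

def polynomial_kernel(xi, xj, d):
    """Polynomial kernel (xi . xj + 1)^d."""
    dot = sum(a * b for a, b in zip(xi, xj))
    return (dot + 1.0) ** d

def rbf_kernel(xi, xj, gamma):
    """RBF kernel exp(-gamma * ||xi - xj||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, labels, alphas, b, kernel):
    """Raw SVM output sum_i y_i * alpha_i * K(x_i, x) + b; its sign is the predicted class."""
    return sum(y * a * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b

# Toy values for illustration only (not trained coefficients).
svs = [[1.0, 0.0], [0.0, 1.0]]
ys = [1, -1]
alphas = [0.5, 0.5]
value = decision_value([1.0, 0.0], svs, ys, alphas, 0.0,
                       lambda u, v: rbf_kernel(u, v, gamma=5.0))
```

Only points with non-zero α_i (the support vectors) contribute to the sum, so evaluation cost scales with the number of support vectors rather than with the full training set.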
Support vector machine(4)
• Benefits of SVM
– Global optimization
– Handle large feature spaces
– Effectively avoid over-fitting by
controlling margin
– Automatically identify a small subset
made up of informative points
Design and implementation of the
prediction system
• Problem: multi-class classification
– Prokaryotic sequences: 3 classes
– Eukaryotic sequences: 4 classes
• Solution
– Reduce the multi-class problem to a set of binary
classification problems
– 1-v-r SVM (one versus rest)
• QP problem
– LOQO algorithm (Vanderbei, 1994)
• SVMlight
• Speed
– Less than 10 min on a PC running at 500MHz
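The one-versus-rest reduction can be sketched like this: one binary scorer per class, with the class of the largest output winning. The scorer functions and the three class names below are hypothetical stand-ins chosen for illustration; a real system would use the trained binary SVMs.

```python
def one_vs_rest_predict(x, binary_scorers):
    """Pick the class whose 'class vs. rest' scorer gives the largest output.

    binary_scorers maps a class name to a function returning the raw SVM output for x.
    """
    scores = {label: scorer(x) for label, scorer in binary_scorers.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical stand-in scorers for three classes; real ones would be trained SVMs.
scorers = {
    "cytoplasmic": lambda x: 1.2 * x[0] - 0.3,
    "periplasmic": lambda x: 0.8 * x[1] - 0.1,
    "extracellular": lambda x: -0.5,
}
label, scores = one_vs_rest_predict([1.0, 0.2], scorers)
```

Returning the full score dictionary alongside the label keeps the per-class outputs available for later confidence estimates.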
Prediction system assessment
• Prediction quality assessed by the jackknife test
– Each protein is singled out in turn as a test
protein, with the remaining proteins used to train
the SVM
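A minimal sketch of the jackknife (leave-one-out) loop. To keep it runnable without an SVM library, a nearest-centroid classifier is swapped in for the SVM; that classifier, the function names, and the toy data are assumptions for illustration only.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Leave-one-out: hold out each sample once, train on the rest, score the held-out one."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Stand-in classifier (nearest class centroid), NOT an SVM.
def train_centroids(xs, ys):
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_centroid(model, x):
    # Return the class whose centroid is closest in squared Euclidean distance.
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))

xs = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]  # toy data
ys = ["a", "a", "b", "b"]
acc = jackknife_accuracy(xs, ys, train_centroids, predict_centroid)
```

Because the model is retrained once per protein, the cost grows with the dataset size, which is why a cheaper k-fold split is sometimes used instead.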
Results (1)
• SubLoc prediction accuracy by jackknife test
– Prokaryotic sequence case
• d = 1 and d = 9 for polynomial kernel
• γ = 5.0 for RBF
• C = 1000 for SVM constraints
– Eukaryotic sequence case
• d = 9 for polynomial kernel
• γ = 16.0 for RBF
• C = 500 for each SVM
• Test: 5-fold cross-validation
(because of limited computational power)
Comparison
• Based on amino acid composition
– Neural network
• Reinhardt and Hubbard, 1998
– Covariant discriminant algorithm
• Chou and Elrod, 1999
• Based on full sequence information
– Markov model (Yuan, 1999)
Assigning a reliability index
• RI (reliability index)
– Based on the difference between the highest
and the second-highest output values of
the 1-v-r SVMs
• 78% of all sequences
have RI ≥ 3, and 95.9% of these
are correctly predicted
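The quantity underlying the reliability index can be sketched as below. The slides do not give the mapping from this difference to an integer RI, so only the raw gap is computed; the function name and the toy outputs are illustrative assumptions.

```python
def output_gap(scores):
    """Difference between the highest and second-highest 1-v-r SVM outputs."""
    if len(scores) < 2:
        raise ValueError("need outputs from at least two classifiers")
    top, second = sorted(scores, reverse=True)[:2]
    return top - second

gap = output_gap([1.7, -0.4, 0.2, -1.1])  # toy outputs for four classes
```

A large gap means one classifier clearly dominates, while a gap near zero marks a prediction that two classes contest.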
Robustness
to errors in the N-terminal sequence
Discussion and Conclusion
• SVM information condensation
– The number of SVs is quite small
– The ratio of SVs to all training samples is 13-30%
SVM parameter selection
• Little influence on the classification
performance
– Table 8 shows little difference
between kernel functions
– Robust characteristic of the dataset
(cf. Vapnik, 1995)
Improvement of the performance
• Combining with other methods
– Sorting-signal-based method and amino
acid composition
• Signal: sensitive to errors in the N-terminal sequence
• Composition: weak for proteins with similar amino acid composition
• Incorporate other informative features
• Bayesian system integrating whole-genome
expression data
• Fluorescence microscope images