Protein Secondary Structure Prediction

Dong Xu
Computer Science Department
271C Life Sciences Center
1201 East Rollins Road
University of Missouri-Columbia
Columbia, MO 65211-2060
E-mail: [email protected]
573-882-7064 (O)
http://digbio.missouri.edu
Outline
o What is Secondary Structure
o Introduction to Secondary Structure Prediction
o Chou-Fasman Method
o Nearest Neighbor Method
o Neural Network Method
Structures in Protein
Language: Letters → Words → Sentences
Protein: Residues → Secondary Structure → Tertiary Structure
α helix
o Single protein chain (local)
o Shape maintained by intramolecular H bonding between -C=O and H-N-
β sheet
o Several protein chains
o Shape maintained by intramolecular H bonding between chains
o Non-local on protein sequence
β-sheet (parallel, anti-parallel)
Classification of secondary structure
o Defining features
  o Dihedral angles
  o Hydrogen bonds
  o Geometry
o Assigned manually by experimentalists
o Automatic
  o DSSP (Kabsch & Sander, 1983)
  o STRIDE (Frishman & Argos, 1995)
  o Continuum (Andersen et al.)
Classification
o Eight states from DSSP
  o H: α-helix
  o G: 3₁₀ helix
  o I: π-helix
  o E: β-strand
  o B: bridge
  o T: β-turn
  o S: bend
  o C: coil
o CASP Standard
  o H = (H, G, I), E = (E, B), C = (C, T, S)
[Figure: excerpt of DSSP output listing residue numbers, amino acids, and assigned states]
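The CASP grouping of DSSP states can be written as a lookup table. The sketch below is only an illustration: the function name `reduce_dssp` is hypothetical, and treating any unlisted symbol (such as a blank) as coil is an assumption, not something stated on the slide.

```python
# Eight DSSP states grouped into three, following the CASP standard above.
DSSP_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand and bridge
    "C": "C", "T": "C", "S": "C",   # coil, turn, bend
}


def reduce_dssp(states: str) -> str:
    """Map an eight-state DSSP string to three states (H, E, C).

    Assumption: symbols not in the table (e.g. a blank) count as coil.
    """
    return "".join(DSSP_TO_THREE.get(s, "C") for s in states)


reduced = reduce_dssp("CHHHGGTTEEBSIC")
```

The output string has the same length as the input, so it can be compared position by position with a three-state prediction.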
Dihedral angles
Ramachandran plot (alpha)
Ramachandran plot (beta)
What is secondary structure prediction?
o Given a protein sequence (primary structure)
GHWIATRGQLIREAYEDYRHFSSECPFIP
o Predict its secondary structure content (C=Coils H=Alpha Helix E=Beta Strands)
CEEEEECHHHHHHHHHHHCCCHHCCCCCC
Why secondary structure prediction?
o An easier problem than 3D structure prediction (more than 40 years of history).
o Accurate secondary structure prediction can provide important information for tertiary structure prediction.
o Protein function prediction
o Protein classification
o Predicting structural change
Prediction methods
o Statistical method
  o Chou-Fasman method, GOR I-IV
o Nearest neighbors
  o NNSSP, SSPAL
o Neural network
  o PHD, Psi-Pred, J-Pred
o Support vector machine (SVM)
o HMM
Accuracy measure
o Three-state prediction accuracy: Q3
$$Q_3 = \frac{\text{correctly predicted residues}}{\text{number of residues}}$$
o A prediction of all loop: Q3 ~ 40%
o Correlation coefficients
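Q3 is a per-residue match count divided by the chain length. A minimal sketch follows, with a hypothetical function name `q3`. The reference string is the example string from the earlier slide, used here only as a stand-in for an experimentally assigned structure.

```python
def q3(predicted: str, observed: str) -> float:
    """Fraction of residues whose three-state label matches."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Stand-in reference string (taken from the earlier example slide).
reference = "CEEEEECHHHHHHHHHHHCCCHHCCCCCC"

# Baseline: predict coil everywhere.
all_coil = "C" * len(reference)
baseline = q3(all_coil, reference)
```

Comparing a method against the all-coil baseline shows how much of its Q3 comes from real signal rather than from the most common state.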
Improvement of accuracy
Year  Authors            Q3
1974  Chou & Fasman      ~50-53%
1978  Garnier            63%
1987  Zvelebil           66%
1988  Qian & Sejnowski   64.3%
1993  Rost & Sander      70.8-72.0%
1997  Frishman & Argos   <75%
1999  Cuff & Barton      72.9%
1999  Jones              76.5%
2000  Petersen et al.    77.9%
Prediction accuracy (EVA)
[Figure: distribution of per-protein accuracy for PSIPRED, SSpro, PROF, PHDpsi, JPred2, and PHD. x-axis: percentage of correctly predicted residues per protein; y-axis: percentage of all 150 proteins]
How far can we go?
o Currently ~76%
o 1/5 of proteins with more than 100 homologs: >80%
o Assignment is ambiguous (5-15%).
  o Non-unique protein structures, H-bond cutoff, etc.
o Some segments can have multiple structure types.
o Different secondary structures between homologues (~12%). Prediction limit: ~88%.
o Non-locality.
Assumptions
o The entire information for forming secondary structure is contained in the primary sequence.
o Side groups of residues will determine structure.
o Examining windows of 13 - 17 residues is sufficient to predict structure.
o Basis for window size selection:
  o α-helices 5 – 40 residues long
  o β-strands 5 – 10 residues long
Secondary structure propensity
o From PDB database, calculate the propensity for a given amino acid to adopt a certain ss-type
$$P_\alpha^i = \frac{P(\alpha \mid aa_i)}{p(\alpha)} = \frac{p(\alpha, aa_i)}{p(\alpha)\,p(aa_i)}$$
o Example:
#Ala=2,000, #residues=20,000, #helix=4,000, #Ala in helix=500
p(α, aa_i) = 500/20,000, p(α) = 4,000/20,000, p(aa_i) = 2,000/20,000
P = 500 / (4,000/10) = 1.25
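The worked example can be checked directly from the four counts. This is a small sketch with a hypothetical function name. It rearranges the ratio so that only one division is needed: p(α, aa_i) / (p(α) p(aa_i)) = (n_joint × n_total) / (n_state × n_aa).

```python
def propensity(n_aa_in_state: int, n_aa: int, n_state: int, n_total: int) -> float:
    """Propensity of an amino acid for a secondary structure type.

    n_aa_in_state: occurrences of the amino acid in that state
    n_aa:          occurrences of the amino acid overall
    n_state:       residues in that state overall
    n_total:       all residues in the database
    """
    return (n_aa_in_state * n_total) / (n_state * n_aa)


# Counts from the slide's alanine / helix example.
ala_helix = propensity(n_aa_in_state=500, n_aa=2_000, n_state=4_000, n_total=20_000)
```

A value above 1 means the amino acid occurs in that state more often than expected from its overall frequency.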
Chou-Fasman algorithm
o Helix, Strand
  1. Scan for a window of 6 residues where the average score > 1 (4 residues for helix and 3 residues for strand).
  2. Propagate in both directions until a 4 (or 3) residue window has mean propensity < 1.
  3. Move forward and repeat.
o Conflict solution: any region containing overlapping alpha-helical and beta-strand assignments is taken to be helical if the average P(helix) > P(strand), and to be a beta strand if the average P(strand) > P(helix).
o Accuracy: ~50% to ~60%
GHWIATRGQLIREAYEDYRHFSSECPFIP
Initiation
Identify regions where 4/6 have a P(H) > 1.00: the “alpha-helix nucleus”. P(H) values below are listed multiplied by 100.

Residue  T   S   P   T   A    E    L    M    R   S   T   G
P(H)     69  77  57  69  142  151  121  145  98  77  69  57
Propagation
Extend the helix in both directions until a set of four residues has an average P(H) < 1.00.

Residue  T   S   P   T   A    E    L    M    R   S   T   G
P(H)     69  77  57  69  142  151  121  145  98  77  69  57
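The initiation and propagation steps can be sketched on the example segment. The code below is an illustration under stated assumptions, not the reference algorithm. The function names are hypothetical. The four-residue stopping window is taken to be the candidate residue plus the three residues next to it on the helix side, which is one of several possible readings of the rule. Only helix scores are handled.

```python
# P(H) values (multiplied by 100) copied from the slide's example segment.
RESIDUES = "TSPTAELMRSTG"
P_HELIX = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]


def find_nuclei(scores, window=6, min_favorable=4, threshold=100):
    """Start indices of windows in which at least `min_favorable`
    residues score above `threshold` (the 4/6 rule for helices)."""
    starts = []
    for start in range(len(scores) - window + 1):
        favorable = sum(v > threshold for v in scores[start:start + window])
        if favorable >= min_favorable:
            starts.append(start)
    return starts


def extend(scores, start, end, probe=4, threshold=100):
    """Grow the half-open region [start, end) one residue at a time.

    Assumption: growth stops on a side when the `probe`-residue window
    whose outermost member is the candidate residue has a mean below
    `threshold`.
    """
    while end < len(scores):
        window = scores[max(0, end - probe + 1):end + 1]
        if sum(window) / len(window) < threshold:
            break
        end += 1
    while start > 0:
        window = scores[start - 1:start - 1 + probe]
        if sum(window) / len(window) < threshold:
            break
        start -= 1
    return start, end


nuclei = find_nuclei(P_HELIX)
helix_start, helix_end = extend(P_HELIX, nuclei[0], nuclei[0] + 6)
helix = RESIDUES[helix_start:helix_end]
```

A full implementation would run the same procedure with strand propensities and then resolve overlaps by comparing the average P(helix) and P(strand) over the contested region.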
Nearest neighbor method
o Predict secondary structure of the central residue of a given segment from homologous segments (neighbors)
  (i) From database, find some number of the closest sequences to a subsequence defined by a window around the central residue, or
  (ii) Compute K best non-intersecting local alignments of a query sequence with each sequence.
o Use max(n_α, n_β, n_c) for neighbor consensus or max(s_α, s_β, s_c) for consensus sequence hits
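Both consensus rules reduce to picking the state with the largest tally. The sketch below uses hypothetical function names and toy inputs. It assumes each neighbor contributes the state of its residue aligned to the central position, and that ties are broken in the order H, E, C.

```python
from collections import Counter

STATES = "HEC"  # helix (α), strand (β), coil (c)


def consensus_by_count(neighbor_states):
    """max(n_α, n_β, n_c): the state seen in the most neighbors."""
    counts = Counter(neighbor_states)
    return max(STATES, key=lambda s: counts[s])


def consensus_by_score(neighbor_states, scores):
    """max(s_α, s_β, s_c): the state with the largest summed alignment score."""
    totals = {s: 0.0 for s in STATES}
    for state, score in zip(neighbor_states, scores):
        totals[state] += score
    return max(STATES, key=lambda s: totals[s])


# Toy data: state of the central residue in five neighbors, with alignment scores.
states = ["H", "H", "E", "C", "H"]
scores = [1.0, 0.5, 4.0, 0.2, 0.8]

by_count = consensus_by_count(states)
by_score = consensus_by_score(states, scores)
```

The two rules can disagree, as in the toy data: three weak helix neighbors outvote one strand neighbor by count, but not by summed score.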
Environment preference score
o Each amino acid has a preference for specific structural environments.
o Structural variables: secondary structure, solvent accessibility
o Non-redundant protein structure database: FSSP
$$S(i,j) = \log\frac{p(aa_i \mid E_j)}{p(aa_i)} = \log\frac{p(aa_i, E_j)}{p(aa_i)\,p(E_j)}$$
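The two forms of the score are the same quantity, which a short numerical check makes concrete. This sketch follows the formula as written on the slide. The function names, the toy counts, and the choice of the natural logarithm are assumptions.

```python
import math


def score_from_joint(n_joint, n_aa, n_env, n_total):
    """log [ p(aa, E) / (p(aa) p(E)) ] computed from raw counts."""
    p_joint = n_joint / n_total
    p_aa = n_aa / n_total
    p_env = n_env / n_total
    return math.log(p_joint / (p_aa * p_env))


def score_from_conditional(n_joint, n_aa, n_env, n_total):
    """log [ p(aa | E) / p(aa) ] computed from raw counts."""
    p_aa_given_env = n_joint / n_env
    p_aa = n_aa / n_total
    return math.log(p_aa_given_env / p_aa)


# Toy counts: amino acid seen 2,000 times, environment 4,000 times,
# both together 500 times, in a database of 20,000 residues.
s_joint = score_from_joint(500, 2_000, 4_000, 20_000)
s_cond = score_from_conditional(500, 2_000, 4_000, 20_000)
```

With this form, a positive score means the amino acid appears in the environment more often than chance would predict, and a negative score means less often.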
“Singleton” score matrix

       Helix                   Sheet                   Loop
      Buried   Inter Exposed  Buried   Inter Exposed  Buried   Inter Exposed
ALA   -0.578  -0.119  -0.160   0.010   0.583   0.921   0.023   0.218   0.368
ARG    0.997  -0.507  -0.488   1.267  -0.345  -0.580   0.930  -0.005  -0.032
ASN    0.819   0.090  -0.007   0.844   0.221   0.046   0.030  -0.322  -0.487
ASP    1.050   0.172  -0.426   1.145   0.322   0.061   0.308  -0.224  -0.541
CYS   -0.360   0.333   1.831  -0.671   0.003   1.216  -0.690  -0.225   1.216
GLN    1.047  -0.294  -0.939   1.452   0.139  -0.555   1.326   0.486  -0.244
GLU    0.670  -0.313  -0.721   0.999   0.031  -0.494   0.845   0.248  -0.144
GLY    0.414   0.932   0.969   0.177   0.565   0.989  -0.562  -0.299  -0.601
HIS    0.479  -0.223   0.136   0.306  -0.343  -0.014   0.019  -0.285   0.051
ILE   -0.551   0.087   1.248  -0.875  -0.182   0.500  -0.166   0.384   1.336
LEU   -0.744  -0.218   0.940  -0.411   0.179   0.900  -0.205   0.169   1.217
LYS    1.863  -0.045  -0.865   2.109  -0.017  -0.901   1.925   0.474  -0.498
MET   -0.641  -0.183   0.779  -0.269   0.197   0.658  -0.228   0.113   0.714
PHE   -0.491   0.057   1.364  -0.649  -0.200   0.776  -0.375  -0.001   1.251
PRO    1.090   0.705   0.236   1.249   0.695   0.145  -0.412  -0.491  -0.641
SER    0.350   0.260  -0.020   0.303   0.058  -0.075  -0.173  -0.210  -0.228
THR    0.291   0.215   0.304   0.156  -0.382  -0.584  -0.012  -0.103  -0.125
TRP   -0.379  -0.363   1.178  -0.270  -0.477   0.682  -0.220  -0.099   1.267
TYR   -0.111  -0.292   0.942  -0.267  -0.691   0.292  -0.015  -0.176   0.946
VAL   -0.374   0.236   1.144  -0.912  -0.334   0.089  -0.030   0.309   0.998
Total score
o Alignment score is the sum of scores in a window of length l:
$$\mathrm{Score}(i,j) = \sum_{k=-l/2}^{l/2} \left[ M(i+k,\, j+k) + c\,S(i+k,\, j+k) \right]$$
Query (positions i-4 ... i+4):             T R G Q L I R E A Y E D Y R H F S S E C P F I P
Database segment (positions j-4 ... j+4):  . . . E C Y E Y B R H R . . . .
Secondary structure of database segment:   L H H H H H H L L
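The window sum can be sketched with the two scoring terms passed in as functions. Everything concrete here is hypothetical: the function names, the identity-based stand-in for the substitution score M, the toy singleton score S, and the choice to skip positions that fall off either sequence. With an odd window length l, the index k runs over l positions centered on (i, j).

```python
def window_score(query, target, target_env, i, j, length,
                 pair_score, singleton_score, c=1.0):
    """Sum of M(i+k, j+k) + c * S(i+k, j+k) over a window centered on (i, j)."""
    half = length // 2
    total = 0.0
    for k in range(-half, half + 1):
        qi, tj = i + k, j + k
        if not (0 <= qi < len(query) and 0 <= tj < len(target)):
            continue  # assumption: positions past either end contribute nothing
        total += pair_score(query[qi], target[tj])
        total += c * singleton_score(query[qi], target_env[tj])
    return total


def toy_pair_score(a, b):
    """Hypothetical stand-in for a PAM/BLOSUM lookup: 1 for identity, else 0."""
    return 1.0 if a == b else 0.0


def toy_singleton_score(aa, env):
    """Hypothetical stand-in for the singleton matrix lookup."""
    return 0.5 if env == "H" else 0.0


score = window_score("ACDEF", "ACDKF", "LHHHL", i=2, j=2, length=5,
                     pair_score=toy_pair_score,
                     singleton_score=toy_singleton_score, c=2.0)
```

Replacing `toy_pair_score` with a profile lookup keeps the same loop, which is how the sequence-profile variant later in the deck differs from this one.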
Neighbors
[Figure: secondary structure strings (H, E, L) of neighbors 1, 2, 3, 4, ..., n, n+1 aligned to the query window, with alignment scores S1, S2, S3, S4, ..., Sn, Sn+1]
o Predict with max(n_α, n_β, n_L) or max(Σs_α, Σs_β, Σs_L)
Evolutionary information
o “All naturally evolved proteins with more than 35% pairwise identical residues over more than 100 aligned residues have similar structures.”
o Stability of structure w.r.t. sequence divergence (<12% difference in secondary structure).
o Position-specific sequence profile, containing crucial information on evolution of protein family, can help secondary structure prediction (increase information content).
o Gaps rarely occur in helix and strand.
o ~1.4%/year increase in Q3 due to database growth during past ~10 years.
How to use it
o Sequence-profile alignment.
o Compare a sequence against protein family.
o More specific.
o BLAST vs. PSI-BLAST.
o Look up PSSM instead of PAM or BLOSUM.
$$\mathrm{Score}(i,j) = \sum_{k=-l/2}^{l/2} \left[ \mathrm{PSSM}(j+k,\, i+k) + c\,S(i+k,\, j+k) \right]$$
In PSSM(j+k, i+k), the first argument is the position and the second is the amino acid type.
Neurons
normal state
addictive state
Neural network
[Figure: feed-forward multilayer network with input layer, hidden layer, and output layer, connected by junctions J1, J2, J3, J4]
o Neurons: input signals are summed and turned into zero or one
o Feed-forward multilayer network
Neural network training
[Figure: training cycle] Enter sequences → Compare prediction to reality → Adjust weights → repeat
Simple neural network
$$\mathrm{out}_0 = J_{11}\,\mathrm{in}_1 + J_{12}\,\mathrm{in}_2$$
$$\mathrm{out} = \tanh(\mathrm{out}_0)$$
Error = | out_net - out_desired |
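The two-input network is small enough to evaluate by hand or in a few lines. This sketch uses hypothetical function names and arbitrary toy weights. It only computes the forward pass and the error, not the weight update.

```python
import math


def simple_net(in1: float, in2: float, j11: float, j12: float) -> float:
    """out = tanh(J11 * in1 + J12 * in2)."""
    out0 = j11 * in1 + j12 * in2
    return math.tanh(out0)


def error(out_net: float, out_desired: float) -> float:
    """Absolute difference between network output and target."""
    return abs(out_net - out_desired)


# Toy inputs and junction weights (arbitrary values for illustration).
out = simple_net(in1=1.0, in2=0.0, j11=0.5, j12=-0.3)
err = error(out, out_desired=1.0)
```

Training amounts to changing `j11` and `j12` so that `err`, accumulated over many examples, goes down.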
Training a neural network
[Figure: error as a function of the junction weights]
Simple neural network with hidden layer
$$\mathrm{out}_i = f\left( \sum_j J^{2}_{ij}\, f\left( \sum_k J^{1}_{jk}\, \mathrm{in}_k \right) \right)$$
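The nested sum is a two-stage forward pass: inputs to hidden units through the first set of junctions, then hidden units to outputs through the second. The sketch below assumes f = tanh, as on the simpler network, and stores the weights as plain nested lists. The function name and the toy weights are hypothetical.

```python
import math


def forward(inputs, j1, j2, f=math.tanh):
    """Compute out_i = f(sum_j J2[i][j] * f(sum_k J1[j][k] * in_k)).

    j1[j][k]: weight from input k to hidden unit j
    j2[i][j]: weight from hidden unit j to output i
    """
    hidden = [f(sum(w * x for w, x in zip(row, inputs))) for row in j1]
    return [f(sum(w * h for w, h in zip(row, hidden))) for row in j2]


# Toy network: 2 inputs, 2 hidden units, 1 output.
j1 = [[0.5, -0.3],
      [0.0, 1.0]]
j2 = [[1.0, 1.0]]
outputs = forward([1.0, 0.0], j1, j2)
```

The same function covers any layer sizes, as long as each row of `j1` has one weight per input and each row of `j2` has one weight per hidden unit.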

Neural network for secondary structure
[Figure: a window over the sequence D(L) R(E) Q(E) G(E) F(E) V(E) P(E) A(H) A(H) Y(H) V(E) K(E) K(E) feeds input units labeled A C D E F G H I K L M N P Q R S T V W Y and "."; the output units are H, E, and L]
PsiPred
o D. Jones, J. Mol. Biol. 292, 195 (1999).
o Method: Neural network
o Input data: PSSM generated by PSI-BLAST
o Bigger and better sequence database
  o Combining several databases and data filtering
o Training and test sets preparation
  o Secondary structure prediction only makes sense for proteins with no homologous structure.
  o No sequence & structural homologues between training and test sets by PSI-BLAST (mimicking realistic situation).
Psi-Pred (details)
o Window size = 15
o Two networks
o First network (sequence-to-structure):
  o 315 = (20 + 1) × 15 inputs
  o Extra unit to indicate where the window spans either the N or C terminus
  o Data are scaled to [0-1] range by using 1/[1+exp(-x)]
  o 75 hidden units
  o 3 outputs (H, E, L)
o Second network (structure-to-structure):
  o Structural correlation between adjacent sequences
  o 60 = (3 + 1) × 15 inputs
  o 60 hidden units
  o 3 outputs
o Accuracy ~76%
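The 315-value input of the first network can be assembled from a PSSM with a sliding window. This is a sketch under assumptions: the function names are hypothetical, the PSSM is taken to be a list of rows with 20 raw scores each, and positions beyond a terminus are encoded as 20 zeros with the extra unit set to 1. The slides give the input sizes and the logistic scaling but not these encoding details.

```python
import math

WINDOW = 15   # window size from the slide
N_AA = 20     # PSSM columns per position


def logistic(x: float) -> float:
    """Scale a raw profile score to the 0-1 range: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def encode_window(pssm, center):
    """Build the (20 + 1) * 15 = 315 inputs for the residue at `center`."""
    half = WINDOW // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(logistic(v) for v in pssm[pos])
            features.append(0.0)            # inside the chain
        else:
            features.extend([0.0] * N_AA)   # assumption: zero padding
            features.append(1.0)            # window runs past a terminus
    return features


# Toy profile: 10 positions with all scores equal to zero.
toy_pssm = [[0.0] * N_AA for _ in range(10)]
first_residue_inputs = encode_window(toy_pssm, center=0)
```

For the first residue of the toy profile, seven of the fifteen window positions lie before the N terminus, so seven of the extra units are switched on.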
Reading Assignments
o Suggested reading:
  o Chapter 15 in “Current Topics in Computational Molecular Biology”, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press. 2002.
o Optional reading:
  o Review by Burkhard Rost: http://cubic.bioc.columbia.edu/papers/2003_rev_dekker/paper.html
Project Assignment
Develop a program that implements the Chou-Fasman algorithm.
1. The TA will give you a matrix table of Chou-Fasman indices.
2. Use FASTA as the input format for the sequence.
3. Output format:
KVFGRCELAA AMKRHGLDNY RGYSLGNWVC AAKFESNFNT QATNRNTDGS
HHHHHH HHHH
HHHHHH HHHHHH
EEE
TDYGILQINS RWWCNDGRTP GSRNLCNIPC
EEE
EE
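The input and output handling for the project can be sketched separately from the prediction itself. The function names below are hypothetical. The layout of blocks of 10 with 50 residues per line is inferred from the sample output above, and the FASTA reader handles only the basic header-plus-sequence form.

```python
def read_fasta(text: str):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def in_blocks(s: str, block: int = 10, per_line: int = 50):
    """Split a string into lines of `per_line` characters, in blocks of `block`."""
    lines = []
    for start in range(0, len(s), per_line):
        row = s[start:start + per_line]
        lines.append(" ".join(row[b:b + block] for b in range(0, len(row), block)))
    return lines


records = read_fasta(">demo\nKVFGRCELAA\nAMKRHGLDNY\n")
sequence_lines = in_blocks(records[0][1])
```

Passing the prediction string, with spaces at unassigned positions, through `in_blocks` with the same settings keeps it aligned under the sequence.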