Protein Secondary Structures
Assignment and prediction
Pernille Andersen
23.04.2007
Outline
• What is protein secondary structure
• How can it be used?
• Different prediction methods
– Alignment to homologues
– Propensity methods
– Neural networks
• Evaluation of prediction methods
• Links to prediction servers
Secondary Structure Elements
ß-strand
Helix
Bend
Turn
Use of secondary structure
• Classification of protein structures
• Definition of loops (active sites)
• Use in fold recognition methods
• Improvements of alignments
• Definition of domain boundaries
• Input for a number of alternative bioinformatics tools
Classification of secondary
structure
• Defining features
– Dihedral angles
– Hydrogen bonds
– Geometry
• Assigned manually by crystallographers or
• Automatic
– DSSP (Kabsch & Sander,1983)
– STRIDE (Frishman & Argos, 1995)
– DSSPcont (Andersen et al., 2002)
Dihedral Angles
From http://www.imb-jena.de
phi dihedral angle of the N-Calpha bond
psi dihedral angle of the Calpha-C bond
omega dihedral angle of the C-N (peptide) bond
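A dihedral angle can be computed directly from the coordinates of four consecutive atoms. The sketch below is a minimal illustration in plain Python, not taken from the slides; the function name `dihedral` and the helper names are hypothetical. With the standard backbone definitions, phi uses C(i-1), N(i), CA(i), C(i), psi uses N(i), CA(i), C(i), N(i+1), and omega uses CA(i), C(i), N(i+1), CA(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond.

    Each point is an (x, y, z) tuple. For phi pass C(i-1), N(i), CA(i), C(i);
    for psi pass N(i), CA(i), C(i), N(i+1).
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))
```

In practice the four points would be read from the ATOM records of a PDB file; the parsing step is left out here.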
Helices
              phi (deg)   psi (deg)   H-bond pattern
----------------------------------------------------
alpha-helix     -57.8       -47.0         i+4
pi-helix        -57.1       -69.7         i+5
3₁₀ helix       -74.0        -4.0         i+3
(omega = 180 deg )
From http://www.imb-jena.de
Beta Strands
              phi (deg)   psi (deg)   omega (deg)
-------------------------------------------------
beta strand     -120         120          180
Antiparallel
Parallel
From http://broccoli.mfn.ki.se/pps_course_96/
Secondary Structure Elements
ß-strand
Helix
Bend
Turn
Secondary Structure Type Descriptions
* H = alpha helix
* G = 3₁₀-helix
* I = 5 helix (pi helix)
* E = extended strand, participates in beta ladder
* B = residue in isolated beta-bridge
* T = hydrogen bonded turn
* S = bend
* C = coil
Automatic assignment programs
• DSSP ( http://www.cmbi.kun.nl/gv/dssp/ )
• STRIDE (http://bioweb.pasteur.fr/seqanal/interfaces/stride.html)
• DSSPcont ( http://cubic.bioc.columbia.edu/services/DSSPcont/ )
• The protein data bank visualizes DSSP assignments on structures in
the data base (go to sequence details tab)
Example DSSP output:
  #  RESIDUE AA STRUCTURE BP1 BP2  ACC   N-H-->O   O-->H-N   N-H-->O   O-->H-N     TCO  KAPPA   ALPHA     PHI     PSI   X-CA   Y-CA   Z-CA
  1      4 A E               0   0  205    0, 0.0    2,-0.3    0, 0.0    0, 0.0   0.000  360.0   360.0   360.0   113.5    5.7   42.2   25.1
  2      5 A H               0   0  127    2, 0.0    2,-0.4   21, 0.0   21, 0.0  -0.987  360.0  -152.8  -149.1   154.0    9.4   41.3   24.7
  3      6 A V               0   0   66   -2,-0.3   21,-2.6    2, 0.0    2,-0.5  -0.995    4.6  -170.2  -134.3   126.3   11.5   38.4   23.5
  4      7 A I  E     -A    23  0A  106   -2,-0.4    2,-0.4   19,-0.2   19,-0.2  -0.976   13.9  -170.8  -114.8   126.6   15.0   37.6   24.5
  5      8 A I  E     -A    22  0A   74   17,-2.8   17,-2.8   -2,-0.5    2,-0.9  -0.972   20.8  -158.4  -125.4   129.1   16.6   34.9   22.4
  6      9 A Q  E     -A    21  0A   86   -2,-0.4    2,-0.4   15,-0.2   15,-0.2  -0.910   29.5  -170.4   -98.9   106.4   19.9   33.0   23.0
  7     10 A A  E     +A    20  0A   18   13,-2.5   13,-2.5   -2,-0.9    2,-0.3  -0.852   11.5   172.8  -108.1   141.7   20.7   31.8   19.5
  8     11 A E  E     +A    19  0A   63   -2,-0.4    2,-0.3   11,-0.2   11,-0.2  -0.933    4.4   175.4  -139.1   156.9   23.4   29.4   18.4
  9     12 A F  E     -A    18  0A   31    9,-1.5    9,-1.8   -2,-0.3    2,-0.4  -0.967   13.3  -160.9  -160.6   151.3   24.4   27.6   15.3
 10     13 A Y  E     -A    17  0A   36   -2,-0.3    2,-0.4    7,-0.2    7,-0.2  -0.994   16.5  -156.0  -136.8   132.1   27.2   25.3   14.1
 11     14 A L  E  >> -A    16  0A   24    5,-3.2    4,-1.7   -2,-0.4    5,-1.3  -0.929   11.7  -122.6  -120.0   133.5   28.0   24.8   10.4
 12     15 A N  T  45S+      0   0   54   -2,-0.4   -2, 0.0    2,-0.2    0, 0.0  -0.884   84.3     9.0  -113.8   150.9   29.7   22.0    8.6
 13     16 A P  T  45S+      0   0  114    0, 0.0   -1,-0.2    0, 0.0   -2, 0.0  -0.963  125.4    60.5   -86.5     8.5   32.0   21.6    6.8
 14     17 A D  T  45S-      0   0   66    2,-0.1   -2,-0.2    1,-0.1    3,-0.1   0.752   89.3  -146.2   -64.6   -23.0   33.0   25.2    7.6
Secondary Structure Prediction
• What to predict?
– All 8 types or pool types into groups
DSSP                                Q3
* H = alpha helix                   H
* G = 3₁₀-helix                     H
* I = 5 helix (pi helix)            H
* E = extended strand               E
* B = beta-bridge                   E
* T = hydrogen bonded turn          C
* S = bend                          C
* C = coil                          C
Secondary Structure Prediction
• What to predict?
– All 8 types or pool types into groups
Straight HEC                        Q3
* H = alpha helix                   H
* E = extended strand               E
* T = hydrogen bonded turn          C
* S = bend                          C
* C = coil                          C
* G = 3₁₀-helix                     C
* I = 5 helix (pi helix)            C
* B = beta-bridge                   C
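The two pooling schemes listed on these slides can be written as simple lookup tables. The following is a small sketch; the names `DSSP_POOLED`, `STRAIGHT_HEC`, and `reduce_states` are hypothetical, and treating a blank or "-" as coil is an assumption about how unassigned residues are written.

```python
# Pooling in which helix-like and strand-like DSSP classes are grouped.
DSSP_POOLED = {"H": "H", "G": "H", "I": "H",
               "E": "E", "B": "E",
               "T": "C", "S": "C", "C": "C"}

# "Straight HEC": only H and E keep their class, everything else becomes coil.
STRAIGHT_HEC = {"H": "H", "E": "E"}


def reduce_states(dssp_string, scheme=DSSP_POOLED):
    """Map an 8-state DSSP string to a 3-state (H/E/C) string.

    Symbols missing from the scheme (including ' ' and '-', assumed here
    to mean unassigned) are mapped to coil.
    """
    return "".join(scheme.get(symbol, "C") for symbol in dssp_string)
```

The choice of scheme matters when comparing published accuracies, since the same prediction scores differently against differently pooled reference assignments.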
Secondary Structure Prediction
• Simple alignments
– Align to a close homolog for which the structure has been experimentally solved.
• Heuristic methods (e.g., Chou-Fasman, 1974)
– Apply scores for each amino acid and sum up over a window.
• Neural networks
– Raw sequence (late 80's)
– BLOSUM matrix (e.g., PHD, early 90's)
– Position-specific alignment profiles (e.g., PSIPRED, late 90's)
– Multiple networks balloting, probability conversion, output expansion (Petersen et al., 2000).
Improvement of accuracy
Year   Method               Accuracy
1974   Chou & Fasman        ~50-53%
1978   Garnier              63%
1987   Zvelebil             66%
1988   Qian & Sejnowski     64.3%
1993   Rost & Sander        70.8-72.0%
1997   Frishman & Argos     <75%
1999   Cuff & Barton        72.9%
1999   Jones                76.5%
2000   Petersen et al.      77.9%
Simple Alignments
• A solved structure of a homolog of the query is needed
• Homologous proteins have ~88% identical (3-state) secondary structure
• If no close homologue can be identified, alignments will give almost random results
Propensities: Amino acid preferences in α-helix
Propensities: Amino acid preferences in β-strand
Propensities: Amino acid preferences in coil
Chou-Fasman propensities
Name   P(a)   P(b)   P(turn)   f(i)    f(i+1)   f(i+2)   f(i+3)
Ala    142     83      66      0.060   0.076    0.035    0.058
Arg     98     93      95      0.070   0.106    0.099    0.085
Asp    101     54     146      0.147   0.110    0.179    0.081
Asn     67     89     156      0.161   0.083    0.191    0.091
Cys     70    119     119      0.149   0.050    0.117    0.128
Glu    151     37      74      0.056   0.060    0.077    0.064
Gln    111    110      98      0.074   0.098    0.037    0.098
Gly     57     75     156      0.102   0.085    0.190    0.152
His    100     87      95      0.140   0.047    0.093    0.054
Ile    108    160      47      0.043   0.034    0.013    0.056
Leu    121    130      59      0.061   0.025    0.036    0.070
Lys    114     74     101      0.055   0.115    0.072    0.095
Met    145    105      60      0.068   0.082    0.014    0.055
Phe    113    138      60      0.059   0.041    0.065    0.065
Pro     57     55     152      0.102   0.301    0.034    0.068
Ser     77     75     143      0.120   0.139    0.125    0.106
Thr     83    119      96      0.086   0.108    0.065    0.079
Trp    108    137      96      0.077   0.013    0.064    0.167
Tyr     69    147     114      0.082   0.065    0.114    0.125
Val    106    170      50      0.062   0.048    0.028    0.053
Chou-Fasman
• Generally applicable
• Works for sequences with no solved
homologs
• But the accuracy is low!
• The problem is that the method does
not use enough information about the
structural context of a residue
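To make the propensity idea concrete, here is a deliberately simplified sketch that averages the P(a) and P(b) values from the Chou-Fasman table over a sliding window. It is not the original Chou-Fasman procedure, which uses nucleation and extension rules; the window size of 5, the threshold of 100, and the function name `predict` are assumptions chosen for illustration.

```python
# P(a) and P(b) from the Chou-Fasman table, keyed by one-letter code.
P_ALPHA = {"A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
           "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
           "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106}
P_BETA = {"A": 83, "R": 93, "D": 54, "N": 89, "C": 119, "E": 37, "Q": 110,
          "G": 75, "H": 87, "I": 160, "L": 130, "K": 74, "M": 105, "F": 138,
          "P": 55, "S": 75, "T": 119, "W": 137, "Y": 147, "V": 170}


def predict(sequence, window=5, threshold=100):
    """Assign H, E or C per residue from window-averaged propensities.

    Simplified illustration only: a residue is called helix when the mean
    P(a) in its window reaches the threshold and exceeds the mean P(b),
    strand when the mean P(b) reaches the threshold, and coil otherwise.
    """
    half = window // 2
    states = []
    for i in range(len(sequence)):
        # The window is truncated at the sequence ends.
        segment = sequence[max(0, i - half): i + half + 1]
        avg_a = sum(P_ALPHA[r] for r in segment) / len(segment)
        avg_b = sum(P_BETA[r] for r in segment) / len(segment)
        if avg_a >= threshold and avg_a > avg_b:
            states.append("H")
        elif avg_b >= threshold:
            states.append("E")
        else:
            states.append("C")
    return "".join(states)
```

Because each call looks only at a few neighbouring residues, the sketch also shows the limitation named on the slide: no information about the wider structural context enters the decision.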
Neural Networks
• Benefits
– Generally applicable
– Can capture higher order correlations
– Inputs other than sequence information
• Drawbacks
– Needs a large amount of data (different solved
structures). However, today nearly 7000 structures
with low sequence identity/high resolution are solved
– Complex method with several pitfalls
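Architecture
[Figure: feed-forward neural network. A window slides along the sequence IKEEHVIIQAEFYLNPDQSGEF…; the residues in the window feed the input layer, which is connected through weights to a hidden layer and an output layer with the classes H, E and C.]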
Sparse encoding
        Inp Neuron
AAcid    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
A        1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
R        0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
N        0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
D        0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
C        0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Q        0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
E        0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0
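Input Layer
[Figure: input layer with sparse encoding. Each residue in the sequence window (IKEEHVIIQAE) is presented to the network as 20 input values, all 0 except a single 1.]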
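Sparse encoding of a sequence window can be sketched in a few lines. The amino acid order below follows the rows of the sparse encoding table (A, R, N, D, C, Q, E, ...) and the BLOSUM column order; the function names and the choice to encode positions beyond the sequence ends as all zeros are assumptions, not something the slides specify.

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYV"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot(residue):
    """Return 20 values: a single 1 at the residue's index, 0 elsewhere."""
    vector = [0] * len(ALPHABET)
    if residue in INDEX:
        vector[INDEX[residue]] = 1
    return vector


def encode_window(sequence, center, window=7):
    """Concatenate the sparse encodings of all residues in a window.

    Positions that fall outside the sequence are encoded as all zeros
    (an assumption made for this sketch).
    """
    half = window // 2
    encoded = []
    for pos in range(center - half, center + half + 1):
        residue = sequence[pos] if 0 <= pos < len(sequence) else None
        encoded.extend(one_hot(residue))
    return encoded
```

A window of 7 residues therefore yields 7 × 20 = 140 input values, and the network predicts the class of the central residue only.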
BLOSUM 62
     A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A    4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R   -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N   -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D   -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C    0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q   -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E   -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G    0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H   -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I   -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L   -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K   -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M   -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F   -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P   -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S    1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T    0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W   -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y   -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V    0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
B   -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3
Z   -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2
X    0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1
*   -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4
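Input Layer
[Figure: input layer with BLOSUM encoding. Each residue in the sequence window (IKEEHVIIQAE) is presented to the network as its BLOSUM 62 scores instead of 0s and a single 1.]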
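BLOSUM encoding replaces the single 1 of sparse encoding with the residue's row of substitution scores, so similar amino acids get similar input vectors. The sketch below copies only five rows (I, K, E, H, V) from the BLOSUM 62 matrix shown earlier, enough to encode the start of the example sequence; the function name `blosum_encode` is hypothetical, and feeding raw integer scores without rescaling is an assumption.

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYV"

# Partial BLOSUM 62: only the rows needed for this example.
BLOSUM62_ROWS = {
    "I": [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3],
    "K": [-1, 2, 0, -1, -3, 1, 1, -2, -1, -3, -2, 5, -1, -3, -1, 0, -1, -3, -2, -2],
    "E": [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
    "H": [-2, 0, 1, -1, -3, 0, 0, -2, 8, -3, -3, -1, -2, -1, -2, -1, -2, -2, 2, -3],
    "V": [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4],
}


def blosum_encode(window):
    """Concatenate the BLOSUM 62 rows of the residues in a window.

    Raises KeyError for residues whose row was not copied into
    BLOSUM62_ROWS above.
    """
    encoded = []
    for residue in window:
        encoded.extend(BLOSUM62_ROWS[residue])
    return encoded
```

For the window IKEEHVI this gives 140 inputs, the same count as sparse encoding, but an isoleucine and a valine now produce correlated inputs rather than orthogonal ones.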
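Secondary networks (Structure-to-Structure)
[Figure: second-level network. A window over the H/E/C predictions made by the first network for the sequence IKEEHVIIQAEFYLNPDQSGEF… feeds the input layer, which is connected through weights to a hidden layer and an output layer with the classes H, E and C.]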
PHD method
(Rost and Sander)
• Combine neural networks with sequence profiles
– 6-8 Percentage points increase in prediction accuracy
over standard neural networks
• Use second layer “Structure to structure”
network to filter predictions
• Jury of predictors
• Set up as mail server
PSI-Pred (Jones)
• Use alignments from iterative sequence
searches (PSI-Blast) as input to a neural
network
• Better predictions due to better sequence
profiles
• Available as stand alone program and via
the web
Position specific scoring matrices (PSI-BLAST profiles)
Pos AA    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
  1 I    -2 -4 -5 -5 -2 -4 -4 -5 -5  6  0 -4  0 -2 -4 -4 -2 -4 -3  4
  2 K    -1 -1 -2 -2 -3 -1  3 -3 -2 -2 -3  4 -2 -4 -3  1  1 -4 -3  2
  3 E     5 -3 -3 -3 -3  3  1 -2 -3 -3 -3 -2 -2 -4 -3 -1 -2 -4 -3  1
  4 E    -4 -3  2  5 -6  1  5 -4 -3 -6 -6 -2 -5 -6 -4 -2 -3 -6 -5 -5
  5 H    -4  2  1  1 -5  1 -2 -4  9 -5 -2 -3 -4 -4 -5 -3 -4 -5  1 -5
  6 V    -3  0 -4 -5 -4 -4 -2 -3 -5  1 -2  1  0  1 -4 -3  3 -5 -3  5
  7 I     0 -2 -4  1 -4 -2 -4 -4 -5  1  0 -2  0  2 -5  1 -1 -5 -3  4
  8 I    -3  0 -5 -5 -4 -2 -5 -6  1  2  4 -4 -1  0 -5 -2  0 -3  5 -1
  9 Q    -2 -3 -2 -3 -5  4 -1  3  5 -5 -3 -3 -4 -2 -4  2 -1 -4  2 -2
 10 A     2 -4 -4 -3  2 -3 -1 -4 -2  1 -1 -4 -3 -4  1  2  3 -5 -1  1
 11 E    -1  3  1  1 -1  0  1 -4 -3 -1 -3  0  3 -5  4 -1 -3 -6 -3 -1
 12 F    -3 -5 -5 -5 -4 -4 -4 -1 -1  1  1 -5  2  5 -1 -4 -4 -3  5  2
 13 Y     3 -5 -5 -6  3 -4 -5 -2 -1  0 -4 -5 -3  3 -5 -2 -2 -2  7  1
 14 L    -1 -3 -4 -2  1  5  1 -1 -1 -1  1 -3 -3  1 -5 -1 -1 -2  3 -2
 15 N    -1 -4  4  1  5 -3 -4  2 -4 -4 -4 -3 -2 -4 -5  2  0 -5  0  0
 16 P    -2  4 -4 -4 -5  0 -3  3  2 -5 -4  0 -4 -3  0  1 -2 -1  5 -3
 17 D    -3 -2  1  5 -6 -2  2  2 -1 -2 -2 -3 -5 -4 -5 -1  2 -6 -3 -4
Several different architectures
• Sequence-to-structure
– Window sizes 15, 17, 19 and 21
– Hidden units 50 and 75
– 10-fold cross validation => 80 predictions
– Output: C C H H C C C
• Structure-to-structure
– Window size 17
– Hidden units 40
– 10-fold cross validation => 800 predictions
– Output: C C C C C C C
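The counts on this slide follow from multiplying out the settings. The enumeration below is one reading that is consistent with the numbers given: 4 window sizes × 2 hidden-layer sizes × 10 cross-validation folds give 80 sequence-to-structure networks, and passing each of those through structure-to-structure networks trained on 10 folds gives 800. The second step is an assumption about how the 800 arises; the slide only states the totals.

```python
from itertools import product

WINDOW_SIZES = (15, 17, 19, 21)
HIDDEN_UNITS = (50, 75)
FOLDS = 10

# First level: every combination of window size, hidden units and fold.
seq_to_struct = list(product(WINDOW_SIZES, HIDDEN_UNITS, range(FOLDS)))

# Second level (assumed): each first-level output is filtered by one
# structure-to-structure network per cross-validation fold.
struct_to_struct = list(product(seq_to_struct, range(FOLDS)))

print(len(seq_to_struct), len(struct_to_struct))
```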
The majority rules
• Combining predictions from several networks improves the prediction
• Combinations of 800 different networks were used in the method described by
Petersen TN et al. 2000, Prediction of protein secondary structure at 80% accuracy. Proteins 41:17-20
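The simplest way to combine several networks is a per-residue vote. The sketch below shows plain majority voting over 3-state prediction strings; it is an illustration of the idea only, not the combination scheme of Petersen et al., and the function name `majority_vote` is hypothetical. Ties go to the class that appears first among the predictions at that position.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine equal-length H/E/C strings by per-position majority.

    Ties are resolved in favour of the class seen first at that position,
    which is how Counter.most_common orders equal counts.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)
```

For example, `majority_vote(["HHHCC", "HHECC", "CHHCE"])` keeps a class at each position only if most networks agree on it, which suppresses isolated errors made by a single network.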
Activities to probabilities
[Figure: conversion of network output activities (Helix, Strand, Coil) into calculated probabilities; panel title: Coil conversion.]
Benchmarking secondary
structure predictions
• EVA
– Newly solved structures are sent to prediction
servers.
– Every week
http://cubic.bioc.columbia.edu/eva/sec/res_sec.html
EVA results (Rost et al., 2001)
• PROFphd       77.0%
• PSIPRED       76.8%
• SAM-T99sec    76.1%
• SSpro         76.0%
• Jpred2        75.5%
• PHD           71.7%
– Cubic.columbia.edu/eva
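Three-state accuracies such as these are commonly reported as Q3, the percentage of residues whose predicted class (H, E or C) matches the class assigned from the solved structure. A minimal sketch, with the hypothetical function name `q3`:

```python
def q3(predicted, observed):
    """Percentage of positions where predicted and observed classes agree."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    if not observed:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)
```

The observed string would come from an assignment program such as DSSP after pooling to three states, so the pooling scheme affects the reported value.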
Links to servers
• Several links:
http://cubic.bioc.columbia.edu/eva/doc/explain_methods.html#type_sec
• ProfPHD
http://www.predictprotein.org/
• PSIPRED
http://bioinf.cs.ucl.ac.uk/psipred/
• JPred
http://www.compbio.dundee.ac.uk/~www-jpred/
• SAM T02
http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
Practical Conclusions
• If you need a secondary structure prediction, use
the newer methods based on advanced machine
learning methods such as:
– ProfPHD
– PSIPRED
– JPred
– SAM T02
• And not one of the older ones such as:
– Chou-Fasman
– Garnier
