Predicting local
Protein Structure
Morten Nielsen
Use of local structure prediction
• Classification of protein structures
• Definition of loops (active sites)
• Relevant sites for mutagenesis
• Use in fold recognition methods
• Improvements of alignments
• Definition of domain boundaries
• Disease associated SNPs
Protein Secondary Structure
Secondary Structure Elements
β-strand
Helix
Bend
Turn
Helix formation is local
THYROID hormone receptor
(2nll)
i
i+4
β-sheet formation is NOT local
Secondary Structure Type Descriptions
• H = alpha helix
• G = 3₁₀-helix
• I = 5 helix (pi helix)
• E = extended strand, participates in beta ladder
• B = residue in isolated beta-bridge
• T = hydrogen bonded turn
• S = bend
• C = coil (the rest)
Automatic assignment programs
• DSSP ( http://www.cmbi.kun.nl/gv/dssp/ )
• STRIDE ( http://www.hgmp.mrc.ac.uk/Registered/Option/stride.html )
• DSSPcont ( http://cubic.bioc.columbia.edu/services/DSSPcont/ )
DSSP
(output excerpt for chain A, residues 4-30; SS is the class letter of the STRUCTURE column, "." = no class assigned; the bridge-label detail of the STRUCTURE column and the N-H-->O / O-->H-N hydrogen-bond columns are not shown)
  #  RES C AA SS  BP1  BP2  ACC     TCO  KAPPA   ALPHA     PHI     PSI  X-CA  Y-CA  Z-CA
  1    4 A  E  .    0    0  205   0.000  360.0   360.0   360.0   113.5   5.7  42.2  25.1
  2    5 A  H  .    0    0  127  -0.987  360.0  -152.8  -149.1   154.0   9.4  41.3  24.7
  3    6 A  V  .    0    0   66  -0.995    4.6  -170.2  -134.3   126.3  11.5  38.4  23.5
  4    7 A  I  E   23   0A  106  -0.976   13.9  -170.8  -114.8   126.6  15.0  37.6  24.5
  5    8 A  I  E   22   0A   74  -0.972   20.8  -158.4  -125.4   129.1  16.6  34.9  22.4
  6    9 A  Q  E   21   0A   86  -0.910   29.5  -170.4   -98.9   106.4  19.9  33.0  23.0
  7   10 A  A  E   20   0A   18  -0.852   11.5   172.8  -108.1   141.7  20.7  31.8  19.5
  8   11 A  E  E   19   0A   63  -0.933    4.4   175.4  -139.1   156.9  23.4  29.4  18.4
  9   12 A  F  E   18   0A   31  -0.967   13.3  -160.9  -160.6   151.3  24.4  27.6  15.3
 10   13 A  Y  E   17   0A   36  -0.994   16.5  -156.0  -136.8   132.1  27.2  25.3  14.1
 11   14 A  L  E   16   0A   24  -0.929   11.7  -122.6  -120.0   133.5  28.0  24.8  10.4
 12   15 A  N  T    0    0   54  -0.884   84.3     9.0  -113.8   150.9  29.7  22.0   8.6
 13   16 A  P  T    0    0  114  -0.963  125.4    60.5   -86.5     8.5  32.0  21.6   6.8
 14   17 A  D  T    0    0   66   0.752   89.3  -146.2   -64.6   -23.0  33.0  25.2   7.6
 15   18 A  Q  T    0    0  132   0.936   51.1   134.1    52.9    50.0  33.3  24.2  11.2
 16   19 A  S  E   11   0A   44  -0.877   28.9   174.9  -124.8   156.8  32.1  27.7  12.3
 17   20 A  G  E   10   0A   28  -0.893   15.9  -146.5  -151.0  -178.9  29.6  28.7  14.8
 18   21 A  E  E    9   0A   14  -0.979    5.0  -169.6  -158.6   146.0  28.0  31.5  16.7
 19   22 A  F  E    8   0A    3  -0.982   27.8   149.2  -139.1   120.3  26.5  32.2  20.1
 20   23 A  M  E    7  30A    0  -0.983   39.7  -127.8  -152.1   161.6  24.5  35.4  20.6
 21   24 A  F  E    6  29A   45  -0.934   23.9  -164.1  -112.5   137.7  21.7  37.0  22.6
 22   25 A  D  E    5  27A    6  -0.948    6.9  -165.0  -123.7   138.3  18.9  38.9  20.8
 23   26 A  F  E    4  26A   76  -0.947   78.4   -27.2  -127.3   111.5  16.4  41.3  22.3
 24   27 A  D  T    0    0   74   0.904  128.9   -46.6    50.4    45.0  13.4  42.1  20.2
 25   28 A  G  T    0    0   20   0.291  118.8   109.3    84.7   -11.1  15.4  41.4  17.0
 26   29 A  D  E   23   0A  114  -0.822   71.8  -114.7  -103.1   140.3  18.4  43.4  18.1
 27   30 A  E  E   22   0A    8  -0.525   24.9  -177.7   -74.1   127.5  21.8  41.8  19.1
Prediction of protein secondary structure
• What to predict?
• How to predict?
• How good are the best?
Secondary Structure Prediction
• What to predict?
– All 8 types or pool types into groups?
DSSP
* H = alpha helix (31%)
* G = 3₁₀-helix (3.5%)
* I = 5 helix (pi helix) (<0.1%)
  pooled into H
* E = extended strand (21%)
* B = beta-bridge (1%)
  pooled into E
* T = hydrogen bonded turn (11%)
* S = bend (9%)
* C = coil (23%)
  pooled into C
Secondary Structure Prediction
• What to predict?
– All 8 types or pool types into groups
Straight HEC
* H = alpha helix
  kept as H
* E = extended strand
  kept as E
* T = hydrogen bonded turn
* S = bend
* C = coil
* G = 3₁₀-helix
* I = 5 helix (pi helix)
* B = beta-bridge
  pooled into C
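The two pooling schemes on these slides amount to a lookup table per scheme. The following is a minimal Python sketch; the names `POOLED`, `STRAIGHT_HEC` and `reduce_dssp` are illustrative, and treating any other symbol (such as "." or a blank for an unassigned residue) as coil is an assumption.

```python
# Pooled scheme: helix-like types -> H, strand-like types -> E, the rest -> C.
POOLED = {"H": "H", "G": "H", "I": "H",
          "E": "E", "B": "E",
          "T": "C", "S": "C", "C": "C"}

# "Straight HEC": only H and E keep their class; every other type becomes C.
STRAIGHT_HEC = {"H": "H", "E": "E"}


def reduce_dssp(states, scheme=POOLED):
    """Map a string of 8-class DSSP letters to the three classes H, E, C.

    Symbols missing from the scheme (for example '.' or ' ' for residues
    without an assignment) are treated as coil; this is an assumption.
    """
    return "".join(scheme.get(s, "C") for s in states)


all_types = "HGIEBTSC"
print(reduce_dssp(all_types))                # pooled scheme
print(reduce_dssp(all_types, STRAIGHT_HEC))  # straight HEC
```

The choice of scheme changes the target labels, so accuracies reported under different schemes are not directly comparable.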
Secondary Structure Prediction
• Simple alignments
– Align to a close homolog for which the structure has been experimentally solved.
• Heuristic methods (e.g., Chou-Fasman, 1974)
– Apply scores for each amino acid and sum up over a window.
• Neural networks (different inputs)
– Raw sequence (late 80’s)
– Blosum matrix (e.g., PHD, early 90’s)
– Position specific alignment profiles (e.g., PSIPRED, late 90’s)
– Multiple networks balloting, probability conversion, output expansion (Petersen et al., 2000).
The pessimistic point of view
Prediction by alignment
Simple Alignments
• Solved structure of a homolog to query is
needed
• Homologous proteins have ~88% identical (3
state) secondary structure
• If no close homologue can be identified
alignments will give almost random results
Improvement of accuracy
1974 Chou & Fasman      ~50-53%
1978 Garnier            63%
1987 Zvelebil           66%
1988 Qian & Sejnowski   64.3%
1993 Rost & Sander      70.8-72.0%
1997 Frishman & Argos   <75%
1999 Cuff & Barton      72.9%
1999 Jones              76.5%
2000 Petersen et al.    77.9%
Secondary structure predictions
of 1. and 2. generation
• single residues (1. generation), 1957-70/80
– Chou-Fasman, GOR: 50-55% accuracy
• segments (2. generation), 1986-92
– GORIII: 55-60% accuracy
• problems
– < 100%; they said: 65% max
– < 40%; they said: strand non-local
– short segments
Amino acid preferences in α-Helix
Amino acid preferences in β-Strand
Amino acid preferences in coil
Chou-Fasman
Name  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Ala   142   83    66       0.06   0.076   0.035   0.058
Arg   98    93    95       0.070  0.106   0.099   0.085
Asp   101   54    146      0.147  0.110   0.179   0.081
Asn   67    89    156      0.161  0.083   0.191   0.091
Cys   70    119   119      0.149  0.050   0.117   0.128
Glu   151   37    74       0.056  0.060   0.077   0.064
Gln   111   110   98       0.074  0.098   0.037   0.098
Gly   57    75    156      0.102  0.085   0.190   0.152
His   100   87    95       0.140  0.047   0.093   0.054
Ile   108   160   47       0.043  0.034   0.013   0.056
Leu   121   130   59       0.061  0.025   0.036   0.070
Lys   114   74    101      0.055  0.115   0.072   0.095
Met   145   105   60       0.068  0.082   0.014   0.055
Phe   113   138   60       0.059  0.041   0.065   0.065
Pro   57    55    152      0.102  0.301   0.034   0.068
Ser   77    75    143      0.120  0.139   0.125   0.106
Thr   83    119   96       0.086  0.108   0.065   0.079
Trp   108   137   96       0.077  0.013   0.064   0.167
Tyr   69    147   114      0.082  0.065   0.114   0.125
Val   106   170   50       0.062  0.048   0.028   0.053
Chou-Fasman
1. Assign all of the residues in the peptide the appropriate set of parameters.
2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(a-helix) > 100. That region is declared an alpha-helix. Extend the helix in both directions until a set of four contiguous residues that have an average P(a-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(a-helix) > P(b-sheet) for that segment, the segment can be assigned as a helix.
3. Repeat this procedure to locate all of the helical regions in the sequence.
4. Scan through the peptide and identify a region where 3 out of 5 of the residues have a value of P(b-sheet) > 100. That region is declared as a beta-sheet. Extend the sheet in both directions until a set of four contiguous residues that have an average P(b-sheet) < 100 is reached. That is declared the end of the beta-sheet. Any segment of the region located by this procedure is assigned as a beta-sheet if the average P(b-sheet) > 105 and the average P(b-sheet) > P(a-helix) for that region.
5. Any region containing overlapping alpha-helical and beta-sheet assignments is taken to be helical if the average P(a-helix) > P(b-sheet) for that region. It is a beta-sheet if the average P(b-sheet) > P(a-helix) for that region.
6. To identify a turn starting at residue number j, calculate the following value:
p(t) = f(i) of residue j × f(i+1) of residue j+1 × f(i+2) of residue j+2 × f(i+3) of residue j+3
where each factor is read from the corresponding column of the table for the residue at that position. If: (1) p(t) > 0.000075; (2) the average value for P(turn) > 100 in the tetra-peptide; and (3) the averages for the tetra-peptide obey the inequality P(a-helix) < P(turn) > P(b-sheet), then a beta-turn is predicted at that location.
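The helix nucleation rule of step 2 can be sketched in a few lines of Python. The P(a) values are copied from the table on the previous slide and keyed by one-letter code; the function names `helix_nuclei` and `mean_propensity` are illustrative, and the sketch covers only nucleation, not the extension, sheet or turn steps.

```python
# P(a) helix propensities from the Chou-Fasman table, keyed by one-letter code.
P_ALPHA = {
    "A": 142, "R": 98,  "D": 101, "N": 67,  "C": 70,
    "E": 151, "Q": 111, "G": 57,  "H": 100, "I": 108,
    "L": 121, "K": 114, "M": 145, "F": 113, "P": 57,
    "S": 77,  "T": 83,  "W": 108, "Y": 69,  "V": 106,
}


def helix_nuclei(seq, window=6, min_formers=4, threshold=100):
    """Return start indices of windows that satisfy the nucleation rule:
    at least `min_formers` of `window` contiguous residues with P(a) > threshold."""
    starts = []
    for i in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[i:i + window] if P_ALPHA[aa] > threshold)
        if formers >= min_formers:
            starts.append(i)
    return starts


def mean_propensity(segment, table=P_ALPHA):
    """Average propensity over a segment, as used in the extension and
    comparison rules (average P(a-helix) versus average P(b-sheet))."""
    return sum(table[aa] for aa in segment) / len(segment)


print(helix_nuclei("PITKEVEVEYLLRRLEE"))
```

Overlapping windows that pass the rule would normally be merged into one nucleus before extension; that merging is left out here to keep the sketch short.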
Chou-Fasman
• Generally applicable
• Works for sequences with no solved
homologs
• But the accuracy is low!
– 50%
PHD method
(Rost and Sander, 1993!!)
• Combine neural networks with sequence profiles
– 6-8 Percentage points increase in prediction accuracy over
standard neural networks (63% -> 71%)
• Use second layer “Structure to structure” network to
filter predictions
• Jury of predictors
• Set up as mail server
Sequence profiles
Neural Networks
• Benefits
– Generally applicable
– Can capture higher order correlations
– Inputs other than sequence information
• Drawbacks
– Needs a lot of data (different solved structures).
• However, these do exist today (nearly 5000 solved structures with low sequence identity/high resolution).
– Complex method with several pitfalls
How is it done
• One network (SEQ2STR) takes sequence
(profiles) as input and predicts secondary
structure
– Cannot deal with SS elements i.e. helices are
normally formed by at least 5 consecutive
amino acids
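Architecture
[Figure: feed-forward network. A window slides along the sequence IKEEHVIIQAEFYLNPDQSGEF….. and feeds the input layer; weights connect the input layer to a hidden layer and the hidden layer to an output layer with the classes H, E and C.]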
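How a sequence window becomes network input can be illustrated with a sparse (one-hot) encoding. This is a minimal sketch, not the encoding of any particular published method: the window half-width of 6, the extra padding slot for positions beyond the chain ends, and the names `AMINO_ACIDS`, `PAD` and `encode_windows` are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = len(AMINO_ACIDS)                # extra slot for positions outside the chain


def encode_windows(seq, half_window=6):
    """One-hot encode a sliding window centred on every residue.

    Returns one flat vector per residue with
    (2 * half_window + 1) * 21 entries. Residues outside the
    20 standard letters raise ValueError.
    """
    width = len(AMINO_ACIDS) + 1
    vectors = []
    for center in range(len(seq)):
        vec = []
        for pos in range(center - half_window, center + half_window + 1):
            slot = [0] * width
            if 0 <= pos < len(seq):
                slot[AMINO_ACIDS.index(seq[pos])] = 1
            else:
                slot[PAD] = 1  # window sticks out past the chain end
            vec.extend(slot)
        vectors.append(vec)
    return vectors


vectors = encode_windows("IKEEHVIIQAEFYLNPDQSGEF", half_window=6)
print(len(vectors), len(vectors[0]))
```

With profile input, each one-hot slot would be replaced by the corresponding row of profile scores for that position; the window logic stays the same.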
Example
PITKEVEVEYLLRRLEE (Sequence)
HHHHHHHHHHHHTGGG. (DSSP)
ECCCHEEHHHHHHHCCC (SEQ2STR)
How is it done
• One network (SEQ2STR) takes sequence (profiles) as
input and predicts secondary structure
– Cannot deal with SS elements i.e. helices are normally formed by
at least 5 consecutive amino acids
• Second network (STR2STR) takes predictions of first
network and predicts secondary structure
– Can correct for errors in SS elements, e.g. remove single-residue helix predictions or mixtures of strand and helix predictions
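Secondary networks
(Structure-to-Structure)
[Figure: feed-forward network. A window over the H/E/C predictions of the first network for the sequence IKEEHVIIQAEFYLNPDQSGEF….. feeds the input layer; weights connect the input layer to a hidden layer and the hidden layer to an output layer with the classes H, E and C.]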
Example
PITKEVEVEYLLRRLEE (Sequence)
HHHHHHHHHHHHTGGG. (DSSP)
ECCCHEEHHHHHHHCCC (SEQ2STR)
CCCCHHHHHHHHHHCCC (STR2STR)
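The effect of the second network can be quantified with the three-state accuracy Q3, the fraction of residues whose predicted class matches the class derived from DSSP. A minimal Python sketch follows; it assumes the pooled reduction (H, G, I to H; E, B to E; everything else to C), and the names `to_three_state` and `q3` are illustrative.

```python
# Pooled reduction of DSSP classes; any symbol not listed becomes coil (C).
POOLED = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def to_three_state(states):
    """Reduce an 8-class DSSP string to the three classes H, E, C."""
    return "".join(POOLED.get(s, "C") for s in states)


def q3(predicted, observed_dssp):
    """Fraction of residues whose predicted H/E/C class matches DSSP."""
    if len(predicted) != len(observed_dssp):
        raise ValueError("prediction and DSSP string must have the same length")
    if not predicted:
        raise ValueError("empty input")
    observed = to_three_state(observed_dssp)
    hits = sum(p == o for p, o in zip(predicted, observed))
    return hits / len(observed)


if __name__ == "__main__":
    dssp = "HHHHHHHHHHHHTGGG."  # DSSP line of the example
    print("SEQ2STR:", q3("ECCCHEEHHHHHHHCCC", dssp))
    print("STR2STR:", q3("CCCCHHHHHHHHHHCCC", dssp))
```

On the example above the STR2STR string agrees with DSSP at more positions than the SEQ2STR string, which is the correction the second network is meant to provide. A single 17-residue peptide is of course far too short for a meaningful accuracy estimate.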
Slide courtesy by B. Rost 2004
Prediction accuracy PHD
Slide courtesy by B. Rost 2004
Stronger predictions more accurate!
PSI-Pred (Jones)
• Use alignments from iterative sequence
searches (PSI-Blast) as input to a neural
network (Just like PHDsec)
• Better predictions due to better
sequence profiles
• Available as stand alone program and via
the web
Petersen et al. 2000
• SEQ2STR (>70 networks)
– Not one single network architecture is best
for all sequences
• STR2STR (>70 networks)
• => 4900 network predictions,
– (wisdom of the crowd!!!)
– Others have 1
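Combining many networks can be sketched as averaging their per-residue class probabilities and taking the most probable class. This is a simplified stand-in, assuming plain averaging; the balloting and probability conversion used by Petersen et al. are not reproduced here, and the name `ensemble_predict` and the data layout are illustrative.

```python
def ensemble_predict(network_outputs):
    """Average per-residue class probabilities over several networks.

    network_outputs[n][i] is a (P_H, P_E, P_C) tuple from network n
    for residue i. Returns the averaged probabilities and the most
    probable class per residue as a string over "HEC".
    """
    classes = "HEC"
    n_networks = len(network_outputs)
    n_residues = len(network_outputs[0])
    averaged = []
    for i in range(n_residues):
        mean = tuple(
            sum(net[i][c] for net in network_outputs) / n_networks
            for c in range(3)
        )
        averaged.append(mean)
    labels = "".join(
        classes[max(range(3), key=lambda c: probs[c])] for probs in averaged
    )
    return averaged, labels


# Three hypothetical networks, two residues each.
outputs = [
    [(0.6, 0.3, 0.1), (0.2, 0.5, 0.3)],
    [(0.2, 0.6, 0.2), (0.1, 0.2, 0.7)],
    [(0.7, 0.2, 0.1), (0.3, 0.3, 0.4)],
]
averaged, labels = ensemble_predict(outputs)
print(labels)
```

In this toy input no single network produces the ensemble answer on both residues, which is the kind of disagreement averaging is meant to smooth out.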
Why so many networks?
Why not select the best?
Prediction accuracy (Q3=81.2%). 2006.
(Petersen et al. 2000)
Spectrin homology domain (SH3)
HEADER CYTOSKELETON
COMPND ALPHA SPECTRIN (SH3 DOMAIN)
SOURCE CHICKEN (GALLUS GALLUS) BRAIN
AUTHOR M.NOBLE,R.PAUPTIT,A.MUSACCHIO,M.SARASTE
CEEEEEEECCCCCCCCCCCCCCCCEEEEEECCCCCEEEEEECCCEEEECCCCCEECC (Prediction)
.EEEEESS.B...STTB..B.TT.EEEEEE..SSSEEEEEETTEEEEEEGGGEEE.. (DSSP)
93%
Prediction of protein secondary structure
• 1980: 55% simple
• 1990: 60% less simple
• 1993: 70% evolution
• 2000: 76% more evolution
• 2006: 80% more evolution
• 2008: >80% more evolution
Links to servers
• Database of links
http://mmtsb.scripps.edu/cgi-bin/renderrelres?protmodel
• ProfPHD
http://www.predictprotein.org/
• PSIPRED
http://bioinf.cs.ucl.ac.uk/psipred/
• JPred
http://www.compbio.dundee.ac.uk/~www-jpred/
Surface exposure
What is Accessible Solvent Area?
• Surface area accessible to a rolling water molecule
RSA
RSA = ACC / ASA
RSA = Relative Solvent Accessibility
ACC = Accessible area in protein structure
ASA = Accessible Surface Area in Gly-X-Gly or Ala-X-Ala
Classification: Buried = RSA < 25 %, Exposed = RSA > 25 %
“Real” Value: values 0 - 1, RSA > 1 set to 1
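The RSA definitions above translate directly into code. In this minimal Python sketch the reference ASA of the residue type (in a Gly-X-Gly or Ala-X-Ala tripeptide) is passed in as a parameter because the slides do not list those values; the function names are illustrative, and counting a value of exactly 25 % as exposed is an assumption, since the slide leaves that case open.

```python
def relative_accessibility(acc, max_asa):
    """RSA = ACC / ASA, with values above 1 set to 1.

    acc     : accessible area of the residue in the structure (DSSP ACC)
    max_asa : reference ASA of that residue type in an extended tripeptide
    """
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    return min(acc / max_asa, 1.0)


def exposure_class(rsa, threshold=0.25):
    """Two-state label; a value exactly at the threshold counts as exposed."""
    return "exposed" if rsa >= threshold else "buried"


# Hypothetical numbers: a residue with ACC = 18 and a reference ASA of 120.
rsa = relative_accessibility(18, 120)
print(rsa, exposure_class(rsa))
```

The real-valued RSA is the target for the regression networks discussed next, while the two-state label is the target for the classification variant.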
Method
Neural Network - Input
• Position Specific Scoring Matrices, PSSM
   AA  Chain   Pos    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
B  H   2BEM.A  1     -4  -3  -2  -4  -6  -2  -3  -5  11  -6  -5  -3  -4  -4  -5  -3  -4  -5  -1  -6
A  G   2BEM.A  2     -2  -5  -3  -4  -5  -4  -5   7  -5  -7  -6  -4  -5  -6  -5  -3  -4  -5  -6  -6
A  Y   2BEM.A  3     -1   1  -4  -3  -5  -4  -4  -4   1  -4  -1  -4  -1   2  -5   0  -1   4   7  -2
A  V   2BEM.A  4     -1  -5  -5  -6  -4  -4  -5  -5  -5   4   1  -5   6  -3  -2  -2   0  -5  -4   4
B  E   2BEM.A  5     -2  -4  -3   0  -4  -1   3  -2  -4   0  -3  -2   1  -2  -3   3   3  -5  -4   0
• Secondary Structure predictions
   AA  Chain   Pos
B  H   2BEM.A  1     0.003  0.003  0.966
A  G   2BEM.A  2     0.018  0.086  0.868
A  Y   2BEM.A  3     0.020  0.199  0.752
A  V   2BEM.A  4     0.021  0.271  0.679
B  E   2BEM.A  5     0.020  0.199  0.752
Wisdom of the crowd
– Selecting best performing network
architectures based on test performance
• Better than choosing any single network
[Figure: 10-fold % correct predictions, average of sets A-J with secondary structure. x-axis: ensemble size (average of top 1 to average of top 20 networks); y-axis: % correct, 79.40 to 79.80.]
% correct for the average of the top 1 to top 20 networks: 79.55, 79.66, 79.69, 79.72, 79.75, 79.75, 79.75, 79.74, 79.75, 79.76, 79.77, 79.77, 79.76, 79.75, 79.76, 79.75, 79.75, 79.76, 79.77, 79.77
Results - Real Value networks
• Training / Evaluation
                              Train          Evaluated      Method
Ahmad et al. (2003)           Not Published  0.48           ANN
Yuan and Huang (2004)         Not Published  0.52           SVR
Nguyen and Rajapakse (2006)   Not Published  0.66           Two-Stage SVR
Dor and Zhou (2007)           0.738          Not Published  ANN
NetSurfP                      0.722          0.70           ANN
Accuracy of predictions
• Prediction methods will always give an
answer
– A given method will predict that 25% of the
residues in a protein are exposed
• But can you trust these predictions?
• Use benchmarking to give the average prediction accuracy of a method evaluated on a large independent data set.
• But what about residue/single prediction
specific reliability?
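Reliability (one real value target value)
$E = w \cdot (t - o)^2 + \lambda \cdot (1 - w)$
Optimal value for $\lambda$:
$\lambda = 0 \Rightarrow w = 0$;
$\lambda = \infty \Rightarrow w = 1$;
[Figure: network with input layer, hidden layer and output layer. One target value per input, but two output values! Example outputs: o = 0.55, w = 0.8]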
Performance
           NetSurfP  <RSA>   Spine   <RSA>
all        0.702     0.286   0.702   0.267
Top 80%    0.729     0.278   0.708   0.231
Top 50%    0.765     0.276   0.723   0.184
Top 20%    0.789     0.312   0.730   0.164
NetSurfP
NetSurfP
Conclusions
• The big breakthrough in SS prediction came from sequence profiles
– Rost et al.
• Prediction of secondary structure has not changed in the last 5 years
– More protein sequences => higher prediction accuracy
– No new theoretical breakthrough
• Accuracy is close to 80% for globular proteins
• If you need a secondary structure prediction, use one of the profile-based methods:
– PSIPRED and NetSurfP
• Amino acid exposure can be predicted with high accuracy (80%)
– NetSurfP and Real-Spine