Protein Secondary Structures
Assignment and prediction
Claus Lundegaard
April 8, 2003
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS, TECHNICAL UNIVERSITY OF DENMARK (DTU)

Use of secondary structure
• Classification of protein structures
• Definition of loops/core
• Use in fold recognition methods
• Improvements of alignments
• Definition of domain boundaries

Secondary Structure Elements

Classification of secondary structure
• Defining features
– Dihedral angles
– Hydrogen bonds
– Geometry
• Assigned manually by crystallographers, or
• Assigned automatically
– DSSP (Kabsch & Sander, 1983)
– STRIDE (Frishman & Argos, 1995)
– Continuum (Andersen et al.)

Dihedral Angles
phi    dihedral angle about the N-Calpha bond
psi    dihedral angle about the Calpha-C bond
omega  dihedral angle about the C-N (peptide) bond
From http://www.imb-jena.de
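
To illustrate how phi, psi and omega can be obtained from atomic coordinates, the sketch below computes a signed dihedral angle from four points. The function name `dihedral` and the toy coordinates are illustrative assumptions and do not come from the slides. For phi the four atoms would be C(i-1), N(i), Calpha(i), C(i); for psi N(i), Calpha(i), C(i), N(i+1).

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1 = cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = cross(b1, b2)  # normal of the plane through p1, p2, p3
    b1_len = math.sqrt(dot(b1, b1))
    axis = tuple(x / b1_len for x in b1)
    # atan2 keeps the sign of the rotation about the central bond.
    return math.degrees(math.atan2(dot(axis, cross(n1, n2)), dot(n1, n2)))
```

With real coordinates one would call, for example, `dihedral(C_prev, N, CA, C)` to get phi for a residue.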

Alpha helices
                            phi(deg)  psi(deg)  H-bond pattern
-----------------------------------------------------------------
right-handed alpha-helix     -57.8     -47.0     i+4
pi-helix                     -57.1     -69.7     i+5
3-10 helix                   -74.0      -4.0     i+3
-----------------------------------------------------------------
(omega is 180 deg in all cases)
From http://www.imb-jena.de

Beta Strands
               phi(deg)  psi(deg)  omega(deg)
-----------------------------------------------------------------
beta strand     -120       120       180
-----------------------------------------------------------------
Hydrogen bond patterns in beta sheets. Here a four-stranded beta sheet is drawn schematically which contains three antiparallel and one parallel strand. Hydrogen bonds are indicated with red lines (antiparallel strands) and green lines (parallel strands) connecting the hydrogen and acceptor oxygen.
From http://broccoli.mfn.ki.se/pps_course_96/

Secondary Structure Types
* H = alpha helix
* B = residue in isolated beta-bridge
* E = extended strand, participates in beta ladder
* G = 3-helix (3/10 helix)
* I = 5 helix (pi helix)
* T = hydrogen bonded turn
* S = bend

Automatic assignment programs
• DSSP ( http://www.cmbi.kun.nl/gv/dssp/ )
• STRIDE ( http://www.hgmp.mrc.ac.uk/Registered/Option/stride.html )

Example DSSP output (the N-H-->O and O-->H-N hydrogen-bond columns of the slide could not be recovered from the transcript and are not shown):

  # RESIDUE AA STRUCTURE  BP1 BP2   ACC     TCO  KAPPA  ALPHA    PHI    PSI  X-CA  Y-CA  Z-CA
  1     4 A  E              0   0   205   0.000  360.0  360.0  360.0  113.5   5.7  42.2  25.1
  2     5 A  H              0   0   127  -0.987  360.0 -152.8 -149.1  154.0   9.4  41.3  24.7
  3     6 A  V              0   0    66  -0.995    4.6 -170.2 -134.3  126.3  11.5  38.4  23.5
  4     7 A  I E -A        23   0A  106  -0.976   13.9 -170.8 -114.8  126.6  15.0  37.6  24.5
  5     8 A  I E -A        22   0A   74  -0.972   20.8 -158.4 -125.4  129.1  16.6  34.9  22.4
  6     9 A  Q E -A        21   0A   86  -0.910   29.5 -170.4  -98.9  106.4  19.9  33.0  23.0
  7    10 A  A E +A        20   0A   18  -0.852   11.5  172.8 -108.1  141.7  20.7  31.8  19.5
  8    11 A  E E +A        19   0A   63  -0.933    4.4  175.4 -139.1  156.9  23.4  29.4  18.4
  9    12 A  F E -A        18   0A   31  -0.967   13.3 -160.9 -160.6  151.3  24.4  27.6  15.3
 10    13 A  Y E -A        17   0A   36  -0.994   16.5 -156.0 -136.8  132.1  27.2  25.3  14.1
 11    14 A  L E >> -A     16   0A   24  -0.929   11.7 -122.6 -120.0  133.5  28.0  24.8  10.4
 12    15 A  N T 45S+       0   0    54  -0.884   84.3    9.0 -113.8  150.9  29.7  22.0   8.6
 13    16 A  P T 45S+       0   0   114  -0.963  125.4   60.5  -86.5    8.5  32.0  21.6   6.8
 14    17 A  D T 45S-       0   0    66   0.752   89.3 -146.2  -64.6  -23.0  33.0  25.2   7.6
 15    18 A  Q T <5 +       0   0   132   0.936   51.1  134.1   52.9   50.0  33.3  24.2  11.2
 16    19 A  S E < +A      11   0A   44  -0.877   28.9  174.9 -124.8  156.8  32.1  27.7  12.3
 17    20 A  G E -A        10   0A   28  -0.893   15.9 -146.5 -151.0 -178.9  29.6  28.7  14.8
 18    21 A  E E -A         9   0A   14  -0.979    5.0 -169.6 -158.6  146.0  28.0  31.5  16.7
 19    22 A  F E +A         8   0A    3  -0.982   27.8  149.2 -139.1  120.3  26.5  32.2  20.1
 20    23 A  M E -AB        7  30A    0  -0.983   39.7 -127.8 -152.1  161.6  24.5  35.4  20.6
 21    24 A  F E -AB        6  29A   45  -0.934   23.9 -164.1 -112.5  137.7  21.7  37.0  22.6
 22    25 A  D E -AB        5  27A    6  -0.948    6.9 -165.0 -123.7  138.3  18.9  38.9  20.8
 23    26 A  F E > S-AB     4  26A   76  -0.947   78.4  -27.2 -127.3  111.5  16.4  41.3  22.3
 24    27 A  D T 3 S-       0   0    74   0.904  128.9  -46.6   50.4   45.0  13.4  42.1  20.2
 25    28 A  G T 3 S+       0   0    20   0.291  118.8  109.3   84.7  -11.1  15.4  41.4  17.0
 26    29 A  D E < S-B     23   0A  114  -0.822   71.8 -114.7 -103.1  140.3  18.4  43.4  18.1
 27    30 A  E E -B        22   0A    8  -0.525   24.9 -177.7  -74.1  127.5  21.8  41.8  19.1

Secondary Structure Prediction
• What to predict?
– All 8 types or pool types into groups
* H = alpha helix
* B = residue in isolated beta-bridge
* E = extended strand, participates in beta ladder
* G = 3-helix (3/10 helix)
* I = 5 helix (pi helix)
* T = hydrogen bonded turn
* S = bend
* C/. = random coil
– Q3: three pooled groups H, E and C (the slide labels two pooling schemes, "Straight" and "CASP")
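
Pooling the eight DSSP types into three states can be sketched as a lookup. The slide names two schemes, "Straight" and "CASP", but this transcript does not spell out their mappings. The dictionaries below are therefore assumptions: the CASP-style map follows the commonly used pooling H, G, I -> H and E, B -> E, while the straight map keeps only H and E and sends everything else to coil.

```python
# Assumed pooling schemes; any code not listed is treated as coil (C).
CASP_MAP = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}
STRAIGHT_MAP = {"H": "H", "E": "E"}

def reduce_states(dssp_string, mapping=CASP_MAP):
    """Convert a string of 8-state DSSP codes into 3-state H/E/C labels."""
    return "".join(mapping.get(code, "C") for code in dssp_string)
```

For example, `reduce_states("EEEETTTTEEEE")` yields `"EEEECCCCEEEE"` under either scheme, whereas a 3/10 helix (`G`) becomes `H` only under the CASP-style map.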

Secondary Structure Prediction
• Simple alignments
• Heuristic methods (e.g., Chou-Fasman, 1974)
• Neural networks (different inputs)
– Raw sequence (late 80's)
– Blosum matrix (e.g., PHD, early 90's)
– Position-specific alignment profiles (e.g., PSIPRED, late 90's)
– Multiple networks balloting, probability conversion, output expansion (Petersen et al., 2000)

Improvement of accuracy
1974  Chou & Fasman      ~50-53%
1978  Garnier            63%
1987  Zvelebil           66%
1988  Qian & Sejnowski   64.3%
1993  Rost & Sander      70.8-72.0%
1997  Frishman & Argos   <75%
1999  Cuff & Barton      72.9%
1999  Jones              76.5%
2000  Petersen et al.    77.9%
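
Assuming the percentages above are three-state per-residue accuracies (Q3, the measure named elsewhere in these slides), the score can be computed as below. The function name `q3` and the toy strings are illustrative only.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted H/E/C label matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)
```

A prediction `"HHEC"` against an observed `"HHCC"` agrees at three of four positions, so `q3` returns 75.0.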

Simple Alignments
• Solved structures homologous to query needed
• Homologous proteins have ~88% identical (3 state) secondary structure
• If no homologue can be identified, alignment will give almost random results
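
A minimal sketch of prediction by simple alignment: the three-state labels of a solved homologue are copied onto the query along a pairwise alignment. The function name, the gap character `-`, the `?` placeholder for unaligned query residues and the toy alignment are all assumptions made for illustration.

```python
def transfer_structure(aligned_query, aligned_template, template_ss):
    """Copy template H/E/C labels onto the query along a gapped pairwise alignment.

    template_ss holds one label per ungapped template residue.
    """
    labels = []
    t = 0  # index into template_ss
    for q_char, t_char in zip(aligned_query, aligned_template):
        if t_char != "-":
            state = template_ss[t]
            t += 1
        else:
            state = "?"  # query residue without a template counterpart
        if q_char != "-":
            labels.append(state)
    return "".join(labels)
```

Query residues that face a gap in the template receive no label from the homologue, which is one reason the approach degrades when no close homologue exists.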

Amino acid preferences in alpha-helix

Amino acid preferences in beta-strand

Amino acid preferences in coil

Chou-Fasman
Name  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Ala   142    83     66     0.06   0.076   0.035   0.058
Arg    98    93     95     0.070  0.106   0.099   0.085
Asp   101    54    146     0.147  0.110   0.179   0.081
Asn    67    89    156     0.161  0.083   0.191   0.091
Cys    70   119    119     0.149  0.050   0.117   0.128
Glu   151    37     74     0.056  0.060   0.077   0.064
Gln   111   110     98     0.074  0.098   0.037   0.098
Gly    57    75    156     0.102  0.085   0.190   0.152
His   100    87     95     0.140  0.047   0.093   0.054
Ile   108   160     47     0.043  0.034   0.013   0.056
Leu   121   130     59     0.061  0.025   0.036   0.070
Lys   114    74    101     0.055  0.115   0.072   0.095
Met   145   105     60     0.068  0.082   0.014   0.055
Phe   113   138     60     0.059  0.041   0.065   0.065
Pro    57    55    152     0.102  0.301   0.034   0.068
Ser    77    75    143     0.120  0.139   0.125   0.106
Thr    83   119     96     0.086  0.108   0.065   0.079
Trp   108   137     96     0.077  0.013   0.064   0.167
Tyr    69   147    114     0.082  0.065   0.114   0.125
Val   106   170     50     0.062  0.048   0.028   0.053

Chou-Fasman
1. Assign all of the residues in the peptide the appropriate set of parameters.
2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(a-helix) > 100. That region is declared an alpha-helix. Extend the helix in both directions until a set of four contiguous residues that have an average P(a-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(a-helix) > P(b-sheet) for that segment, the segment can be assigned as a helix.
3. Repeat this procedure to locate all of the helical regions in the sequence.
4. Scan through the peptide and identify a region where 3 out of 5 of the residues have a value of P(b-sheet) > 100. That region is declared a beta-sheet. Extend the sheet in both directions until a set of four contiguous residues that have an average P(b-sheet) < 100 is reached. That is declared the end of the beta-sheet. Any segment of the region located by this procedure is assigned as a beta-sheet if the average P(b-sheet) > 105 and the average P(b-sheet) > P(a-helix) for that region.
5. Any region containing overlapping alpha-helical and beta-sheet assignments is taken to be helical if the average P(a-helix) > P(b-sheet) for that region. It is a beta-sheet if the average P(b-sheet) > P(a-helix) for that region.
6. To identify a bend at residue number j, calculate the following value:
p(t) = f(j)f(j+1)f(j+2)f(j+3)
where the f(i) value is used for residue j, the f(i+1) value for residue j+1, the f(i+2) value for residue j+2 and the f(i+3) value for residue j+3. If: (1) p(t) > 0.000075; (2) the average value for P(turn) > 100 in the tetra-peptide; and (3) the averages for the tetra-peptide obey the inequality P(a-helix) < P(turn) > P(b-sheet), then a beta-turn is predicted at that location.
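
Two of the steps above, helix nucleation (step 2) and the turn test (step 6), can be sketched as follows. The parameter values are copied from the Chou-Fasman table in these slides, keyed by one-letter code; the function names are illustrative, the extension and overlap-resolution steps are deliberately left out, and the P(turn) threshold is taken as 100 on the table's scale. This is a partial sketch under those assumptions, not a full implementation of the method.

```python
# Columns: P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3)
CF = {
    "A": (142, 83, 66, 0.06, 0.076, 0.035, 0.058),
    "R": (98, 93, 95, 0.070, 0.106, 0.099, 0.085),
    "D": (101, 54, 146, 0.147, 0.110, 0.179, 0.081),
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "C": (70, 119, 119, 0.149, 0.050, 0.117, 0.128),
    "E": (151, 37, 74, 0.056, 0.060, 0.077, 0.064),
    "Q": (111, 110, 98, 0.074, 0.098, 0.037, 0.098),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "H": (100, 87, 95, 0.140, 0.047, 0.093, 0.054),
    "I": (108, 160, 47, 0.043, 0.034, 0.013, 0.056),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.070),
    "K": (114, 74, 101, 0.055, 0.115, 0.072, 0.095),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
    "F": (113, 138, 60, 0.059, 0.041, 0.065, 0.065),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "S": (77, 75, 143, 0.120, 0.139, 0.125, 0.106),
    "T": (83, 119, 96, 0.086, 0.108, 0.065, 0.079),
    "W": (108, 137, 96, 0.077, 0.013, 0.064, 0.167),
    "Y": (69, 147, 114, 0.082, 0.065, 0.114, 0.125),
    "V": (106, 170, 50, 0.062, 0.048, 0.028, 0.053),
}

def helix_nuclei(seq, window=6, needed=4):
    """Start positions of windows where at least `needed` residues have P(a) > 100."""
    starts = []
    for i in range(len(seq) - window + 1):
        formers = sum(1 for r in seq[i:i + window] if CF[r][0] > 100)
        if formers >= needed:
            starts.append(i)
    return starts

def turn_score(seq, j):
    """p(t) = f(j) f(j+1) f(j+2) f(j+3) for the tetrapeptide starting at j."""
    p = 1.0
    for offset in range(4):
        p *= CF[seq[j + offset]][3 + offset]
    return p

def is_turn(seq, j):
    """Apply the three turn conditions to the tetrapeptide starting at j."""
    tetra = seq[j:j + 4]
    if len(tetra) < 4:
        return False

    def avg(col):
        return sum(CF[r][col] for r in tetra) / 4.0

    return (turn_score(seq, j) > 0.000075
            and avg(2) > 100  # P(turn) on the table's x100 scale
            and avg(0) < avg(2) > avg(1))
```

On the example sequence used in these slides, the first window `IKEEHV` contains five residues with P(a) > 100 and is therefore a helix nucleus, while the tetrapeptide `NPDG` passes all three turn conditions.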

Chou-Fasman
• Generally applicable
• Works for sequences with no solved homologs
• Low accuracy

Neural Networks
• Benefits
– Generally applicable
– Can capture higher-order correlations
– Can use inputs other than sequence information
• Drawbacks
– Need a lot of data (many different solved structures)
– Risk of overtraining

Architecture
(Figure: a window sliding over the sequence IKEEHVIIQAEFYLNPDQSGEF….. feeds the input layer, which is connected through weights to the hidden layer and on to an output layer with the units H, E and C.)
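
The architecture in the figure (input layer, weights, hidden layer, three output units) can be sketched as a plain feed-forward pass. Everything below is an assumption made for illustration: the sigmoid activation, the bias stored as the last entry of each weight row, the layer sizes and the random weights. A trained network would of course use learned weights.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights):
    # Each row holds one neuron's weights; the last entry is its bias.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1])
            for row in weights]

def predict(inputs, w_hidden, w_output):
    """Return the winning state (H, E or C) and the three output activities."""
    hidden = layer(inputs, w_hidden)
    output = layer(hidden, w_output)
    return "HEC"[output.index(max(output))], output

def random_weights(n_out, n_in, rng):
    """Untrained weights, only to make the sketch runnable."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
            for _ in range(n_out)]
```

A window of 7 residues with 20 inputs each would give `n_in = 140` for the hidden layer, and the output layer always has three rows, one per state.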

Sparse encoding
AAcid  Inp Neuron
        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
A       1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
R       0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
N       0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
D       0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
C       0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Q       0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
E       0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0
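
Sparse (one-hot) encoding of a sequence window can be sketched as below. The table shows only the first seven amino acids; the sketch assumes the remaining thirteen follow the column order of the BLOSUM 62 slide. Representing positions outside the sequence as all-zero vectors is also an assumption, since the slides do not say how chain ends are handled.

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYV"  # assumed input-neuron order

def sparse_encode(residue):
    """20-element vector with a single 1 at the residue's position."""
    vec = [0] * len(ALPHABET)
    vec[ALPHABET.index(residue)] = 1
    return vec

def encode_window(sequence, center, half_width):
    """Concatenate the sparse vectors of all residues in a window."""
    values = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(sequence):
            values.extend(sparse_encode(sequence[pos]))
        else:
            values.extend([0] * len(ALPHABET))  # padding beyond the chain ends
    return values
```

A window of 7 residues (`half_width=3`) therefore produces 140 input values, of which at most 7 are 1.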

Input Layer
(Figure: the sequence window IKEEHVIIQAE; one residue of the window is presented to the input layer as a 20-element sparse vector of zeros with a single 1.)

BLOSUM 62
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
B  -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3
Z  -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1
*  -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4
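
Instead of a sparse vector, each residue can be presented to the network as its row of BLOSUM 62 scores. The sketch below carries only three rows (I, K and E, copied from the matrix on this slide) to stay short; the dictionary name and the function are illustrative assumptions, and a real encoder would hold all rows.

```python
# Rows for I, K and E from BLOSUM 62; column order ARNDCQEGHILKMFPSTWYV.
BLOSUM62_ROWS = {
    "I": [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3],
    "K": [-1, 2, 0, -1, -3, 1, 1, -2, -1, -3, -2, 5, -1, -3, -1, 0, -1, -3, -2, -2],
    "E": [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
}

def blosum_encode(window):
    """Concatenate the BLOSUM 62 rows of the residues in a window."""
    values = []
    for residue in window:
        values.extend(BLOSUM62_ROWS[residue])
    return values
```

Compared with sparse encoding, similar residues now get similar input vectors, which is the motivation for using a substitution matrix as input.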

Input Layer
(Figure: the sequence window IKEEHVIIQAE; a residue E is presented to the input layer as its BLOSUM 62 row: -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2.)

Structure to Structure
(Figure: a second network whose input layer takes a window of the H, E and C outputs predicted along the sequence IKEEHVIIQAEFYLNPDQSGEF….. and passes them through weights and a hidden layer to an output layer with the units H, E and C.)

PHD method (Rost and Sander)
• Combine neural networks with sequence profiles
– 6-8 percentage points increase in prediction accuracy over standard neural networks
• Use second layer "Structure to structure" network to filter predictions
• Jury of predictors
• Set up as mail server

Position specific scoring matrices (BLAST profiles)
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 I  -2 -4 -5 -5 -2 -4 -4 -5 -5  6  0 -4  0 -2 -4 -4 -2 -4 -3  4
 2 K  -1 -1 -2 -2 -3 -1  3 -3 -2 -2 -3  4 -2 -4 -3  1  1 -4 -3  2
 3 E   5 -3 -3 -3 -3  3  1 -2 -3 -3 -3 -2 -2 -4 -3 -1 -2 -4 -3  1
 4 E  -4 -3  2  5 -6  1  5 -4 -3 -6 -6 -2 -5 -6 -4 -2 -3 -6 -5 -5
 5 H  -4  2  1  1 -5  1 -2 -4  9 -5 -2 -3 -4 -4 -5 -3 -4 -5  1 -5
 6 V  -3  0 -4 -5 -4 -4 -2 -3 -5  1 -2  1  0  1 -4 -3  3 -5 -3  5
 7 I   0 -2 -4  1 -4 -2 -4 -4 -5  1  0 -2  0  2 -5  1 -1 -5 -3  4
 8 I  -3  0 -5 -5 -4 -2 -5 -6  1  2  4 -4 -1  0 -5 -2  0 -3  5 -1
 9 Q  -2 -3 -2 -3 -5  4 -1  3  5 -5 -3 -3 -4 -2 -4  2 -1 -4  2 -2
10 A   2 -4 -4 -3  2 -3 -1 -4 -2  1 -1 -4 -3 -4  1  2  3 -5 -1  1
11 E  -1  3  1  1 -1  0  1 -4 -3 -1 -3  0  3 -5  4 -1 -3 -6 -3 -1
12 F  -3 -5 -5 -5 -4 -4 -4 -1 -1  1  1 -5  2  5 -1 -4 -4 -3  5  2
13 Y   3 -5 -5 -6  3 -4 -5 -2 -1  0 -4 -5 -3  3 -5 -2 -2 -2  7  1
14 L  -1 -3 -4 -2  1  5  1 -1 -1 -1  1 -3 -3  1 -5 -1 -1 -2  3 -2
15 N  -1 -4  4  1  5 -3 -4  2 -4 -4 -4 -3 -2 -4 -5  2  0 -5  0  0
16 P  -2  4 -4 -4 -5  0 -3  3  2 -5 -4  0 -4 -3  0  1 -2 -1  5 -3
17 D  -3 -2  1  5 -6 -2  2  2 -1 -2 -2 -3 -5 -4 -5 -1  2 -6 -3 -4

PSI-Pred (Jones, DT)
• Use alignments from iterative sequence searches (PSI-Blast) as input to a neural network
• Better predictions due to better sequence profiles
• Available as stand-alone program and via the web

Benchmarking secondary structure predictions
• CASP
– Critical Assessment of Structure Predictions
– Sequences from about-to-be-solved structures are given to groups who submit their predictions before the structure is published
• EVA
– Newly solved structures are sent to prediction servers.

EVA results (Rost et al., 2001)
• PROFphd      77.0%
• PSIPRED      76.8%
• SAM-T99sec   76.1%
• SSpro        76.0%
• Jpred2       75.5%
• PHD          71.7%
– Cubic.columbia.edu/eva

Output expansion
(Figure: the same network layout as before, a window over the sequence IKEEHVIIQAEFYLNPDQSGEF….. feeding the input layer, weights and hidden layer, but with the output layer expanded from one group of H, E and C units to several such groups.)

Several different architectures
• Sequence-to-structure
– Window sizes: 15, 17, 19 and 21
– Hidden units: 50 and 75
– 10-fold cross validation => 80 predictions
• Structure-to-structure
– Window size: 17
– Hidden units: 40
– 10-fold cross validation => 800 predictions
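
The prediction counts on this slide follow from simple enumeration. The sketch below reads the slide as: every combination of window size, hidden-layer size and cross-validation fold gives one sequence-to-structure network, and each of those outputs is then filtered by structure-to-structure networks trained in 10 folds. That reading, and the variable names, are assumptions.

```python
from itertools import product

window_sizes = (15, 17, 19, 21)
hidden_units = (50, 75)
folds = range(10)

# 4 window sizes x 2 hidden-layer sizes x 10 folds
seq_to_struct = list(product(window_sizes, hidden_units, folds))

# each first-level prediction passed through 10 second-level networks
struct_to_struct = list(product(seq_to_struct, folds))

n_first_level = len(seq_to_struct)
n_second_level = len(struct_to_struct)
```

Under this reading `n_first_level` is 80 and `n_second_level` is 800, matching the numbers on the slide.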

Balloting procedure
• Confidence of a per-residue prediction
– P(highest) - P(second highest)
– H: 0.80 E: 0.05 C: 0.15 => conf. = 0.65
• Mean per-chain confidence for all 800 predictions
– Calculate mean and standard deviation
– Averaging of per-chain predictions with Z >= 2
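
The balloting steps can be sketched as below. Each prediction is taken to be a list of (P(H), P(E), P(C)) tuples for one chain. The use of the population standard deviation and the fallback to averaging all predictions when none reaches the Z cutoff are assumptions added to make the sketch well defined; the slides do not specify either.

```python
from statistics import mean, pstdev

def residue_confidence(probs):
    """P(highest) - P(second highest) for one residue."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def ballot(predictions, z_cutoff=2.0):
    """Average the per-chain predictions whose mean confidence has Z >= z_cutoff."""
    chain_conf = [mean(residue_confidence(p) for p in pred) for pred in predictions]
    mu, sigma = mean(chain_conf), pstdev(chain_conf)
    if sigma > 0:
        selected = [pred for pred, c in zip(predictions, chain_conf)
                    if (c - mu) / sigma >= z_cutoff]
    else:
        selected = []
    if not selected:  # assumed fallback: use every prediction
        selected = predictions
    n = len(selected)
    return [tuple(sum(pred[i][k] for pred in selected) / n for k in range(3))
            for i in range(len(selected[0]))]
```

For the example on the slide, `residue_confidence((0.80, 0.05, 0.15))` gives 0.65.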

Activities to probabilities
(Figure: conversion tables for Helix, Strand and Coil that map network output activities, binned in steps of 0.05 up to 1.0, to probabilities; the example shown is the coil conversion.)
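
One way to build such a conversion table is to bin the output activities and record, per bin, how often the residue really was in that state. The slides show only the resulting table, so the procedure below (bins of width 0.05, fraction observed per bin) is an assumed reconstruction and the function names are hypothetical.

```python
def calibrate(activities, is_state, bin_width=0.05):
    """Per-bin fraction of residues that truly are in the state.

    Bins without data are reported as None.
    """
    n_bins = round(1 / bin_width)
    hits = [0] * n_bins
    totals = [0] * n_bins
    for activity, truth in zip(activities, is_state):
        b = min(int(activity / bin_width), n_bins - 1)
        totals[b] += 1
        hits[b] += bool(truth)
    return [h / t if t else None for h, t in zip(hits, totals)]

def to_probability(activity, table, bin_width=0.05):
    """Look up the calibrated probability for one output activity."""
    b = min(int(activity / bin_width), len(table) - 1)
    return table[b]
```

The table would be estimated separately for helix, strand and coil on data not used for training, then applied to new network outputs.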

Petersen et al., Proteins, 41: 17-20, 2000
• Sequence profiles as input
• Neural network technology
• Balloting of large number of neural network predictions (0.2%)
• Output expansion (0.5%)
• Probability transformation (1.2%)
• EVA (400 low homology proteins)
Ranking  Group name         Q3 Performance
1        SBI-AT             77.2 %
2        PROFsec B.Rost     76.3 %
3        Psi-pred D.Jones   76.2 %

Links to servers
• Database of links
– http://mmtsb.scripps.edu/cgibin/renderrelres?protmodel
• ProfPHD
– http://cubic.bioc.columbia.edu/
• PSIPRED
– http://bioinf.cs.ucl.ac.uk/psipred/
• JPred
– www.compbio.dundee.ac.uk/Software/JPred/jpred.html

Practical Conclusion
• If you need a secondary structure prediction, use one of the newer methods such as
– ProfPHD,
– PSIPRED, and
– JPred
• And not one of the older ones such as
– Chou-Fasman, and
– Garnier