Building phylogenetic trees


Randomized Algorithms for Three-Dimensional Protein Structures Comparison
Yaw-Ling Lin
Dept Computer Sci and Info Engineering,
Providence University, Taiwan
E-mail: [email protected]
WWW: http://www.cs.pu.edu.tw/~yawlin
1
Outline
• Introduction
• Protein Structures
• 3D structure comparisons
• Algorithms
• Benchmarking
• Comparing with other systems
• Future Works
2
Introduction
3
What are proteins ?
• Structural framework (keratin, collagen)
• Transport and storage of small molecules
(hemoglobin)
• Transmit information (hormones, receptors)
• Antibodies
• Blood clotting factors
• Enzymes
The protein is created in the cell as a unique sequence of amino acids.
[Figure: chain of amino acid residues]
4
Sequence (ACMVLLCEVEKYP…) → folding → Structure → Function (?)
5
Background and Problem definition
About [number missing in source] protein sequences are known today (non-redundant database).
This number keeps growing rapidly (large-scale sequencing projects).
The function of 40-50% of the new proteins is unknown.
Understanding biological function is important for:
• Study of fundamental biological processes
• Drug design
• Genetic engineering
What can bioinformatics do for us?
7
Drug Discovery
• Target Identification
– Which protein to inhibit?
• Lead discovery & optimization
– What sort of molecule will bind to this protein?
• Toxicology
– Side effects, target specificity
• Pharmacokinetics
– Metabolization and transport
8
Drug Development Life Cycle
• Discovery (2 to 10 years)
• Preclinical Testing (lab and animal testing)
• Phase I (20-30 healthy volunteers, used to check for safety and dosage)
• Phase II (100-300 patient volunteers, used to check for efficacy and side effects)
• Phase III (1000-5000 patient volunteers, used to monitor reactions to long-term drug use)
• FDA Review & Approval
• Post-Marketing Testing
7 – 15 years! $600-700 million!
With the aid of bioinformatics
[Figure: timeline of the stages above; horizontal axis in years, 0 to 16]
9
Drug lead screening
• 5,000 to 10,000 compounds screened
• 250 lead candidates in preclinical testing
• 5 drug candidates enter clinical testing
• 80% pass Phase I
• 30% pass Phase II
• 80% pass Phase III
• One drug approved by the FDA
10
Drug Lead Screening & Docking
Complementarity:
• Shape
• Chemical
• Electrostatic
11
Protein Structures
12
Levels of structure in proteins
13
Myoglobin structure
14
Myoglobin structure contd.
15
Myoglobin in solution
16
Three-dimensional structures of cytochrome c, lysozyme and ribonuclease
17
PDB file format
18
Protein Structures
22
Rasmol-Structure
PDB: 101M
PDB: 2DHB
23
Rasmol-Group
PDB: 101M
PDB: 2DHB
24
Structural classifications
• SCOP http://scop.mrc-lmb.cam.ac.uk/scop/
• CATH http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html
• FSSP http://www.ebi.ac.uk/dali/fssp/fssp.html
Structure comparison algorithms
• Dali
• CE
• Structal
• VAST
Contact matrix and the Dali method
Contact matrix: an $n \times n$ matrix, where $n$ = number of residues
$d(i, j) = \mathrm{distance}(C_\alpha^{i}, C_\alpha^{j})$
Idea: Similar structures have similar contact matrices
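A minimal sketch of building such a matrix, assuming each residue is represented by the 3D coordinates of its C-alpha atom. The function name `distance_matrix` and the toy coordinates are illustrative and are not taken from Dali itself:

```python
import math

def distance_matrix(ca_coords):
    """Return the n x n matrix of pairwise C-alpha distances."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]

# Toy coordinates (hypothetical, not from a real PDB entry).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
d = distance_matrix(coords)
for row in d:
    print(" ".join(f"{x:6.2f}" for x in row))
```

The matrix is symmetric with a zero diagonal, so two structures can be compared by looking for sub-blocks whose entries are similar.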
26
From distance map to structural similarities
• Imagine a transparent distance map of one protein put on top of the map of another protein (Liisa Holm & Chris Sander, J. Mol. Biol. 233):
– Matching patches centered on the diagonal correspond to matching secondary structures.
– Matches of short distances off the diagonal correspond to tertiary conformations.
– Similarity score: unmatched residues do not contribute to the score.
27
DALI algorithm outline
• Step 1: Consider all possible pairs of 6x6 submatrices of the contact matrices. Such matrices are small enough that the problem can be solved optimally.
• Step 2: Assemble the alignments from step 1. Method: Monte Carlo algorithm.
29
CE
(Shindyalov & Bourne, Protein Eng. 1998)
Protein Structure Alignment by Incremental Combinatorial
Extension (CE) of the Optimal Path
• Define an alignment fragment pair (AFP) as a continuous segment of protein A aligned against a continuous segment of protein B (without gaps).
• An alignment is a path of AFPs such that for every two consecutive AFPs there may be gaps inserted into either A or B, but not into both. That is, for every two consecutive AFPs $i$ and $i+1$:
$$p^A_{i+1} = p^A_i + m \quad\text{and}\quad p^B_{i+1} = p^B_i + m$$
or
$$p^A_{i+1} > p^A_i + m \quad\text{and}\quad p^B_{i+1} = p^B_i + m$$
or
$$p^A_{i+1} = p^A_i + m \quad\text{and}\quad p^B_{i+1} > p^B_i + m$$
where $p^A_i$ is the starting position of AFP $i$ in protein A, $p^B_i$ is its starting position in protein B, and $m$ is the AFP length.
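The gap rule for consecutive AFPs can be checked mechanically. The sketch below assumes each AFP is given as a pair of start positions `(p_A, p_B)` and that all AFPs share the same length `m`; the function names are hypothetical and not part of CE:

```python
def valid_step(prev, nxt, m):
    """Check the gap rule for two consecutive AFPs.

    prev and nxt are (start_in_A, start_in_B) pairs; m is the AFP length.
    """
    gap_a = nxt[0] - (prev[0] + m)
    gap_b = nxt[1] - (prev[1] + m)
    if gap_a < 0 or gap_b < 0:
        return False  # fragments would overlap or run backwards
    # A gap is allowed in A or in B, but not in both at once.
    return gap_a == 0 or gap_b == 0

def valid_path(afps, m):
    """True if every consecutive pair of AFPs obeys the gap rule."""
    return all(valid_step(a, b, m) for a, b in zip(afps, afps[1:]))

print(valid_path([(0, 0), (8, 8), (20, 16)], m=8))  # gap only in A
print(valid_path([(0, 0), (10, 10)], m=8))          # gaps in both proteins
```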
CE
What is a “good” AFP?
Define the distance between two different AFPs $i$ and $j$ as:
$$D_{ij} = \frac{1}{m} \sum_{k=1}^{m} \left| d^A(p^A_i + k - 1,\ p^A_j + m - k) - d^B(p^B_i + k - 1,\ p^B_j + m - k) \right|$$
$d^A(p, q)$ represents the distance between the alpha carbon atoms at positions $p$ and $q$ in protein A.
[Figure: AFPs i and j in Protein A and Protein B, with the distance D_ij between them]
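As an illustration of the AFP distance, the sketch below averages, over the m residue pairs, the absolute difference between the intra-protein C-alpha distance in A and the corresponding one in B. It assumes 0-based start positions, so residue `p + k` for k = 0..m-1 plays the role of position p + k - 1 for k = 1..m in a 1-based formula; the function name and toy chains are hypothetical:

```python
import math

def afp_distance(ca_a, ca_b, afp_i, afp_j, m):
    """Average |d_A - d_B| over the m residue pairs linking AFPs i and j.

    ca_a, ca_b: lists of C-alpha coordinates for proteins A and B.
    afp_i, afp_j: (start_in_A, start_in_B) pairs, 0-based.
    """
    (ia, ib), (ja, jb) = afp_i, afp_j
    total = 0.0
    for k in range(m):
        d_a = math.dist(ca_a[ia + k], ca_a[ja + m - 1 - k])
        d_b = math.dist(ca_b[ib + k], ca_b[jb + m - 1 - k])
        total += abs(d_a - d_b)
    return total / m

# Toy chains: B is A stretched by a factor of two along x.
chain_a = [(float(i), 0.0, 0.0) for i in range(10)]
chain_b = [(2.0 * i, 0.0, 0.0) for i in range(10)]
same = afp_distance(chain_a, chain_a, (0, 0), (4, 4), m=2)
stretched = afp_distance(chain_a, chain_b, (0, 0), (4, 4), m=2)
print(same, stretched)
```

Identical geometry gives a distance of zero, and the value grows as the two fragments pairs diverge in shape.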
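If you already have n-1 AFPs and consider adding the n-th AFP, do so only if:
$$(1)\quad D_{nn} < D_0$$
$$(2)\quad \frac{1}{n-1} \sum_{i=0}^{n-1} D_{in} < D_1$$
$$(3)\quad \frac{1}{n^2} \sum_{i=0}^{n} \sum_{j=0}^{n} D_{ij} < D_1$$
where $D_0$ and $D_1$ are distance thresholds.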
CE (cont.)
1. Select an initial AFP.
2. Build an alignment path by incrementally adding “good” AFPs that satisfy the path conditions.
3. Repeat step (2) until the proteins are completely matched, or until no good AFPs remain.
4. To assess the significance of the alignment, compare it to the alignment of random pairs of structures, and compute the Z-score based on the RMSD and number of gaps in the final alignment.
[Figure: alignment path between Protein A and Protein B]
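The acceptance test used when extending the path in step 2 can be sketched as follows. It assumes the AFP distances are already available in a square table `D` indexed 0..n, applies three threshold tests (the new AFP against itself, against the existing path on average, and the whole path on average), and uses placeholder threshold values that do not come from the slides; the function name is hypothetical:

```python
def accept_nth_afp(D, n, d0, d1):
    """Return True if AFP number n passes all three threshold tests.

    D[i][j] holds the distance between AFPs i and j; requires n >= 2.
    """
    cond1 = D[n][n] < d0
    cond2 = sum(D[i][n] for i in range(n)) / (n - 1) < d1
    cond3 = sum(D[i][j] for i in range(n + 1)
                for j in range(n + 1)) / n ** 2 < d1
    return cond1 and cond2 and cond3

# Placeholder thresholds (assumed for illustration only).
D0, D1 = 3.0, 4.0
close = [[0.5] * 3 for _ in range(3)]
far = [[10.0] * 3 for _ in range(3)]
print(accept_nth_afp(close, 2, D0, D1), accept_nth_afp(far, 2, D0, D1))
```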
Structal
(Levitt & Gerstein, PNAS 1998)
An initial equivalence is chosen, based on matching the ends of the two structures.
Repeat until convergence:
• Superimpose the two structures so as to minimize the RMS, given the equivalence
• Given the superposition, calculate the distances $d_{ij}$ between any atom $i$ in the first protein and any atom $j$ in the second protein
• Transform distances into similarities $s_{ij} = M / [1 + (d_{ij}/d_0)^2]$, where $M = 20$ and $d_0 = 2.24$ Å
• Apply dynamic programming to define a new set of equivalences
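One pass of the distance-to-similarity transform and the dynamic programming step can be sketched as below. The superposition step is left out (it needs a least-squares fit), so the two coordinate lists are assumed to be already superimposed. The linear gap penalty of 10 and the function names are assumptions made for this example:

```python
import math

M, D0 = 20.0, 2.24  # constants from the slide; D0 in angstroms

def similarity(d):
    """Turn a distance into a similarity score s = M / (1 + (d/D0)^2)."""
    return M / (1.0 + (d / D0) ** 2)

def align(coords_a, coords_b, gap=10.0):
    """Global dynamic programming over s_ij; returns (score, pairs)."""
    n, m = len(coords_a), len(coords_b)
    s = [[similarity(math.dist(a, b)) for b in coords_b] for a in coords_a]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = -gap * i
    for j in range(1, m + 1):
        best[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + s[i - 1][j - 1],
                             best[i - 1][j] - gap,
                             best[i][j - 1] - gap)
    # Trace back from the corner to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + s[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    return best[n][m], pairs[::-1]

chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
score, pairs = align(chain, chain)
print(score, pairs)
```

In the full loop, the returned pairs would drive the next superposition, and the cycle would repeat until the pairs stop changing.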
Structal (cont)
1) Alignment fixed
2) Superimpose to minimize RMS
3) Calculate distances between all atoms
4) Use dynamic programming to find the best set of equivalences
5) Superimpose given the new alignment
6) Recalculate distances between all atoms
Approach based on comparing secondary structure arrangement
Motivation:
• Folds are often defined as arrangements of secondary structure elements (SSEs).
• Why not compare the arrangement of SSEs rather than going down to the atomic level?
[Figure: 1EJ9, human topoisomerase]
35
VAST: graph-theoretical approach
• http://www2.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
• Perform the comparison at the level of secondary structures, not residues.
• Treat each secondary structure as a vector whose direction and length correspond to those of the secondary structure. Attributes of such a vector include the type of secondary structure, number of residues, etc.
• For each pair of secondary structures, describe their relative spatial position: distance, angle, etc.
• VAST finds a maximal subset of secondary structures that are in the same relative positions in the compared protein structures and in the same order within each structure.
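A small sketch of the vector view of secondary structures, assuming each element is summarized by the coordinates of its two end points. The helper names and the toy helix/strand coordinates are hypothetical; VAST's actual descriptors are not given in the slides:

```python
import math

def sse_vector(start, end):
    """Vector from the first to the last point of one secondary structure."""
    return tuple(e - s for s, e in zip(start, end))

def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    cos_uv = dot / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_uv))))

def midpoint_distance(sse1, sse2):
    """Distance between the midpoints of two (start, end) elements."""
    mid1 = [(a + b) / 2 for a, b in zip(*sse1)]
    mid2 = [(a + b) / 2 for a, b in zip(*sse2)]
    return math.dist(mid1, mid2)

helix = ((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
strand = ((0.0, 5.0, 0.0), (0.0, 5.0, 8.0))
angle = angle_deg(sse_vector(*helix), sse_vector(*strand))
separation = midpoint_distance(helix, strand)
print(angle, separation)
```

Two proteins could then be compared by matching pairs of elements whose angle and separation agree within some tolerance.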
36
SCOP
Structural classification of proteins with a 5-level hierarchy:
• Domains: the individual entries
• Family: homologous proteins with significant sequence similarity
• Superfamily: protein families that share weak sequence similarity but have conserved functional residues (e.g. in active sites); believed to be evolutionarily related
• Fold: protein superfamilies that share the same fold (not necessarily due to common evolutionary ancestry)
• Class: all-alpha, all-beta, alpha/beta, alpha+beta, membrane proteins, small proteins
The classification is based on manual analysis by experts (Dr. Alexey Murzin).
As of May 2002: 7 main classes, 686 folds, 1073 superfamilies, 1827 families.
CATH
Structural classification of proteins with a 5-level hierarchy:
• Protein chains: the individual entries
• Homologous superfamily: proteins with highly similar structures and functions
• Topology: clusters according to the topological connections and numbers of secondary structures
• Architecture: describes the gross orientation of secondary structures, independent of connectivities (assigned manually)
• Class: derived from secondary structure content; assigned automatically for more than 90% of protein structures
The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.
As of Jan 2002: 8 main classes, 46 architectures, 1453 topologies, more than 2000 superfamilies.
FSSP
Structural classification of proteins into a tree hierarchy:
• Protein domains: the individual entries (defined using the algorithm of Holm and Sander 1994)
• Start with an all-vs-all structure comparison of protein domains
• Domains are clustered automatically using the single-linkage algorithm, based on the z-scores of the structure similarity scores
• 3242 families of more than 30,000 structures as of June 2002
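Single-linkage clustering on pairwise z-scores can be sketched with a union-find structure: two domains end up in the same cluster whenever a chain of sufficiently similar pairs connects them. This shows one flat cut of the tree at a single cutoff; the cutoff value, the function name, and the toy scores are assumptions for illustration:

```python
def single_linkage(n, zscores, z_min):
    """Group items 0..n-1; items i and j are linked when z >= z_min."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), z in zscores.items():
        if z >= z_min:
            parent[find(i)] = find(j)

    clusters = {}
    for x in range(n):
        clusters.setdefault(find(x), []).append(x)
    return sorted(clusters.values())

# Toy z-scores between four hypothetical domains.
scores = {(0, 1): 12.0, (1, 2): 9.0, (2, 3): 1.5}
clusters = single_linkage(4, scores, z_min=2.0)
print(clusters)
```

Domains 0 and 2 are grouped even though they were never compared directly, which is the defining behavior of single linkage. Repeating the cut at decreasing cutoffs yields the levels of a tree.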
Algorithms
• Measurement: RMSD (root-mean-square deviation).
• Pair the atoms of the two structures by minimum-weight bipartite matching.
• Fix one structure, and keep several 3-D orientations of the other.
• Randomly perturb these orientations, and shift to better positions until convergence.
• Report the best RMSD score and orientation.
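The scoring step for one fixed orientation can be sketched as follows. For clarity it finds the best one-to-one pairing by brute force over all permutations, which is only feasible for a handful of points; a real system would use a polynomial-time matching routine. It also assumes the matching cost is the squared distance, so that the optimal matching minimizes the RMSD. The function name is hypothetical:

```python
import itertools
import math

def matched_rmsd(fixed, moving):
    """Smallest RMSD over all one-to-one pairings of two equal-size point sets."""
    n = len(fixed)
    best = math.inf
    for perm in itertools.permutations(range(n)):
        sq = sum(math.dist(fixed[i], moving[perm[i]]) ** 2 for i in range(n))
        best = min(best, math.sqrt(sq / n))
    return best

fixed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shuffled = [fixed[2], fixed[0], fixed[1]]           # same points, new order
lifted = [(x, y, z + 1.0) for x, y, z in shuffled]  # shifted by 1 along z
print(matched_rmsd(fixed, shuffled), matched_rmsd(fixed, lifted))
```

The randomized search would call such a score for each candidate orientation and keep the orientations that lower it.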
45
INIT-S(N)
[Figure panels: N=4, N=6, N=8, N=12, N=20]
47
MB-Align Algorithm
48
MB-Align Descriptions
49
3D Transformation
• 3D rotation is done around a rotation axis
• Fundamental rotations
– About the x, y, or z axes
• Positive rotation
– Counter-clockwise rotation (when you look down the negative axis)
[Figure: x, y, z axes with the positive rotation direction marked]
50
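3D Transformation
• Rotation about Z
x’ = x cos(θ) – y sin(θ)
y’ = x sin(θ) + y cos(θ)
z’ = z
$$\begin{pmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
[Figure: x, y, z axes with the positive rotation direction about z]
• OpenGL - glRotatef(θ, 0,0,1)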
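The rotation about Z can be written directly from its three equations. This is a minimal sketch with a hypothetical function name; note that it takes the angle in radians, whereas glRotatef takes degrees:

```python
import math

def rotate_z(point, theta):
    """Rotate a 3D point about the z axis by theta radians."""
    x, y, z = point
    c, s = math.cos(theta), math.sin(theta)
    return (x * c - y * s, x * s + y * c, z)

# A quarter turn carries the x axis onto the y axis and leaves z unchanged.
rotated = rotate_z((1.0, 0.0, 5.0), math.pi / 2)
print(rotated)
```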
51
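3D Transformation
• Rotation about Y (z → x, x → y, y → z)
z’ = z cos(θ) – x sin(θ)
x’ = z sin(θ) + x cos(θ)
y’ = y
$$\begin{pmatrix} \cos\theta & 0 & \sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
[Figure: x, y, z axes with the positive rotation direction about y]
• OpenGL - glRotatef(θ, 0,1,0)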
52
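3D Transformation
• Rotation about X (y → x, z → y, x → z)
y’ = y cos(θ) – z sin(θ)
z’ = y sin(θ) + z cos(θ)
x’ = x
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta & 0 \\ 0 & \sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
[Figure: x, y, z axes with the positive rotation direction about x]
• OpenGL - glRotatef(θ, 1,0,0)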
53
3D Transformation
• Arbitrary rotation axis (rx, ry, rz)
• glRotatef(angle, rx, ry, rz)
So, which way is a positive rotation?
[Figure: rotation axis (rx, ry, rz) drawn in the x, y, z frame]
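Rotation about an arbitrary axis can be sketched with Rodrigues' rotation formula, which is a standard technique and not necessarily what the presented system uses. Like glRotatef, the sketch takes the angle in degrees and normalizes the axis; a positive angle follows the right-hand rule, that is, counter-clockwise when the axis points toward the viewer. The function name is hypothetical:

```python
import math

def rotate_about_axis(v, axis, angle_deg):
    """Rotate vector v about the given axis by angle_deg degrees."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = tuple(a / norm for a in axis)  # unit rotation axis
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    dot = sum(ki * vi for ki, vi in zip(k, v))
    cross = (k[1] * v[2] - k[2] * v[1],
             k[2] * v[0] - k[0] * v[2],
             k[0] * v[1] - k[1] * v[0])
    # v*cos + (k x v)*sin + k*(k . v)*(1 - cos)
    return tuple(vi * c + cr * s + ki * dot * (1 - c)
                 for vi, cr, ki in zip(v, cross, k))

about_z = rotate_about_axis((1.0, 0.0, 0.0), (0.0, 0.0, 2.0), 90.0)
about_y = rotate_about_axis((0.0, 0.0, 1.0), (0.0, 1.0, 0.0), 90.0)
print(about_z, about_y)
```

With the axis set to a coordinate axis, this reproduces the fundamental rotations: a quarter turn about z carries x onto y, and a quarter turn about y carries z onto x.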
54
Rotation
55
Rotation Matrix
59
Perturbation
The orientation vector is perturbed to its neighborhood.
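One simple way to move an orientation vector into its neighborhood is to add a small random offset to each component and renormalize. This is only a sketch under that assumption; the slides do not specify the actual perturbation rule, and the function name and `step` parameter are hypothetical:

```python
import math
import random

def perturb(direction, step, rng=random):
    """Return a unit vector close to `direction` (assumed to be a unit vector).

    Each component is shifted by a uniform offset in [-step, step];
    step should be small (well below 0.5) so the result stays nearby.
    """
    noisy = [c + rng.uniform(-step, step) for c in direction]
    norm = math.sqrt(sum(c * c for c in noisy))
    return tuple(c / norm for c in noisy)

rng = random.Random(0)  # seeded for repeatability
nearby = perturb((0.0, 0.0, 1.0), step=0.1, rng=rng)
print(nearby)
```

A search loop would score each perturbed orientation and keep it only when the score improves.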
60
[Figure: rotation by angle θ about r, the normal vector.]
61
Perturbation Algorithm
62
MB-Align Algorithm
63
System Implementations
• OS: Red Hat Linux 7.2 running on a Pentium 4 2800 MHz CPU with 1 GB RAM.
• BioPerl: PDB file format conversion
• Rotation/perturbation/integration: C programs
• Minimum bipartite matching: LEDA
• RMSD: PROFIT
64
Benchmarking
65
Benchmarks
67
Efficiencies of Strategies
Local dice: have a dice for each $s_i \in S$
Global dice: share a common dice for each $s_i \in S$
68
The End.
69