Finding conserved regions in sequence alignments


Phylogenetic Analysis
Introduction
• Intention
– Using powerful algorithms to reconstruct the
evolutionary history of all known organisms.
• Phylogenetic tree
– It can help understand the evolutionary
relationships among species of organisms.
– But we have to infer the evolutionary history of
current organisms.
[Figures: Campanulaceae (bluebell) family; Herpesviruses]
Common Phylogenetic Tree Terminology
[Figure: tree with terminal nodes A to E, labeled as follows.]
• Terminal nodes: represent the taxa (genes, populations, species, etc.) used to infer the phylogeny
• Branches or lineages
• Internal nodes or divergence points: represent hypothetical ancestors of the taxa
• Ancestral node or root of the tree
Three types of trees
• Cladogram: branch lengths have no meaning
• Phylogram: branch lengths show genetic change
• Ultrametric tree: branch lengths show time
[Figure: the same four taxa (A to D) drawn as a cladogram, as a phylogram with branch lengths marked (1, 1, 3, 1, 5), and as an ultrametric tree.]
All show the same evolutionary relationships, or branching orders, between the taxa.
Phylogenetic trees diagram the evolutionary
relationships between the taxa
[Figure: rooted tree with Taxon B, Taxon C, Taxon A, Taxon D and Taxon E listed from top to bottom.]
• Top-to-bottom dimension: no meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
• Other dimension: either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).
((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses
This notation says that B and C are more closely related to each other than either is to A,
and that A, B, and C form a clade that is a sister group to the clade composed of
D and E. If the tree has a time scale, then D and E are the most closely related.
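The nested-parentheses notation can be read mechanically. The following is a minimal sketch, assuming plain leaf names with no branch lengths; the helper names `parse_newick` and `clades` are hypothetical and not part of the lecture. It parses the string into nested tuples and lists the set of taxa under each internal node, which is one way to see that B and C form a clade inside the A, B, C clade.

```python
def parse_newick(text):
    """Parse a topology such as '((A,(B,C)),(D,E))' into nested tuples."""
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            pos += 1  # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]  # a leaf name

    return parse_node()


def clades(tree):
    """Return the set of leaf names below every internal node."""
    found = []

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.append(members)
        return members

    leaves(tree)
    return found


tree = parse_newick("((A,(B,C)),(D,E))")
for clade in clades(tree):
    print(sorted(clade))
```

Because the tuples record only branching order, reordering the children of any node gives a different tuple but the same list of clades.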
The goal of phylogeny inference is to resolve the
branching orders of lineages in evolutionary trees:
• Completely unresolved or "star" phylogeny
• Partially resolved phylogeny
• Fully resolved, bifurcating phylogeny
[Figure: taxa A to E drawn as each of the three phylogenies; one node is labeled "polytomy or multifurcation" and another "a bifurcation".]
There are three possible unrooted trees for
four taxa (A, B, C, D)
• Tree 1: A grouped with B, C grouped with D
• Tree 2: A grouped with C, B grouped with D
• Tree 3: A grouped with D, B grouped with C
Phylogenetic tree building (or inference) methods are aimed at
discovering which of the possible unrooted trees is "correct".
We would like this to be the “true” biological tree — that is, one
that accurately represents the evolutionary history of the taxa.
However, we must settle for discovering the computationally
correct or optimal tree for the phylogenetic method of choice.
C-B Stewart, NHGRI lecture, 12/5/00
The number of unrooted trees increases in a greater
than exponential manner with number of taxa
# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
...
30            ≈ 8.69 x 10^36
(2N - 5)!! = # unrooted trees for N taxa
(2N - 3)!! = # rooted trees for N taxa
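The two double-factorial formulas are easy to evaluate directly. The sketch below is only an illustration; the function names are hypothetical. It uses Python's arbitrary-precision integers, so the count for 30 taxa is computed exactly.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n):
    """(2n - 5)!! unrooted bifurcating trees for n taxa."""
    return double_factorial(2 * n - 5)


def rooted_trees(n):
    """(2n - 3)!! rooted bifurcating trees for n taxa."""
    return double_factorial(2 * n - 3)


for n in (3, 4, 5, 6, 7, 8, 9, 10, 30):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Note that `rooted_trees(n)` equals `unrooted_trees(n + 1)`, since a root behaves like one extra taxon.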
Introduction
• NP-Hard optimization problem
– Number of unrooted trees of n organisms = TU(n)
– Number of edges in an unrooted tree of n organisms = E(n) = 2n-3, n>=2
– TU(n) = TU(n-1)*E(n-1) = $\prod_{i=2}^{n-1} E(i) = \prod_{i=3}^{n} (2i-5)$
– Ex. adding a taxon t to the unrooted tree of x, y, z: t can be attached to any of its E(3) = 3 edges, giving TU(4) = 3 trees
– Number of rooted trees of n organisms = TR(n) = TU(n)*E(n) = TU(n+1)
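The recurrence can be checked by actually building the trees. The sketch below is an illustration under stated assumptions: rooted trees are represented as nested tuples, and `insert_leaf` and `all_rooted_trees` are hypothetical helper names. A new leaf is attached to every edge of every existing tree, including the edge above the root, so each rooted tree on n leaves yields 2n-1 trees on n+1 leaves.

```python
def insert_leaf(tree, leaf):
    """Yield every rooted tree made by attaching `leaf` to one edge of `tree`."""
    # Attach on the edge directly above this subtree.
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)


def all_rooted_trees(leaves):
    """Enumerate all rooted bifurcating trees on the given leaves."""
    trees = [leaves[0]]
    for leaf in leaves[1:]:
        trees = [new for tree in trees for new in insert_leaf(tree, leaf)]
    return trees


for taxa in (["x", "y"], ["x", "y", "z"], ["x", "y", "z", "t"]):
    print(len(taxa), "taxa:", len(all_rooted_trees(taxa)), "rooted trees")
```

By the identity TR(n) = TU(n+1), the same counts give the number of unrooted trees on one more taxon.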
Inferring evolutionary relationships between
the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root:
[Figure: an unrooted tree of taxa A, B, C and D with a root position marked, and the rooted tree obtained by pulling on that root.]
Note that in this rooted tree, taxon A is
no more closely related to taxon B than
it is to C or D.
Now, try it again with the root at another position:
[Figure: the same unrooted tree with the root moved to a different branch, and the resulting rooted tree.]
Note that in this rooted tree, taxon A is most
closely related to taxon B, and together they
are equally distantly related to taxa C and D.
An unrooted, four-taxon tree theoretically can be rooted in five
different places to produce five different rooted trees
[Figure: the unrooted tree 1 with its five branches numbered 1 to 5, and the resulting rooted trees 1a, 1b, 1c, 1d and 1e.]
These trees show five different evolutionary relationships among the taxa!
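Rooting on each branch in turn can be sketched in a few lines. This is an illustration only: it assumes the unrooted tree groups A with B and C with D, and the internal node names `u` and `v` are hypothetical. Each rooted tree is written as nested parentheses, with children sorted so that rotated drawings of the same tree give the same string.

```python
# Unrooted four-taxon tree as an adjacency list.
# "u" and "v" are made-up names for the two internal nodes.
ADJACENCY = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["v"],
    "u": ["A", "B", "v"], "v": ["C", "D", "u"],
}
EDGES = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"), ("D", "v")]


def subtree(node, parent):
    """Nested-parentheses string for the part of the tree hanging below `node`."""
    children = [n for n in ADJACENCY[node] if n != parent]
    if not children:
        return node  # a leaf
    return "(" + ",".join(sorted(subtree(c, node) for c in children)) + ")"


def root_on_edge(p, q):
    """Place the root in the middle of edge (p, q)."""
    return "(" + ",".join(sorted([subtree(p, q), subtree(q, p)])) + ")"


rooted = [root_on_edge(p, q) for p, q in EDGES]
for text in rooted:
    print(text)
```

The five strings are all different, which matches the point that the position of the root changes the inferred relationships.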
All of these rearrangements show the same evolutionary
relationships between the taxa
[Figure: rooted tree 1a redrawn several times with the taxa A, B, C and D in different top-to-bottom orders.]
Molecular phylogenetic tree building methods:
Are mathematical and/or statistical methods for inferring the divergence
order of taxa, as well as the lengths of the branches that connect them.
There are many phylogenetic methods available today, each having
strengths and weaknesses. Most can be classified as follows:
                          DATA TYPE
COMPUTATIONAL METHOD      Characters            Distances
Optimality criterion      PARSIMONY             MINIMUM EVOLUTION
                          MAXIMUM LIKELIHOOD    LEAST SQUARES
Clustering algorithm                            UPGMA
                                                NEIGHBOR-JOINING
parsimony
• model complexity vs. sample size
• minimize Hamming distance summed over
all edges of the tree
• justification: minimum possible number of
evolutionary events
• subject of serious dispute by systematic
biologists
Method
– Maximum parsimony (MP)
• Seek the tree that minimizes the total number of
evolutionary events on the edges of the tree
• Ex. [Figure: candidate trees over the sequences AAA, AAG, AGA and GGA, with sequences at the internal nodes and the number of changes (1 or 2) marked on each edge.]
• Requires two algorithms
– Search over tree topologies
– Computation of the cost of a given tree
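The second of the two algorithms, the cost of a given tree, can be sketched with Fitch's small-parsimony procedure. This is a swapped-in standard technique, not necessarily the one used in the lecture. The leaf names `s1` to `s4`, their assignment to the four sequences, and the nested-tuple tree encoding are all assumptions made for illustration.

```python
def fitch_site(tree, column):
    """Return (possible ancestral states, minimum changes) for one alignment column."""
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], column)
    right_set, right_cost = fitch_site(tree[1], column)
    common = left_set & right_set
    if common:
        # The children can agree on a state, so no change is needed here.
        return common, left_cost + right_cost
    # The children disagree, which costs one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_cost(tree, sequences):
    """Sum the per-column Fitch cost over all alignment columns."""
    length = len(next(iter(sequences.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in sequences.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical assignment of the four sequences to leaves.
sequences = {"s1": "AAA", "s2": "AAG", "s3": "AGA", "s4": "GGA"}
tree_1 = (("s1", "s2"), ("s3", "s4"))
tree_2 = (("s1", "s3"), ("s2", "s4"))
print(parsimony_cost(tree_1, sequences), parsimony_cost(tree_2, sequences))
```

A full maximum-parsimony search would call `parsimony_cost` on every candidate topology, or on a heuristic subset of them, and keep the cheapest.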
maximum likelihood
• estimate probability that a specific
evolutionary model will produce a
particular phylogeny yielding the observed
sequences
• many evolutionary models
Method
– Maximum likelihood (ML)
• Seek the tree that maximizes the likelihood P(data|tree)
• Ex.
– Compute the likelihood P(x1,x2,x3|T,t1,t2,t3,t4)
– x•: a set of sequences
– T: a tree
– t•: edge lengths of the tree
[Figure: rooted tree with root X5, internal node X4 and leaves X1, X2 and X3; the four edges are labeled t1 to t4.]
• Requires two algorithms
– Search over tree topologies
– Search over all possible edge lengths t• to compute the
likelihood
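For a fixed tree and fixed edge lengths, the likelihood of one alignment column can be computed by summing over the unknown ancestral bases. The sketch below uses Felsenstein's pruning recursion with the Jukes-Cantor substitution model. Both the model and the edge lengths are assumptions chosen for illustration; the slide does not say which model is intended or which edge carries which t.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69(a, b, t):
    """Jukes-Cantor probability that base a is replaced by base b over an edge of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def conditional(node, site):
    """Likelihood of the leaves below `node`, for each possible base at `node`.

    A node is either a leaf name or a list of (child, edge_length) pairs.
    """
    if isinstance(node, str):
        return {a: 1.0 if a == site[node] else 0.0 for a in BASES}
    result = {a: 1.0 for a in BASES}
    for child, t in node:
        below = conditional(child, site)
        for a in BASES:
            result[a] *= sum(jc69(a, b, t) * below[b] for b in BASES)
    return result


def site_likelihood(tree, site):
    """P(column | tree, edge lengths) with a uniform base distribution at the root."""
    root = conditional(tree, site)
    return sum(0.25 * root[a] for a in BASES)


# Hypothetical edge lengths: x1 and x2 join at an internal node,
# which joins x3 at the root.
tree = [([("x1", 0.1), ("x2", 0.2)], 0.3), ("x3", 0.4)]
print(site_likelihood(tree, {"x1": "A", "x2": "A", "x3": "G"}))

# Sanity check: the likelihoods of all 64 possible columns sum to 1.
total = sum(
    site_likelihood(tree, dict(zip(("x1", "x2", "x3"), bases)))
    for bases in product(BASES, repeat=3)
)
print(total)
```

The likelihood of a whole alignment is the product of the per-column values. The search over edge lengths then wraps this function in a numerical optimizer.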
Distance Matrix Methods
• produce a tree such that the path distance between leaves i
and j (sum of edge weights in the path between i and j)
equals Dij
• this is the additive property for a distance matrix; of course
real distance matrices may not be additive
• most methods use agglomerative clustering -- successively
choosing pairs of nodes to combine
Ultrametric trees
• path distance from the root to each leaf is
the same
• strong molecular clock assumption: distance is proportional to evolutionary time
Example Tree and Additive
Matrix
[Figure: example tree with leaves a to e and edge weights 3, 3, 2, 2, 5, 1, 1, 3.]
      A    B    C    D    E
A     0   10   12   10    7
B          0    4    4   13
C               0    6   15
D                    0   13
E                         0
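Whether a matrix is additive or ultrametric can be tested without building a tree. The sketch below uses two standard checks: the four-point condition for additivity and the three-point condition for ultrametricity. The matrix values are the example distances as read from the slide; the function names are hypothetical.

```python
import math
from itertools import combinations

TAXA = ["A", "B", "C", "D", "E"]
# Upper triangle of the example matrix, as read from the slide.
DIST = {
    ("A", "B"): 10, ("A", "C"): 12, ("A", "D"): 10, ("A", "E"): 7,
    ("B", "C"): 4, ("B", "D"): 4, ("B", "E"): 13,
    ("C", "D"): 6, ("C", "E"): 15,
    ("D", "E"): 13,
}


def d(x, y):
    """Symmetric lookup into the upper-triangular table."""
    if x == y:
        return 0
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]


def is_additive(taxa):
    """Four-point condition: of the three pair sums, the two largest are equal."""
    for w, x, y, z in combinations(taxa, 4):
        sums = sorted([d(w, x) + d(y, z), d(w, y) + d(x, z), d(w, z) + d(x, y)])
        if not math.isclose(sums[1], sums[2]):
            return False
    return True


def is_ultrametric(taxa):
    """Three-point condition: of the three distances, the two largest are equal."""
    for x, y, z in combinations(taxa, 3):
        dists = sorted([d(x, y), d(x, z), d(y, z)])
        if not math.isclose(dists[1], dists[2]):
            return False
    return True


print("additive:", is_additive(TAXA))
print("ultrametric:", is_ultrametric(TAXA))
```

On this matrix the first check passes and the second fails, so the distances fit a tree exactly but not one that satisfies a strict molecular clock.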
Distance Matrix Methods
• UPGMA
• Neighbor Joining
• Fitch-Margoliash
• Quartet Puzzling
• Witness-Antiwitness
• Double Pivot
many are “not yet in use by the systematic biology
community”
Distance Measures
• DNA hybridization amounts
• immunological distances
• genetic distances
• sequence distances (DNA, RNA, protein)
…what distance?
• need distance measure that reflects the
actual number of point mutations on the
path between the leaves
• particular problem with sequence data: Hamming distance and its
assumption of no reversals
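One common way to account for repeated and reversed changes at the same site is to apply a model-based correction to the observed fraction of differing sites. The sketch below uses the Jukes-Cantor correction as an example. The lecture does not name a specific correction, so this choice and the function names are assumptions.

```python
import math


def hamming_fraction(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor_distance(a, b):
    """Estimated substitutions per site, corrected for multiple hits."""
    p = hamming_fraction(a, b)
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


seq_1 = "AAAAAAAAAA"
seq_2 = "AAAAAAAAGG"
print(hamming_fraction(seq_1, seq_2), jukes_cantor_distance(seq_1, seq_2))
```

The corrected distance is always at least as large as the raw fraction, and the gap widens as the sequences diverge.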
UPGMA
• Unweighted Pair-Group Method with
Arithmetic mean
UPGMA Step 1
combine B and C
      A    B    C    D    E
A     0   10   12   10    7
B          0    4    4   13
C               0    6   15
D                    0   13
E                         0
[Figure: the five leaves a to e, not yet joined.]
UPGMA step 2
combine BC and D
      A    BC   D    E
A     0   11   10    7
BC         0    5   14
D               0   13
E                    0
d(A,BC) = (10+12)/2 = 11
d(BC,D) = (4+6)/2 = 5
[Figure: b and c joined by branches of length 2 each.]
UPGMA step 3
combine A and E
      A    BCD    E
A     0   10.5    7
BCD        0     13.5
E                 0
[Figure: d joined to the bc cluster, with branch lengths 2.5 (to d) and 0.5 (to bc); b and c keep branches of length 2.]
UPGMA step 4
combine AE and BCD
      AE   BCD
AE     0   12
BCD         0
[Figure: a and e joined by branches of length 3.5 each, beside the bcd cluster (branch lengths 2.5, 0.5, 2, 2).]
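UPGMA Result
[Figure: the final UPGMA tree shown with the original distance matrix. The ae cluster (branches 3.5 and 3.5) and the bcd cluster (branches 2.5, 0.5, 2, 2) are joined at the root by branches of length 2.5 and 3.5.]
[Figure: the original example tree (edge weights 3, 3, 2, 2, 5, 1, 1, 3) shown beside the UPGMA tree for comparison.]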
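The four steps above can be reproduced with a short agglomerative loop. This is a minimal sketch, not a reference implementation: the function name `cluster` and its `size_weighted` flag are hypothetical. With the default setting, the distance from a merged cluster to any other cluster is the plain average of the two merged rows, which reproduces the 10.5 and 13.5 in step 3. With `size_weighted=True` each row is weighted by its number of leaves, as in the usual definition of UPGMA, which would give about 10.67 and 13.67 instead.

```python
def cluster(distances, size_weighted=False):
    """Agglomerative clustering; returns the merges as (left, right, distance)."""
    dist = {frozenset(pair): value for pair, value in distances.items()}
    clusters = sorted({name for pair in distances for name in pair})
    size = {name: 1 for name in clusters}
    merges = []
    while len(clusters) > 1:
        # Closest pair of clusters; ties go to the first pair in list order.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: dist[frozenset(pair)],
        )
        merged = "".join(sorted(a + b))
        merges.append((a, b, dist[frozenset((a, b))]))
        wa, wb = (size[a], size[b]) if size_weighted else (1, 1)
        for other in clusters:
            if other not in (a, b):
                dist[frozenset((merged, other))] = (
                    wa * dist[frozenset((a, other))]
                    + wb * dist[frozenset((b, other))]
                ) / (wa + wb)
        size[merged] = size[a] + size[b]
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges


# Example matrix as read from the slides.
EXAMPLE = {
    ("A", "B"): 10, ("A", "C"): 12, ("A", "D"): 10, ("A", "E"): 7,
    ("B", "C"): 4, ("B", "D"): 4, ("B", "E"): 13,
    ("C", "D"): 6, ("C", "E"): 15,
    ("D", "E"): 13,
}

for left, right, distance in cluster(EXAMPLE):
    # Each merge is drawn at a height of half the cluster distance.
    print(left, right, distance, "height", distance / 2)
```

The merge heights printed here (2, 2.5, 3.5 and 6) correspond to the branch lengths in the step figures. Note that B,C and B,D are tied at distance 4 in the first step, so the result depends on how ties are broken.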
Method
• Phylogenetic reconstruction techniques
– NJ (neighbor-joining method)
• Starting from a star tree, a branch is successively inserted between a
pair of closest neighbors and the remaining terminals in
the tree
• Characteristics
– One of the fastest reconstruction methods
– Poor accuracy when the distance matrix contains
large values
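Method
• Ex.
Distance matrix
      S1   S2   S3   S4
S1     0    4    4    3
S2          0    6    5
S3               0    2
S4                    0
Star tree
[Figure: S1, S2, S3 and S4 each joined to a central node X; the edge values 3.67, 5, 4 and 3.33 are average(S1) to average(S4), the mean distance from each sequence to the other three.]
Pair S1 and S2
– The cost saved by pairing S1 and S2 =
Old connection cost (OC) – New connection cost (NC) = 2.33
NC = ½(average(S1)+average(S2)+d(S1,S2)) = 6.33
OC = average(S1)+average(S2) = 8.67
– The largest cost saving, 2.67, comes from pairing S3 and S4
Thus we pair S3 and S4
[Figure: the tree after S3 and S4 are paired, with S1 and S2 still attached to X.]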
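The pairing criterion used in this example can be evaluated for every pair at once. The sketch below follows the cost-saving formula exactly as stated on the slide, which is a simplified presentation and not the full neighbor-joining algorithm with its branch-length and matrix updates. The function name `pairing_savings` is hypothetical.

```python
NAMES = ["S1", "S2", "S3", "S4"]
UPPER = {
    ("S1", "S2"): 4, ("S1", "S3"): 4, ("S1", "S4"): 3,
    ("S2", "S3"): 6, ("S2", "S4"): 5,
    ("S3", "S4"): 2,
}


def pairing_savings(names, upper):
    """Saving OC - NC for each pair, using the slide's connection costs."""
    def d(a, b):
        return upper[(a, b)] if (a, b) in upper else upper[(b, a)]

    # average(Si): mean distance from Si to every other sequence.
    average = {
        a: sum(d(a, b) for b in names if b != a) / (len(names) - 1)
        for a in names
    }
    savings = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            old_cost = average[a] + average[b]
            new_cost = 0.5 * (average[a] + average[b] + d(a, b))
            savings[(a, b)] = old_cost - new_cost
    return savings


savings = pairing_savings(NAMES, UPPER)
for pair, value in sorted(savings.items(), key=lambda item: -item[1]):
    print(pair, round(value, 2))
best_pair = max(savings, key=savings.get)
print("pair first:", best_pair)
```

Since OC - NC simplifies to ½(average(a) + average(b) - d(a,b)), the criterion favors pairs that are close to each other but far, on average, from everything else.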
Neighbor-Joining Result
[Figure: the neighbor-joining tree for leaves a to e shown beside the original example tree (edge weights 3, 3, 2, 2, 5, 1, 1, 3).]
Genome Rearrangement
– Generalized Nadeau-Taylor (GNT) evolution model
• P(transposition) = α
• P(inverted transposition) = β
• P(inversion) = 1-(α+β)
• Number of events on an edge: follows a Poisson distribution,
$f(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$, x = 0, 1, 2, ...
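The two ingredients of the model, the Poisson count of events on an edge and the choice of event type, can be sketched as follows. The function names and the parameter values are hypothetical, and the code only draws event labels. It does not apply the rearrangements to a gene order.

```python
import math
import random


def poisson_pmf(x, lam):
    """Probability of exactly x events when the expected number is lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def sample_event_types(count, alpha, beta, rng):
    """Draw `count` event types with the GNT probabilities alpha, beta, 1-(alpha+beta)."""
    kinds = ["transposition", "inverted transposition", "inversion"]
    weights = [alpha, beta, 1.0 - (alpha + beta)]
    return rng.choices(kinds, weights=weights, k=count)


rng = random.Random(0)  # fixed seed so the run is repeatable
print([round(poisson_pmf(x, 3.0), 3) for x in range(6)])
print(sample_event_types(5, alpha=0.2, beta=0.3, rng=rng))
```

Setting alpha and beta to zero reduces the model to inversions only.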
Improving reconstruction algorithms
– Estimators of true evolutionary distance
• Exact-IEBP (inverting the expected breakpoint distance):
ML estimate of the breakpoint distance after K
rearrangements
• Approx-IEBP:
approximates Exact-IEBP
• EDE (empirically derived estimator):
empirical estimate of the inversion distance after K
rearrangements, based on a nonlinear regression formula that
computes the expected distance given K random
rearrangements
Conclusion
• New generation of phylogenetic software needs
– More sophisticated models of evolution
– Faster optimization algorithms
– High performance algorithm engineering
– Powerful modes of user interaction