Document

Transcript Document

ReCombinatorics: Phylogenetic
Networks with Recombination
Two recent results and Two Open Questions
CPM, June 18, 2008
Pisa, Italy
What is population genomics?
• The Human genome “sequence” is done.
• Now we want to sequence many individuals
in a population to correlate similarities and
differences in their sequences with genetic
traits (e.g. disease or disease susceptibility).
• Presently, we can’t sequence large numbers
of individuals, but we can sample the
sequences at SNP sites.
SNP Data
• A SNP is a Single Nucleotide Polymorphism - a site in the
genome where two different nucleotides appear with
sufficient frequency in the population (say each with 5%
frequency or more).
• SNP maps have been compiled with a density of about 1
site per 1000.
• SNP data is what is mostly collected in populations - it is
much cheaper to collect than full sequence data, and
focuses on variation in the population, which is what is of
interest.
Haplotype Map Project:
HAPMAP
• NIH lead project ($100M) to find common SNP
haplotypes (“SNP sequences”) in the Human population.
• Association mapping: HAPMAP used to try to associate
genetic-influenced diseases with specific SNP haplotypes,
to either find causal haplotypes, or to find the region near
causal mutations.
• The key to the logic of Association mapping is historical
recombination in populations. Nature has done the
experiments, now we try to make sense of the results.
The Perfect Phylogeny Model for
SNP sequences
Only one mutation per site
allowed.
sites 12345
Ancestral sequence 00000
1
4
Site mutations on edges
3
The tree derives the set M:
2
10100
10100
5
10000
01011
01010
00010
10000
00010
01010
01011
Extant sequences at the leaves
When can a set of sequences be
derived on a perfect phylogeny?
Classic NASC: Arrange the sequences in a
matrix. Then (with no duplicate columns),
the sequences can be generated on a unique
perfect phylogeny if and only if no two
columns (sites) contain all four pairs:
0,0 and 0,1 and 1,0 and 1,1
This is the 4-Gamete Test
A richer model
M
10100
10000
01011
01010
00010
10101 added
Pair 4, 5 fails the four
gamete-test. The sites 4, 5
are incompatible.
12345
00000
1
4
3
10100
2
00010
5
10000
0101101010
Real sequence histories often involve recombination.
Sequence Recombination
01011
10100
S
P
5
Single crossover recombination
10101
A recombination of P and S at recombination point 5.
The first 4 sites come from P (Prefix) and the sites
from 5 onward come from S (Suffix).
Network with Recombination:
ARG
M
10100
10000
01011
01010
00010
10101 new
12345
00000
1
4
3
2
10100
The previous tree with one
recombination event now derives
all the sequences.
P
00010
5
10000
5
10101
S
0101101010
A Min ARG for Kreitman’s data
ARG created by
SHRUB
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
An illustration of why we are
interested in recombination:
Association Mapping of
Complex Diseases Using
ARGs
Association Mapping
• A major strategy being practiced to find genes
influencing disease from haplotypes of a subset of
SNPs.
– Disease mutations: unobserved.
• A simple example to explain association mapping
and why ARGs are useful, assuming the true ARG is
known.
Disease mutation site
0
1
0
SNPs
0
1
Very Simplistic Mapping the Unobserved
Mutation of Mendelian Diseases with ARGs
The single disease
00000
mutation occurs
4
00010
between
a:00010 sites 2 and
3
What part of 01100 3!10010
1
d, e, f inherit?
00100
1 2 3 4 5
2
b:10010
d:
01100
P
S
e:
c:00100
P
3
f:
? ?
Where is the
disease mutation?
d:10100
Assumption (for now):
A sequence is diseased
iff it carries the single
disease mutation
5
00101
S
4
01101
g:00101
f:01101
e:01100
Diseased
Mapping Disease Gene with
Inferred ARGs
• “..the best information that we could possibly
get about association is to know the full
coalescent genealogy…” – Zollner and
Pritchard, 2005
• But we do not know the true ARG!
• Goal: infer ARGs from SNP data for
association mapping
– Not easy and often approximation (e.g. Zollner and
Pritchard)
– Improved results to do the inference Y. Wu (RECOMB 2007)
Results on Reconstructing the
Evolution of SNP Sequences
• Part I: Clean mathematical and algorithmic results: Galled-Trees, nearuniqueness, graph-theory lower bound, and the Decomposition
theorem, Forest Theorem and the History Lower bound.
• Part II: Practical computation of Lower and Upper bounds on the
number of recombinations needed. Construction of (optimal)
phylogenetic networks; uniform sampling; haplotyping with ARGs;
LD mapping …
• Part III: Varied Biological Applications
• Part IV: Extension to Gene Conversion
• Part V: The Minimum Mosaic Model of Recombination (CPM 2007)
This talk will discuss two topics in Part I
Minimizing Recombinations in
unconstrained networks
• Problem: given a set of sequences M, find a phylogenetic network
generating M, minimizing the number of recombinations used to
generate M, allowing only one mutation per site. This has biological
meaning in appropriate contexts.
• The general minimization problem is NP-hard.
• We can solve this problem in poly-time for the special case of GalledTrees, to be defined.
The Decomposition Theorem
Since the minimization problem is NP-hard
we want to break up a problem into
subproblems that can be solved separately
and combined.
Incompatible Sites
A pair of sites (columns) of M that fail the
4-gametes test are said to be incompatible.
A site that is not in such a pair is compatible.
M
12345
a 00010
b 10010
c 00100
d 10100
e 01100
f 01101
g 00101
Incompatibility Graph G(M)
4
1
3
2
5
Two nodes are connected iff the pair
of sites are incompatible, i.e, fail the
4-gamete test.
THE MAIN TOOL: We represent the pairwise
incompatibilities in a incompatibility graph.
The connected components of
G(M) are very informative
For example we have the Theorem:
The number of non-trivial connected components is a lower-bound on the
number of recombinations needed in any network. We will see
that the non-trivial connected components are the key to the finest
possible decomposition, and have other essential uses.
Recombination Cycles
• In a Phylogenetic Network, with a
recombination node x, if we trace two paths
backwards from x, then the paths will
eventually meet.
• The cycle specified by those two paths is
called a ``recombination cycle”.
A maximal set of intersecting
cycles forms a Blob
00010
10010
00000
4
3
1
00100
2
P
S
3
01100
p
5
00101
S
4
01101
If directions on the edges are removed, a blob is
a bi-connected component of the network.
Blobed Trees
• Contracting each blob in a network results in a directed,
rooted tree, otherwise one of the “blobs” was not maximal.
Simple, but key insight.
• So every phylogenetic network can be viewed as a directed
tree of blobs - a blobbed-tree.
The blobs are the non-tree-like parts of the network.
A blob that is just a single cycle is called a “gall”, and a
network where all blobs are galls is called a ``GalledTree”.
Every network is a
tree of blobs.
Ugly tangled
network inside
the blob.
A network
where every blob is a
single cycle is a
Galled-Tree.
A Simple Observation
In any network N for M, all sites from the
same non-trivial connected component of
G(M) must appear together in a single blob
in N.
The Decomposition Theorem
Theorem: For any set of sequences M, there is a
phylogenetic network that derives M, where each blob
contains all and only the sites in one non-trivial connected
component of G(M). The compatible sites can always be
put on edges outside of any blob. This
“fully-decomposed” network is the finest decomposition
possible.
Example: Network for input M
with one blob
00010
a:00010
10010
00000
4
3
1
00100
2
b:10010
P
c:00100
S
3
01100
p
d:10100
5
00101
S
4
01101
f:01101
e:01100
g:00101
The fullydecomposed
network for M
4
Incompatibility Graph
4
3
1
1
3
2
5
p s
a: 00010
2
b: 10010
c: 00100
d: 10100
2
p
e: 01100
5
4
s
f: 01101
g: 00101
Moreover, the backbone tree and the
partition of sites into blobs, and the
sequences exported from any blob are all
invariant features of the fully-decomposed
networks for M, and can be determined in
polynomial-time.
So, we can find a network for M by solving
the (rooted) recombination minimization
problem for each connected component of
G(M) separately, and then connect those
subnetworks in an invariant way.
The resulting network will be a network
with the fewest recombination nodes over
all fully-decomposed networks for M.
Algorithmically
• Finding the tree part of the blobbed-tree is easy.
• Determining the sequences labeling the exterior nodes on any blob is
easy.
• Determining a “good” structure inside a blob B is the problem of
generating the sequences of the exterior nodes of B.
• It is easy to test whether the exterior sequences on B can be generated
with only a single recombination. The original galled-tree problem is
now just the problem of testing whether one single-crossover
recombination is sufficient for each blob.
• That can be solved by successively removing each exterior sequence
and testing if the remaining sequences can be generated on a perfect
phylogeny of the correct form.
Proof Ideas
Let C be a connected component of G(M).
Define M[C] as the sequences in M
restricted to the sites in C.
M
12345
a 00010
b 10010
c 00100
d 10100
e 01100
f 01101
g 00101
4
C2
C1
1
3
B1
2
5
B2
134
a001
b101
c010
d110
e010
f 010
g010
25
00
00
00
00
10
11
01
a
b
c
d
e
f
g
M[C1]
M[C2]
Now for each connected component C in G(M), call each distinct
sequence in M[C] a supercharacter, and let W be the indicator
matrix for the supercharacters. So W indicates which rows of M
contain which particular supercharacters.
1
2
3
4
3
3
3
134
a001
b101
c010
d110
e010
f010
g 01 0
M[C1]
5
5
5
5
6
7
8
a
b
c
d
e
f
g
25
00
00
00
00
10
11
01
M[C2]
a
b
c
d
e
f
g
12345678
10001000
01001000
00101000
00011000
00100100
00100010
00100001
W
Proof Ideas
Lemma: No pair of supercharacters are
incompatible.
So by the NASC for a Perfect Phylogeny,
there is a unique perfect phylogeny T for W.
Proof Ideas
For each connected component C of G(M), all
supercharacters that originate from C label edges in T that
are incident with one single node v[C] in T. So, if we
expand each node v[C] to be a network that generates the
supercharacters from C (the sequences in M[C]), and
connect each network correctly to the edges in T, the
resulting network is a fully-decomposed blobbed-tree that
generates M.
However …
While fully-decomposed networks always exist, they
do not necessarily minimize the number of
recombination nodes, over all possible networks.
That is, sometimes it pays to put sites from different
connected components together on the same blob.
Sufficient Conditions
But we can prove several useful sufficient conditions
for when there is a fully-decomposed network that minimizes the
number of recombinations, over all possible networks.
The deepest result:
Theorem: Let N be a phylogenetic network for input M, let L be the
set of sequences that label the nodes of N, and let G(L) be the
incompatibility graph for L. If G(L) and G(M) have the same number
of connected components, then there is a fully-decomposed network for
M with the same number of recombinations as in N.
JCB December 2007
Corollary
A fully-decomposed network exists that
minimizes the number of recombinations,
unless every optimal network uses some
recombination node(s) labeled by sequence(s)
not in M, and the addition of those sequences
to M creates an incompatibility between sites
in different components of G(M).
000000
4
3
5
Sequences in M are in black.
Sequence 100010 is not in M.
1
G(M) has two
components. Each
requires two recs, but
this combined network
needs only three.
4
s
p 6
2
100010
p
001000
3 s 5 s
p
000100
011010
010010 100001
100101
G(L) has one component. The addition of
sequence 100010
reduces the number of components from 2 to 1.
G(M) for the original data
1
2
3
4
5
6
Two components, so two blobs,
each blob requires two recombs,
by the HK lower bound theorem,
so a fully decomposed networks needs
at least four recombinations
1
2
3
4
5
6
G(L) created from the original data,
and the addition of the new
interior sequence 100010.
G(L) has only one connected
component compared to two
components for G(M).
A Practical Sufficient Condition
If M can be derived on a network N in which
every edge contains at most
one site, and every node is labeled with a
sequence in M, then there is a
fully-decomposed network for M which
minimizes the number of recombinations
over all possible networks for M.
Another Practical Sufficient
Condition
If M can be derived on a network N where
the number of recombinations equals the
(poly-computable) Haplotype Lower Bound,
then there is a fully decomposed network
for M which minimizes the number of
recombinations over all possible networks.
Theorem: For any K, there is a dataset where best fully-decomposed
network uses K recombinations more than optimal.
In that construction, the ratio of the number of recombinations
in the best fully-decomposed network to the optimal is constant as
K grows.
Open Question: Construct examples where the show that the
ratio can be arbitrarily large.
A New Recombination Lower
Bound and The Minimum Perfect
Phylogenetic Forest Problem
Yufeng Wu and Dan Gusfield
UC Davis
COCOON’07 July 16, 2007
History
Bound
(Myers & Griffiths 2003)
M
000
000
00
00
100
100
10
10
Empty.
010
010
01
01
011
011
01
One type-3
operation
111
Iterate the following operations
1. Remove a column with a single 0 or 1
2. Remove a duplicate row
3. Remove any row
History bound: the minimum number of type-3
operations needed to reduce the matrix to empty
Graphical interpretation of
history bound (HistB)
• Each operation in the history computation corresponds to an operation
that deconstructs the optimal, but unknown ARG.
• We deconstruct the optimal ARG by removing tree parts as long as
possible; then remove an exposed recombination node; repeat.
• Removing an exposed recombination node
in the ARG corresponds to a single type-3 operation. So when
deconstructing the optimal ARG, the number of recombination nodes =
number of type-3 operations.
• Since the optimal ARG is unknown, the history bound is the minimum
number of type-3 operations needed to make the matrix empty.
4
3
M
1
p s
a: 00010
2
b: 10010
c: 00100
d: 10100
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 00101
g: 00101
Operations on M
correspond to operations
on the optimal
ARG
2
e: 01100
5
g: 00101
f: 00101
12345
4
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 00101
3
1
p s
a: 00010
Type-2 operation
2
b: 10010
c: 00100
d: 10100
2
5
e: 01100
f: 00101
134
4
a: 001
b: 101
c: 010
d: 110
e: 010
f: 010
3
1
p s
a: 001
Type-1 operations
2
c: 010
b: 101
d: 110
e: 010
f: 010
134
4
3
1
Type-2 operations
p s
a: 001
a: 001
b: 101
c: 010
d: 110
2
c: 010
b: 101
d: 110
134
4
3
1
a: 001
b: 101
c: 010
Type-3 operation
a: 001
b: 101
c: 010
Then three more Type-1 operations fully reduce M
and the ARG.
History bound
• Initially required trying all n! permutations of the rows to
choose the type-3 operations.
• The bound can be computed by DP in O(2n) time (Bafna,
Bansal).
• On datasets where it can be computed, the history bound is
observed to be higher than (or equal to) all studied lower
bounds (about ten of them).
• There is no static definition for what the history bound is -it is only defined by the algorithms that compute it! The
work in this part of the talk comes out of an attempt to find
a simple static definition.
Why a static definition matters
• We want a definition of what is being computed,
independent of how it is computed, so that we can reason
about it and find alternative ways to compute or
approximate it.
• For example, with no static definition of the history bound,
we don’t know how to formulate an integer linear program
to compute it.
Intro. to Forest Bound: Decompose an
Optimal ARG to A Forest of Trees,
removing recombination edges
The number of trees is
precisely the number of
recombinations plus one
An ARG with three
recombinations
After removing recombination
edges, four trees result.
Idea behind the Forest Bound
(FB)
Each tree created in this way contains at most
one occurrence of any site, and each site occurs
in at most one of the trees. So the trees form a
forest of related perfect phylogenies.
Forest Bound
Given a set of sequences M, partition M into
the fewest subsets so that each subset of
sequences can be derived on a tree, where
each site occurs at most once in the forest of
trees.
The number of trees, minus one, is a valid
lower bound on Rmin.
Comparing the Forest Bound
(FB) to:
• History Bound (HistB)
• Optimal Haplotype Bound (OhapB): The currently best
lower bound that can be computed in practice for
biological data.
• Theorem: On any data, OhapB <= FB <= HistB
On some data, OhapB < FB < HistB
Thus the FB is the highest lower bound with a static
definition.
Computing the Forest Bound
is NP-Hard
• Optimal haplotype bound is quite good,
but NP-hard to compute.
• If the forest bound can be efficiently
computable, we do not need to use
optimal haplotype bound at all.
• Unfortunately, the forest bound is NPhard to compute.
• Reduction from Exact-cover-by-3 sets.
Integer Programming
Formulation for the Forest Bound
• For sequences with m sites, consider the hypercube all possible
2m sequences.
• Minimizing F is equivalent to reducing the number of Steiner
nodes in the forests.
• We also need to ensure the edge linking two nodes in a tree is
only labeled with columns that do not appear in other trees.
• Can easily incorporate the missing data in the input.
• The IP formulation has exponential size, but practical when the
number of columns is relatively small.
Empirical Results
• On random generated dataset with 15 rows and 7 columns, FB >
OhapB on 10% of the data. On more biological meaningful data
(generated with simulation program ms), however, OhapB= FB
more often.
• On dataset generated by ms with missing entries, FB is more
often outperforms an approximate optimal Rh bound:
– 30 rows and 7 columns and 30% missing entries: FB was strictly
larger in 8% of the data.
– When the level of missing entries is lower, the approx. OhapB
matches the FB more often.
Open Problem
Find a static definition of the history
bound, one that can be translated into an
objective function independent of any
algorithm; one that can be solved by ILP,
for example.
Papers and software are at:
wwwcsif.cs.ucdavis.edu/~gusfield
Thank you.

Document

Transcript Document

Directory