Combinatorial Approaches to Haplotype Inference

Download Report

Transcript Combinatorial Approaches to Haplotype Inference

Combinatorial Approaches to
Haplotype Inference
Dan Gusfield
CS, UC Davis
Three topics
• Super-charging Clark’s method of 1990 Two ideas: Min haps selection + Consensus
• Follow-on to Haplotyping By Perfect
Phylogeny - a more biological view, and
phase transitions, recombinations (?)
• Computing True Parsimony Solutions,
Clark/Parsimony hybrids - Integer
Programming
Topic I: Supercharging Clark’s
Method
S. Orzack, D. Gusfield, V. Stanton
Genetics, in revision
Generic Clark Method
Basic Step:
Given a known haplotype H (original homozygote or single-site
heterozygote, or previously inferred), and an unresolved
genotype G, if G can be explained by H and another vector H’,
then call H’ a known haplotype, available for additional inferrals.
example: H 0 1 0 0 1
G 21022
-----------------H’ 1 1 0 1 0
G is “resolved” by H and H’
In a single run, repeat the basic step until stuck - resolve as many
genotypes as possible in the data.
Clark (1990) Randomize choices, and do the computations many
times to find an execution (run) that explains the most genotypes.
Many variations of Clark
• Variations based on which parts are randomized.
• We closely examine eight variations on a real data
set. Variation 1 randomizes every decision probably more than Clark originally intended.
• Truth in advertising - we implemented our own
Clark versions - did not actually use Clark’s
software.
Super-charging Clark’s method
Why? The reported accuracy is significantly
less than that of PHASE and HAPLOTYPER.
Answers: 1) Scale
2) Deeper insight
Supercharging Clark’s Method
• How? Run many times (10,000) to get (generally)
a huge range of solutions, often 10,000 different
solutions. This is a good thing!
• Select the (few - in the 10’s) runs that use the
smallest number of distinct haplotypes over the
10,000 runs. Let S be that set of runs.
• Vote - for each individual find the haplotype pair
used most often in S. This gives the Consensus
solution.
Results on Lab Determined Data
• APOE locus
• 80 individuals, 9 polymorphic sites, 47
ambiguous after homozygotes and singlesite heterozygotes identified - lab
determined haplotypes
Variation 1
Average correct over all 10,000 runs: 29.0
24 runs that use 20 haplotypes (smallest observed)
or 21 distinct haplotypes
Average correct over those 24 runs: 36.1
0.9987 percentile of all the 10,000 runs
Correct in the Consensus solution from
those 24 runs: 39
which is equal to the best of the 10,000 executions
Variation 4
Average over all 10,000 runs: 18.3
Consensus number of correct: 42
Consensus Thresholds
The calls that are made with high frequency have high reliability.
We observed few or no false calls among the haplotype pairs that
were called with high frequency (85% up).
This identifies a subset of calls that can be believed with high
confidence, and a subset that has to be determined by other
means.
Bottom-up method to molecularly determine the haplotypes.
Comparison with Phase and
Haplotyper
• Phase consistently gets 42 correct. 1 false
call with confidence value > 0.9
• Haplotyper: high variation.
distribution of
accuracy
44
43
42
41
40
39
38
37
1
453
463
73
47
3
5
3
distribution of high distribution of calls that
confidence wrong don’t explain the
calls
genotype
0
1
2
3
4
39
1003
2
3
3
0
1
2
4
586
303
152
3
Topic 2: Haplotyping as Perfect
Phylogeny
The Perfect Phylogeny Model
We assume that the evolution of extant haplotypes can be displayed on a
rooted, directed tree, with the all-0 haplotype at the root, where each
site
changes from 0 to 1 on exactly one edge, and each extant haplotype is
created by accumulating the changes on a path from the root to a leaf,
where that haplotype is displayed.
In other words, the extant haplotypes evolved along a perfect phylogeny
with all-0 root.
The Perfect Phylogeny Model
sites 12345
Ancestral haplotype 00000
1
4
Site mutations on edges
3
00010
2
10100
5
10000
01010
01011
Extant haplotypes at the leaves
Justification for Perfect
Phylogeny Model
• In the absence of recombination each haplotype of
any individual has a single parent, so tracing back
the history of the haplotypes in a population gives
a tree.
• Recent strong evidence for long regions of DNA
with no recombination. Key to the NIH haplotype
mapping project. (See NYT October 30, 2002)
• Mutations are rare at selected sites, so are assumed
non-recurrent.
• Connection with coalescent models.
Solving the Haplotype Phylogeny
Problem (PPH) in nearly linear
time
Gusfield, RECOMB, April 2002
• Finds if there is a solution.
• Counts the number of solutions.
• Implicitly represents all the solutions
Program PPH
• Program PPH solves the perfect phylogeny
haplotyping problem using the graph
realization approach. It solves problems
with 50 sites and 100 individuals in about 1
second.
• Program PPH can be obtained at
www.cs.ucdavis.edu/~gusfield
An alternative solution
Recently, we developed an algorithm that is not based
on graph realization, and which is much easier to understand.
However, it runs in O(nm^2) time.
V. Bafna, D. Gusfield, G. Lancia, S. Yooseph
See Gusfield’s website.
The implicit representation: A
column partition
The algorithm partitions the sites (columns) into classes,
such that within any class, the phase relationship (in, or
out of phase)
of any two columns is INVARIANT over all perfect
phylogenies for the data.
Between any two classes, the phase relationship can be
set ARBITRARILLY. All solutions can be genetated
in this way.
That is the representation of the set of all solutions. The
number of solutions is always a power of two.
a
a
b
b
c
c
d
d
e
e
1
2
3
4
5
6
7
1
1
2
2
1
1
1
1
2
2
2
2
0
0
2
2
2
2
2
2
2
2
2
2
2
2
2
2
0
0
2
2
0
0
2
2
0
0
0
0
0
0
0
0
0
0
2
2
0
0
0
0
0
0
2
2
0
0
2
2
0
0
2
2
0
0
0
0
0
0
An example.
Each row starts
duplicated for
simplicity.
a
a
b
b
c
c
d
d
e
e
1
2
3
4
6
5
7
1
1
1
0
1
1
1
1
1
0
1
0
0
0
1
0
1
0
1
0
0
1
1
0
0
1
0
1
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
Starting from a PPH
Solution, if all
shaded cells in a
class switch value,
then the result is
also a PPH solution,
and any PPH
solution can be
obtained in this
way, i.e. by
choosing in each
block whether to
switch or not.
Secondary information and
optimization
• The partition shows explicitly what added information is useful and
what is redundant. Information about the relationship of a pair of
columns is redundant if and only if they are in the same class of the
column partition. Apply this successively as additional information is
obtained.
• Problem: Minimize the number of haplotype pairs (individuals) that
need be laboratory determined in order to find the correct tree.
• Minimize the number of (individual, site1, site2) triples whose phase
relationship needs to be determined, in order to find the correct tree.
The implicit representation of all solutions provides
a framework for solving these secondary problems,
as well as other problems involving the use of
additional information, and specific tree-selection
criteria.
A Phase-Transition
Problem, as the ratio of sites to genotypes changes,
how does the probability that the PPH solution is
unique change?
For greatest utility, we want genotype data where the
PPH solution is unique.
Intuitively, as the ratio of genotypes to sites increases,
the probability of uniqueness increases.
Frequency of a unique solution with 50
and 100 sites, 5% rule and 2500 datasets
per entry
# geno. Frequency of
unique solution
10
0.0018
20
0.0032
22
0.7646
40
0.7488
42
0.9611
70
0.994
130
0.999
140
1
10
20
22
40
42
60
100
110
0
0
0.78
0.725
0.971
0.983
0.999
1
Data with recombination
If the genotype data does not fit a perfect phylogeny,
we decompose the sites (greedily left to right) into
maximal intervals that do fit a unique perfect phylogeny.
Experimental results: number of intervals reduces to
about a tenth. The solutions are highly accurate in each
interval. Then we have a phase problem again between
the intervals.
The final haplotype determination can be made with one
site per interval, greatly reducing any lab effort.
Topic 3: The True Parsimony
Objective
For a set of genotypes, find a Smallest set H
of haplotypes, such that each genotype can
be explained by a pair of haplotypes in H.
No Approximations - find the true smallest, and know for
sure you have it !
Example of Parsimony
02120
00100
01110
22110
01110
10110
20120
00100
10110
3 distinct haplotypes
set S has size 3
Pure Parsimony is NP-hard
Earl Hubbel (Affymetrix) showed that Pure Parsimony
is NP-hard.
However, for a range of parameters of current interest
(50 sites and 50 genotypes) a True Parsimony
solution can be computed efficiently, using Integer
Linear Programming, and one speed-up trick.
For larger parameters (100 sites and 50 genotypes)
A near-parsimony solution can be found efficiently.
How Fast? How Good?
Depends on the level of recombination in the underlying
data. True Parsimony can be computed in seconds to
minutes for most cases with 50 genotypes and up to 60
sites, faster as the level of recombination increases.
As the level of recombination increases, the accuracy
of the True Parsimony Solution falls, but remains within
10% of the quality of PHASE (for comparison).
The APOE Data
There are 17 distinct haplotypes in the real data.
The IP finds a True Parsimony Solution with 15 distinct haplotypes.
PHASE and HAPLOTYPER each use 15 haplotypes.
Over the 10,000 executions, Clark variation 1 used
a minimum of 20 haplotypes.
Clark/Parsimony Hybrid
For low recombination, large (>60) sites
Find an execution of Clark’s method that
a) maximizes the number of genotypes resolved
b) minimizes the number of distinct haplotypes used
We can do this by mixing the Digraph View of Clark’s method
(Gusfield 2001) with the parsimony criteria, and truly find
an execution of Clark’s method that minimizes the number of
distinct haplotypes used.
On datasets where we can compute True Parsimony, this
hybrid does only a bit worse than True Parsimony.
Other uses of IP
On datasets where we know the solution, find the best
that a Clark method can ever do. IP can find the best
possible execution.
On the APOE data, Clark’s method can get all get 47 correct!
In fact in a huge number of ways. (But the best we found
by actually running Clark’s method was 42 correct).
This kind of test is not possible for Statistical methods.
The Conceptual Integer
Programming Formulation
For each genotype (individual) j, create one integer
programming variable Yij for each pair of haplotypes
whose merge creates genotype j. If j has k 2’s, then
This creates 2^(k-1) Y variables.
Create one integer programming variable Xq for
Each distinct haplotype q that appears in one of the
pairs for a Y variable.
Conceptual IP
For each genotype, create an equality that says that
exactly one of its Y variables must be set to 1.
For each variable Yij, whose two haplotypes are
given variables Xq and Xq’, include an inequality
that says that if variable Yij is set to 1, then both
variables Xq and Xq’ must be set to 1.
Then the objective function is to Minimize the
sum of the X variables.
Example
02120 Creates a Y variable Y1 for pair 00100 X1
01110 X2
and a Y variable Y2 for pair
01100 X3
00110 X4
Include the following (in)equalities into the IP
Y1 + Y2 = 1
Y1 - X1 <= 0
Y1 - X2 <= 0
Y2 - X3 <= 0
Y2 - X4 <= 0
The objective function will
include the subexpression
X1 + X2 + X3 + X4
But any X variable is included
exactly once no matter how many
Y variables it is associated with.
Efficiency Tricks
Ignore any Y variable and its two X variables if those X
variables are associated with no other Y variable. The
Resulting IP is much smaller, and can be used to find
the optimal to the conceptual IP.
Also, we need not enumerate all X pairs for a given
genotype, but can efficiently recognize the pairs we
need.
Avoiding Enumeration of
unneeded haplotypes
For each pair of genotypes, G1, G2 it is easy to find all the
haplotypes that appear in an explanation for G1 and in
an explanation for G2.
Example: 0 2 1 1 0 2 0 2
01112202
0 1 1 1 0 V 0 2 V and then generate all combinations
of 0,1’s over the V sites.
So the time is O(m x # haps in both explanation sets)
Recombination Helps!
As the level of recombination increases, the number of
haps in two explanation sets decreases, so the size of
the IP falls.
Full paper on Parsimony
See the technical report: Haplotyping by Pure Parsimony
available at
wwwcsif.cs.ucdavis.edu/~gusfield/
Thanks to Andy, Sorin and Mike.