Transcript Talk
Efficient Computation of
Minimum Recombination With
Genotypes (Not Haplotypes)
Yufeng Wu and Dan Gusfield
University of California, Davis
CSB 2006
1
Haplotypes/Genotypes
• Diploid organisms have two (not necessarily
identical) copies of each chromosome. A single
copy is a haplotype, a vector of 0s and 1s. The
mixed description of the two copies is a genotype,
a vector of 0s, 1s and 2s. At each site,
– If both haplotypes are 0, genotype is 0
– If both haplotypes are 1, genotype is 1
– If one is 0 and the other is 1, genotype is 2
• Key fact: easier to collect genotypes, but
many downstream applications work better
with haplotypes
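A minimal Python sketch of the encoding rule above. The function name `genotype_from_haplotypes` is illustrative and not from the talk; the two input vectors are the haplotype pair shown on the Haplotyping slide.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    genotype = []
    for a, b in zip(h1, h2):
        # Equal alleles keep their value (0 or 1); differing alleles become 2.
        genotype.append(a if a == b else 2)
    return genotype


h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(genotype_from_haplotypes(h1, h2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```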
2
Haplotyping
Sites:        1 2 3 4 5 6 7 8 9
Haplotype 1:  0 1 1 1 0 0 1 1 0
Haplotype 2:  1 1 0 1 0 0 1 0 0
Genotype:     2 1 2 1 0 0 1 2 0
Phasing the 2s (sites 1, 3 and 8)
Haplotype Inference (HI) Problem: given a set of n
genotypes, infer the n true haplotype pairs that gave rise
to the given genotypes
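To see why the HI problem is ambiguous, the sketch below enumerates every haplotype pair consistent with a single genotype. The function name `phasings` is hypothetical. A genotype with k >= 1 sites equal to 2 has 2^(k-1) unordered phasings, so the number of HI solutions grows quickly with the number of 2s.

```python
from itertools import product


def phasings(genotype):
    """Yield every unordered haplotype pair (h1, h2) consistent with a genotype."""
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    if not het_sites:
        hap = tuple(genotype)
        yield hap, hap
        return
    # Fix h1 = 0 at the first heterozygous site so each unordered pair appears once.
    for bits in product((0, 1), repeat=len(het_sites) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het_sites, (0,) + bits):
            h1[site], h2[site] = bit, 1 - bit
        yield tuple(h1), tuple(h2)


pairs = list(phasings([2, 1, 2, 1, 0, 0, 1, 2, 0]))
print(len(pairs))  # 4
```

One of the four pairs is the haplotype pair shown on the Haplotyping slide; the other three are equally valid explanations of the same genotype.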
3
Two-stage Approach
• Given a set of genotypes G, we are interested
in downstream problems
• Many HI solutions for G
• Two stage: first infer the “correct” HI solution
from the genotypes, then do the downstream
analysis with the inferred haplotypes
• Haplotype inference: extensively studied and
believed to be accurate to a certain extent
4
One-stage Approach
• What effect does haplotyping
inaccuracy have on downstream
questions?
• Our work: directly use genotype data for
downstream problems
– Without fixing a choice for the HI solution
– Minimum recombination problem
5
Recombination: Single
Crossover
• Recombination is one of the principal genetic
forces shaping variation within species
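• Two equal length sequences generate a third
equal length sequence
Parent 1:     110001111111001   (contributes the prefix 11000)
Parent 2:     000110000001111   (contributes the suffix 0000001111)
Recombinant:  11000|0000001111  (| marks the breakpoint)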
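A small sketch of the single-crossover operation, using the two parent sequences from this slide. The function name and the convention that `breakpoint` counts the sites taken from the first parent are assumptions made for illustration.

```python
def single_crossover(prefix_parent, suffix_parent, breakpoint):
    """Return the recombinant: a prefix of one parent joined to a suffix of the other."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    # The first `breakpoint` sites come from prefix_parent, the rest from suffix_parent.
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


child = single_crossover("110001111111001", "000110000001111", 5)
print(child)  # 110000000001111
```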
6
Kreitman’s Data (1983)
0000000011000000001101110111100000000000000
0010000000000000001101110111100000000000000
0000000000000000000000000000000000010000101
0000000000000000110000000000000000010011000
0001100010110011110000000000000000001000000
0010000000000001000000000000001010111000010
0010000000000001000000000000011111101000000
1111100010111001000000000000011111101100000
1111100010111001000000000000011111101100000
1111100010111001000000000000011111101100000
1111111110000101000010001000011111101000000
Question: what is the minimum number of
recombinations needed to derive these sequences?
Assume at most 1 mutation per site
7
Minimizing Recombination
• Compute the minimum number of
recombinations (Rmin) for deriving
a set of haplotypes, assuming at
most 1 mutation per site
– NP-hard in general
– Heuristics
– Lower bounds on Rmin
8
Lower Bounds on Genotypes
• For a particular recombination lower bound
method L, what is the range of possible
bounds for L over all possible HI solutions?
– MinL(G): minimum L over all HI solutions for G.
– MaxL(G): maximum L over all HI solutions for G.
• This paper: HK bound, connected component
bound and relaxed haplotype bound.
– Polynomial-time algorithms for MaxHK, MinCC.
– Heuristic method for relaxed haplotype bound.
9
Lower Bound: Incompatibility
Matrix M (sites 1 to 5, rows a to g):
   12345
a  00010
b  10010
c  00100
d  10100
e  01100
f  01101
g  00101
Incompatibility Graph (IG): a node for each site, an edge
between each incompatible pair of sites
(Figure: IG of M on nodes 1 2 3 4 5.)
• Two sites (columns) p, q are incompatible if columns
p, q contain all four ordered pairs (gametes): 00, 01,
10, 11
• Sites p, q are incompatible ⇒ a recombination must
occur between p and q
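The four-gamete test above can be sketched in a few lines of Python. The function names are illustrative; the matrix is M from this slide, and edges are reported as 1-based site pairs.

```python
from itertools import combinations

ALL_GAMETES = {(0, 0), (0, 1), (1, 0), (1, 1)}


def incompatible(haplotypes, p, q):
    """Four-gamete test on 0-based columns p and q."""
    return {(h[p], h[q]) for h in haplotypes} == ALL_GAMETES


def incompatibility_graph(haplotypes):
    """Return the IG edges as 1-based site pairs (p, q) with p < q."""
    m = len(haplotypes[0])
    return [(p + 1, q + 1) for p, q in combinations(range(m), 2)
            if incompatible(haplotypes, p, q)]


# Matrix M from the slide (rows a to g).
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
M = [[int(c) for c in row] for row in M]
print(incompatibility_graph(M))  # [(1, 3), (1, 4), (2, 5)]
```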
10
HK Bound (1985)
• Arrange the nodes of the
incompatibility graph on the
line in order that the sites
appear in the sequence.
• HK bound = maximum number
of non-overlapping edges in
incompatibility graph (IG).
• Easy to compute for haplotype
data.
(Figure: incompatibility graph drawn on the line with nodes 1 2 3 4 5.)
HK Lower Bound = 1
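One way to compute the HK bound from a list of IG edges is the standard greedy scan for interval scheduling; the slides only say the bound is easy to compute, so treat this as an assumed implementation. Edges are 1-based site pairs and two edges may share an endpoint without overlapping. The first edge list is the IG of matrix M from the Incompatibility slide under the four-gamete test.

```python
def hk_bound(edges):
    """Maximum number of non-overlapping edges; edges may share an endpoint."""
    count, last_right = 0, 0
    # Interval scheduling: scan edges by increasing right endpoint.
    for p, q in sorted(edges, key=lambda e: (e[1], e[0])):
        if p >= last_right:
            count += 1
            last_right = q
    return count


print(hk_bound([(1, 3), (1, 4), (2, 5)]))  # 1
print(hk_bound([(1, 2), (2, 3), (3, 5)]))  # 3
```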
11
IG for HI Solutions
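G:
01010
10101
00202
22200

HI1 (one haplotype pair per line), HK = 1:
01010  01010
10101  10101
00000  00101
01000  10100

HI2 (one haplotype pair per line), HK = 3:
01010  01010
10101  10101
00001  00100
00000  11100

(Figures: IG of each HI solution on nodes 1 2 3 4 5.)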
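For a genotype matrix this small, the range of HK bounds over all HI solutions can be checked by brute force. The sketch below enumerates every HI solution of the four genotypes 01010, 10101, 00202, 22200 and reports the smallest and largest HK bound. It is exponential in the number of 2s and is only a sanity check, not the polynomial-time algorithms discussed in the talk; all function names are hypothetical.

```python
from itertools import combinations, product

ALL_GAMETES = {(0, 0), (0, 1), (1, 0), (1, 1)}


def hk_bound(haplotypes):
    """HK bound of a haplotype matrix: max number of non-overlapping IG edges."""
    m = len(haplotypes[0])
    edges = [(p, q) for p, q in combinations(range(m), 2)
             if {(h[p], h[q]) for h in haplotypes} == ALL_GAMETES]
    count, last_right = 0, 0
    for p, q in sorted(edges, key=lambda e: (e[1], e[0])):
        if p >= last_right:  # edges may share an endpoint
            count, last_right = count + 1, q
    return count


def phasings(genotype):
    """All unordered haplotype pairs consistent with one genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        return [(tuple(genotype), tuple(genotype))]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, (0,) + bits):
            h1[site], h2[site] = bit, 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def hk_range(genotypes):
    """Brute-force (MinHK, MaxHK) by enumerating every HI solution."""
    bounds = []
    for solution in product(*(phasings(g) for g in genotypes)):
        haplotypes = [h for pair in solution for h in pair]
        bounds.append(hk_bound(haplotypes))
    return min(bounds), max(bounds)


G = [[0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 0, 2, 0, 2], [2, 2, 2, 0, 0]]
print(hk_range(G))  # (1, 3)
```

The two extremes agree with the HK = 1 and HK = 3 solutions shown on this slide.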
12
HK Bounds on Genotypes
• Known efficient algorithm for MinHK(G)
(Wiuf, 2004).
• This paper: polynomial-time algorithm
for MaxHK(G)
13
Maximal Incompatibility Graph
G
01010
10101
00202
22200
MIG(G): (figure on nodes 1 2 3 4 5)
E(G) = {12, 23, 35}
• MIG(G) has an edge between sites p and q if there is a
phasing of p, q that makes p and q incompatible
– Each pair of sites is considered independently
• E(G): a maximum-sized set of nonoverlapping edges in MIG(G)
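A sketch of how MIG(G) and E(G) might be computed for the genotype matrix on this slide. The pairwise test relies on one observation: a row with a 2 at only one of the two sites contributes the same gametes under every phasing, while a row with 2 at both sites contributes either {00, 11} or {01, 10}. Function names are hypothetical, and the greedy scan returns one maximum-sized set (there may be others).

```python
from itertools import combinations

ALL_GAMETES = {(0, 0), (0, 1), (1, 0), (1, 1)}


def can_be_incompatible(genotypes, p, q):
    """True if some phasing of 0-based sites p and q shows all four gametes."""
    forced = set()   # gametes present under every phasing
    double_het = 0   # rows with a 2 at both sites
    for g in genotypes:
        a, b = g[p], g[q]
        if a == 2 and b == 2:
            double_het += 1
            continue
        # A single 2 contributes both alleles at that site regardless of phasing.
        for x in ((0, 1) if a == 2 else (a,)):
            for y in ((0, 1) if b == 2 else (b,)):
                forced.add((x, y))
    if double_het >= 2:
        return True  # phase one such row as 00/11 and another as 01/10
    if double_het == 1:
        return ((forced | {(0, 0), (1, 1)}) == ALL_GAMETES
                or (forced | {(0, 1), (1, 0)}) == ALL_GAMETES)
    return forced == ALL_GAMETES


def mig_edges(genotypes):
    """Edges of MIG(G) as 1-based site pairs; each pair is tested independently."""
    m = len(genotypes[0])
    return [(p + 1, q + 1) for p, q in combinations(range(m), 2)
            if can_be_incompatible(genotypes, p, q)]


def max_nonoverlapping(edges):
    """One maximum-sized set of non-overlapping edges (greedy by right endpoint)."""
    chosen, last_right = [], 0
    for p, q in sorted(edges, key=lambda e: (e[1], e[0])):
        if p >= last_right:
            chosen.append((p, q))
            last_right = q
    return chosen


G = [[0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 0, 2, 0, 2], [2, 2, 2, 0, 0]]
edges = mig_edges(G)
print(edges)                      # [(1, 2), (1, 3), (1, 5), (2, 3), (3, 5)]
print(max_nonoverlapping(edges))  # [(1, 2), (2, 3), (3, 5)]
```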
14
MaxHK(G)
• Claim: MaxHK(G) = |E(G)|
• MaxHK(G) ≤ |E(G)|
– MIG(G): supergraph of IG(H) for any HI solution H
• If we can find an HI solution H in which every
pair of sites in E(G) is incompatible, then
HK(H) ≥ |E(G)|
• Together, MaxHK(G) = |E(G)|
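The two directions of the claim can be written out as a short derivation. The symbol H* for the specially constructed HI solution is notation introduced here, not from the slides.

```latex
\begin{align*}
\mathrm{IG}(H) \subseteq \mathrm{MIG}(G)\ \text{for every HI solution } H
  &\;\Longrightarrow\; \mathrm{HK}(H) \le |E(G)|\ \text{for every } H
  \;\Longrightarrow\; \mathrm{MaxHK}(G) \le |E(G)|, \\
\exists\, H^{*}\ \text{with every edge of } E(G) \text{ in } \mathrm{IG}(H^{*})
  &\;\Longrightarrow\; \mathrm{MaxHK}(G) \ge \mathrm{HK}(H^{*}) \ge |E(G)|.
\end{align*}
```

The first line holds because any set of non-overlapping edges in IG(H) is also a set of non-overlapping edges in MIG(G), and E(G) is a largest such set.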
15
Finding such an H
(Figure: MIG(G))
• Phase sites from left to right.
• Each component in E(G) is a simple path
• Each site is constrained by at most one site to its left
Phasing G for Incompatibility
Site 1 phased   Site 2 phased   Site 3 phased
01010           01010           01010
01010           01010           01010
10101           10101           10101
10101           10101           10101
00?0?           00?0?           0010?
00?0?           00?0?           0000?
0??00           00?00           00000
1??00           11?00           11100
(? = site not yet phased)
• No matter how a previous site p is phased, we can always
phase the current site q to make p, q incompatible
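A sketch of the left-to-right phasing, assuming the chosen edges are non-overlapping edges of MIG(G), such as E(G) = {12, 23, 35}. For each site q with a chosen edge (p, q), rows with a 2 at both sites are phased "parallel" (gametes 00 and 11) or "crossed" (gametes 01 and 10), whichever supplies more of the gametes still missing. The function name and this greedy rule are assumptions for illustration; the slides do not spell out the procedure at this level.

```python
ALL_GAMETES = {(0, 0), (0, 1), (1, 0), (1, 1)}


def phase_for_incompatibility(genotypes, chosen_edges):
    """Phase sites left to right so each chosen edge (p, q) becomes incompatible.

    chosen_edges: 1-based, non-overlapping edges of MIG(G).
    Returns one (h1, h2) haplotype pair per genotype.
    """
    m = len(genotypes[0])
    left_of = {q - 1: p - 1 for p, q in chosen_edges}  # site q -> constraining site p
    pairs = [([None] * m, [None] * m) for _ in genotypes]
    for q in range(m):
        p = left_of.get(q)
        present, undecided = set(), []
        for g, (h1, h2) in zip(genotypes, pairs):
            if g[q] == 2 and p is not None and g[p] == 2:
                undecided.append((h1, h2))  # phasing at q matters for edge (p, q)
                continue
            # Homozygous site, or a 2 whose orientation does not affect edge (p, q).
            h1[q], h2[q] = (0, 1) if g[q] == 2 else (g[q], g[q])
            if p is not None:
                present.update({(h1[p], h1[q]), (h2[p], h2[q])})
        for h1, h2 in undecided:
            missing = ALL_GAMETES - present
            parallel = (len(missing & {(0, 0), (1, 1)})
                        >= len(missing & {(0, 1), (1, 0)}))
            h1[q] = h1[p] if parallel else 1 - h1[p]
            h2[q] = 1 - h1[q]
            present.update({(h1[p], h1[q]), (h2[p], h2[q])})
    return [(tuple(h1), tuple(h2)) for h1, h2 in pairs]


G = [[0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 0, 2, 0, 2], [2, 2, 2, 0, 0]]
for h1, h2 in phase_for_incompatibility(G, [(1, 2), (2, 3), (3, 5)]):
    print("".join(map(str, h1)), "".join(map(str, h2)))
# 01010 01010
# 10101 10101
# 00001 00100
# 00000 11100
```

On this input the result is the HK = 3 solution shown on the IG for HI Solutions slide.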
Haplotyping With Minimum
Number of Recombinations
• Compute Rmin(G)
– Haplotyping on a network with fewest
recombinations
• NP-hard
• This paper: a branch and bound method
computing exact Rmin(G) for data with a small
number of sites
• APOE data: 47 non-trivial genotypes, 9 sites
– Our method: 2 minutes, Rmin(G) = 5
18
Application: Recombination
Hotspot
• Recombination hotspot: a region where the
recombination rate is much higher than in
neighboring regions
• Previous study (Bafna and Bansal, 2005): a
recombination lower bound with inferred
haplotypes was used to identify
recombination hotspots
• Our work: compute the exact Rmin(G) with
genotypes for a sliding window of a small
number of SNPs to detect recombination
hotspots
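The sliding-window scan can be sketched independently of how each window is scored. In the sketch below, `score` is a caller-supplied function standing in for an exact Rmin(G) solver, which is not reproduced here; the lambda in the usage example merely counts 2s to show the mechanics and is not a recombination measure. All names are hypothetical.

```python
def sliding_windows(genotypes, width, step=1):
    """Yield (first_site, window), where window keeps `width` consecutive columns."""
    m = len(genotypes[0])
    for start in range(0, m - width + 1, step):
        yield start + 1, [row[start:start + width] for row in genotypes]


def scan(genotypes, width, score, step=1):
    """Apply a caller-supplied score (e.g. an Rmin(G) solver) to every window."""
    return [(first_site, score(window))
            for first_site, window in sliding_windows(genotypes, width, step)]


G = [[0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 0, 2, 0, 2], [2, 2, 2, 0, 0]]
# Placeholder score: number of 2s in the window (NOT Rmin; for illustration only).
count_twos = lambda window: sum(row.count(2) for row in window)
print(scan(G, width=3, score=count_twos))  # [(1, 4), (2, 3), (3, 3)]
```

Windows whose score stands out from their neighbors would be the candidate hotspots once a real recombination score is plugged in.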
19
MS32 data (Jeffreys, et al.
2001)
(Figure, two panels: result from haplotypes (Bafna and Bansal, 2005); result from original genotypes (this paper).)
20
Other Applications
• Finding true Rmin from genotypes G
– Two stage approach: run PHASE to get an HI
solution H, and compute Rmin(H)
– One stage approach: directly compute
Rmin(G)
• Accuracy of haplotype inference on a
minimum network
• Simulation results: comparable, slightly
weaker and non-conclusive
21
Summary
• Main goal of this paper: develop
computational tools for the minimum
recombination problem with genotypes
– Polynomial-time algorithms for the MaxHK and MinCC
problems
– Practical heuristics for other problems
– Simulation results for several application questions
are not conclusive
– Our tools facilitate the study of these problems
22
Thank You
• Software: available upon request
23