Haplotyping algorithms and
structure of human variation
EECS 458
CWRU
Fall 2004
Readings: see papers on the
course website
Roadmap
• Definition: haplotype and haplotype inference
• Why infer haplotypes
• Infer haplotypes from pedigree data
– Most probable haplotype configurations
– Haplotype configurations with minimum recombinations
• Infer haplotypes from population data
– Combinatorial: Clark’s, Perfect Phylogeny
– Statistical methods: EM, Bayesian (MCMC)
• Infer haplotypes from pooled samples
• Haplotype block partition
• Tag SNP selection
Genotype and Haplotype
Paternal: . . . A A T G C C G C A A . . . G T C . . .
Maternal: . . . A G T G C C G C A A . . . T A C . . .
Genotypes shown at the marked sites: {1 2} {1 2}
Typical Genotype Data
Observation:
• Two alleles for each
individual
– Chromosome origin for
each allele is unknown
• Multiple haplotype
pairs can fit observed
genotype
• Molecular haplotyping
is expensive
Marker1: A / C
Marker2: G / A
Marker3: T / C
Possible haplotypes:
A G T  /  C A C
A G C  /  C A T
A A T  /  C G C
A A C  /  C G T
Haplotypes are important!
• Phase may determine phenotype
• Phase helps exploit linkage disequilibrium
– Infer state of neighboring alleles
• Phase clarifies identity-by-descent status
Common Uses of Haplotypes
• Linkage disequilibrium studies
– Summarize genetic variation
• Selecting markers to genotype
– Identify haplotype tag SNPs
• Candidate gene association studies
– Test haplotype associations
– Help interpret single marker associations
• Understanding evolution of human
populations
The problem…
• Haplotypes are hard to measure directly
– X-chromosome in males
– Sperm typing
– Other molecular techniques
• Often, statistical or combinatorial methods
for reconstruction required
Haplotype Inference on population
data
Genotype (m markers): {1 2} {1 1} {1 2} {1 2} {1 2} {2 2}
m=6 markers, m’=4 heterozygous markers
Haplotype configurations (2^m’ in total):
1|2 1|1 1|2 1|2 1|2 2|2
2|1 1|1 1|2 1|2 1|2 2|2
1|2 1|1 2|1 1|2 1|2 2|2
2|1 1|1 2|1 1|2 1|2 2|2
……
2|1 1|1 2|1 2|1 2|1 2|2
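The count of 2^m’ configurations can be checked with a short enumeration. This is only an illustrative sketch: the function name `enumerate_phasings` and the tuple encoding of genotypes are assumptions made here, not part of the slides.

```python
from itertools import product

def enumerate_phasings(genotype):
    """Yield every ordered (paternal, maternal) haplotype pair consistent
    with an unordered genotype, given as a list of 2-tuples of alleles."""
    # Only heterozygous markers can be phased in two different ways.
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    for flips in product((False, True), repeat=len(het)):
        pat = [a for a, _ in genotype]
        mat = [b for _, b in genotype]
        for idx, flip in zip(het, flips):
            if flip:
                pat[idx], mat[idx] = mat[idx], pat[idx]
        yield tuple(pat), tuple(mat)

# The slide's example: m = 6 markers, m' = 4 heterozygous.
genotype = [(1, 2), (1, 1), (1, 2), (1, 2), (1, 2), (2, 2)]
phasings = list(enumerate_phasings(genotype))
print(len(phasings))
```

Each pair corresponds to one row of the slide, e.g. paternal 1 1 1 1 1 2 with maternal 2 1 2 2 2 2 is the row 1|2 1|1 1|2 1|2 1|2 2|2.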
Information on Relatives
• Number of ambiguous individuals
increases rapidly with number of markers
• Family information can help, but many
ambiguities remain
Haplotype Inference on
Pedigrees, Mendelian Law
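[Figure: example pedigrees with unordered genotypes such as {1 2}, {1 1}, {2 2} and a partially missing genotype {1 *}, together with phased assignments such as 2|1 and 1|1 that are consistent with Mendelian law.]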
Haplotype inference on pooled
samples
• The input contains n pools
• Each pool contains k
individuals, thus 2k
haplotypes and m
markers
• At each marker, we are
given the number of
alleles for the k
individuals for each pool
• The goal is to find the
haplotype frequencies
• Example: n=3, k=2, m=5
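[Table: allele counts for each pool at each marker (3 pools, 5 markers); values in extraction order: 2 4 3 2 2, 0 2 3 1 2, 1 2 2 2 3]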
Haplotyping pedigree data:
statistical formulation
• Statistical formulation: find the most
probable haplotype configuration
• Need to calculate the probability of a
pedigree for every haplotype configuration
• Recall that for linkage analysis, we need to
calculate the probability of a pedigree, which
sums over all possible haplotype configurations
Haplotyping pedigree data:
statistical formulation
• Thus linkage programs like Genehunter,
Allegro, Merlin can compute the most probable
haplotypes
• But it is time consuming….
• In addition to exact computation, there are some
approximation algorithms, mainly based on
importance sampling, e.g. SimWalk.
• Still very time consuming; may consider many
configurations with very small probabilities
Recombination and combinatorial
formulation
[Figure: a pedigree with unordered genotypes ({1 2}, {1 1}) at several loci and alternative phasings (1|2, 2|1, 1|1), illustrating how a phase assignment implies recombination events.]
MRHC Problem
Find a minimum recombinant
haplotype configuration
from a given pedigree with
genotype data.
Assumptions:
• Mendelian law (no mutations);
• Recombination events are rare.
Well supported from real data.
[Figure: Input. A pedigree in which each member carries unphased genotypes at every locus, e.g. {1 2}{1 2}, {1 2}{2 2}, {1 1}{1 2}, {2 2}{1 2}, …]
MRHC Problem (cont’d)
• PS: parental source of the two
alleles at the locus (i.e. phase)
• GS: grandparental source of an
allele
• Haplotype configuration =
assignment of PS and GS values.
[Figure: Output. The same pedigree with every genotype phased (e.g. 1|1 1|2, 1|2 2|1, 2|1 2|2), annotated with example values PS=0, PS=1, GS2=1 and GS2=0 around the individuals labeled A and B.]
Previous Results
• Genotype elimination (O’Connell’00).
– For data requiring no recombinant, exhaustive elimination.
• Genetic algorithm (Tapadar et al.’00).
– Time consuming.
• MRH (Qian & Beckmann’02).
– Six step rule-based algorithm.
– Locus by locus at every step, extremely slow for biallelic (e.g.
SNP) markers.
Thm. MRHC is NP-Hard.
Idea: Reduction from a variant
of set cover.
First complexity result.
Remains hard for two loci.
Remains hard when no loops.
Li & Jiang’03, Doi, Li & Jiang’03
Block-Extension Algorithm
Iterative, heuristic, five steps. Rules are derived from
Mendelian law, MR principle, block concept and some
greedy ideas based on the following observations:
• Block structures are common in haplotypes.
• Double recombination events are rare.
• Common haplotype blocks are shared among siblings.
• …
Advantages/Disadvantages
Time complexity (BE: O(dmn) / MRH: O(2^d m^3 n^2))
Li & Jiang’03
Block-Extension Algorithm
[Figure: worked example of the block-extension algorithm on a six-member pedigree (members 1-6) with multiallelic genotypes such as 11, 12, 23, 34 and missing alleles marked *, shown over successive steps.]
Block-Extension Algorithm
[Figure: continuation of the six-member example; genotypes are progressively phased (e.g. 1|1, 2|3, 1|2) and annotated with value pairs such as 1|2(-1,0), 1|3(-1,1), 2|3(1,-1).]
Dynamic Programming Algorithms
• Locus-based dynamic programming algorithm
– Linear time in the number of the members
– Applicable to only tree pedigrees
• Member-based dynamic programming algorithm
– Linear time in the number of the loci
– Applicable to general pedigrees with small sizes
Doi, Li & Jiang’03
Locus-Based Dynamic
Programming
[Figure: a tree pedigree with members 1-8 and the corresponding rooted processing order used by the dynamic program.]
Constraint-Finding Algorithm
• Assumptions:
– No missing alleles, no errors.
– Zero recombinants.
• Idea: finding all feasible (i.e. 0-recombinant)
haplotype configurations is equivalent to
reducing the degree of freedom in PS/GS
assignment.
Li & Jiang’03
Four Levels of Constraints
Based on Mendelian law
(on a single locus):
– Level 1: GS constraint
– Level 2: PS constraint
Based on 0-recombinant
(for a pair of loci):
– Level 3: Haplotype constraint
– Level 4: Grouping constraint
[Figure: the phased pedigree from the MRHC output slide, with individuals A and B and the example values GS2=1 and PS=0.]
Level 3 and Level 4
Constraints
[Figure: nuclear families with members 1-6 whose genotypes are {1 2} or {1 1} at the loci shown, and the candidate phasings 12 / 21 of the children (members 4, 5, 6) used to derive Level 3 and Level 4 constraints.]
Level 3 and Level 4 Constraints
The variables represent PS values and the equations are over Z2
Analysis of Constraint-Finding
Algorithm
Thm. Every solution consistent with the constraint
equations is a feasible solution and vice versa.
• Steps:
– find all constraints, in the form of linear equations over Z2
– solve the equations by Gaussian elimination
– enumerate all feasible haplotype configurations
• Exact polynomial time: O(n^3 m^3) (genotype elimination: exponential)
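The "solve by Gaussian elimination, then enumerate" steps can be sketched as follows. This is a generic solver for linear equations over Z2, not the authors' implementation; the names `reduce_gf2` and `enumerate_solutions`, the bitmask encoding, and the toy equations are assumptions for illustration. Each variable stands for one PS value.

```python
from itertools import product

def reduce_gf2(equations):
    """Gauss-Jordan elimination over Z2.
    equations: list of (variable_indices, rhs_bit).
    Returns {pivot_var: (mask, rhs)} in reduced form, or None if inconsistent."""
    pivots = {}
    for variables, rhs in equations:
        mask = 0
        for v in variables:
            mask ^= 1 << v
        rhs &= 1
        # Remove every existing pivot variable from the new equation.
        for p, (pmask, prhs) in pivots.items():
            if (mask >> p) & 1:
                mask ^= pmask
                rhs ^= prhs
        if mask == 0:
            if rhs:
                return None          # 0 = 1: no feasible assignment
            continue                 # redundant equation
        p = mask.bit_length() - 1    # new pivot variable
        for q, (qmask, qrhs) in list(pivots.items()):
            if (qmask >> p) & 1:
                pivots[q] = (qmask ^ mask, qrhs ^ rhs)
        pivots[p] = (mask, rhs)
    return pivots

def enumerate_solutions(equations, n_vars):
    """List every 0/1 assignment satisfying all equations over Z2."""
    pivots = reduce_gf2(equations)
    if pivots is None:
        return []
    free = [v for v in range(n_vars) if v not in pivots]
    solutions = []
    for bits in product((0, 1), repeat=len(free)):
        x = [0] * n_vars
        for v, bit in zip(free, bits):
            x[v] = bit
        for p, (mask, rhs) in pivots.items():
            value = rhs
            for v in free:
                if (mask >> v) & 1:
                    value ^= x[v]
            x[p] = value
        solutions.append(tuple(x))
    return solutions

# Hypothetical constraints: x0 + x1 = 1, x1 + x2 = 0, x0 + x2 = 1; x3 is unconstrained.
equations = [([0, 1], 1), ([1, 2], 0), ([0, 2], 1)]
solutions = enumerate_solutions(equations, n_vars=4)
```

The number of solutions is 2 raised to the number of free variables, which is the "degree of freedom" the constraint-finding idea refers to.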
Integer Linear Programming
• Combines missing data imputation and haplotype inference.
• Regardless of the pedigree structure and the number of
recombinants, the number of variables is linear in the problem size.
• Implicitly checks the Mendelian consistency for pedigree
genotype data with missing alleles, which is also an NP-complete
problem.
• Could find all possible optimal solutions.
• Solved by a branch-and-bound algorithm.
• Effective for practical size problems in terms of time efficiency.
• Accurate in terms of missing alleles imputation and haplotype
inference.
Li & Jiang’04a
ILP for MRHC with Missing Data
1. Define variables.
2. Define linear constraints.
3. Define a linear objective function of the variables.
4. Preprocess constraints.
5. Apply branch-and-bound strategy to find solutions (a
partial order relationship and some other special
relationships).
6. Estimate bounds.
7. Apply a maximum likelihood approach to multiple
optimal solutions.
Formulation
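$M_j := \{m_k\}$: the set of all possible alleles at marker locus
$j$, and let $t_j = |M_j|$. Example: $M_1 = \{1, 2\}$, $M_2 = \{1, 2\}$.
[Figure: example pedigree with genotypes at two loci, where 0 denotes a missing allele. Individual 1: {1 2} {1 2}; individual 2: {1 0} {1 2}; individual 3: {1 1} {1 2}; individual 4: {1 2} {1 0}.]
Define $t_j$ f vars for each paternal allele and $t_j$ m vars
for each maternal allele at locus $j$ of individual $i$:
$f^j_{i,k},\ m^j_{i,k} \quad (1 \le k \le t_j)$
$f^j_{i,k} = 1$ iff the paternal allele is $m_k$
Individual 4:
$f^1_{4,1},\ f^1_{4,2},\ f^2_{4,1},\ f^2_{4,2},\ m^1_{4,1},\ m^1_{4,2},\ m^2_{4,1},\ m^2_{4,2}$
$f^1_{4,1} + f^1_{4,2} = 1$
…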
Formulation: Variables
• Define 2 g vars for each paternal allele and maternal allele
at locus $j$ for individual $i$: $g^j_{i,1},\ g^j_{i,2}$
• Var g1 = 0 (or 1) iff paternal allele is copied from father’s
paternal (or maternal) allele. Var g2 defined similarly.
• Define r vars:
$r^j_{i,1},\ r^j_{i,2} \quad (1 \le j \le m - 1)$
$r^j_{i,1} = 1$ iff $g^j_{i,1} \ne g^{j+1}_{i,1}$
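Formulation: Objective
Function
• Objective function: minimize
$\sum_{\text{non-founders } i} \ \sum_{j=1}^{m-1} \left( r^j_{i,1} + r^j_{i,2} \right)$
Subject to genotype constraints (observed genotype $\rightarrow$ constraints):
$\{0, 0\} \rightarrow \{ \sum_{k=1}^{t_j} f^j_{i,k} = 1,\ \sum_{k=1}^{t_j} m^j_{i,k} = 1 \}$
$\{m^j_r, 0\} \rightarrow \{ f^j_{i,r} + m^j_{i,r} \ge 1,\ \sum_{k=1}^{t_j} f^j_{i,k} = 1,\ \sum_{k=1}^{t_j} m^j_{i,k} = 1 \}$
$\{m^j_r, m^j_r\} \rightarrow \{ f^j_{i,r} = m^j_{i,r} = 1 \}$
$\{m^j_r, m^j_s\} \rightarrow \{ f^j_{i,r} + f^j_{i,s} = m^j_{i,r} + m^j_{i,s} = f^j_{i,r} + m^j_{i,r} = f^j_{i,s} + m^j_{i,s} = 1 \}$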
Formulation: Constraints
• Mendelian law of inheritance constraints (a child
$i$ and its father $f$):
$f^j_{i,k} - f^j_{f,k} - g^j_{i,1} \le 0$
$f^j_{i,k} - m^j_{f,k} + g^j_{i,1} \le 1$
• Constraints for r vars:
$r^j_{i,l} - g^j_{i,l} - g^{j+1}_{i,l} \le 0$
$r^j_{i,l} + g^j_{i,l} + g^{j+1}_{i,l} \le 2$
$-r^j_{i,l} + g^j_{i,l} - g^{j+1}_{i,l} \le 0$
$-r^j_{i,l} - g^j_{i,l} + g^{j+1}_{i,l} \le 0$
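The four r-variable constraints are meant to force r = 1 exactly when two adjacent g values differ. A brute-force check of one standard linearization of that condition is sketched below; the specific inequalities in `feasible_r` are an assumption (the usual XOR linearization for 0/1 variables), and the function name is hypothetical.

```python
from itertools import product

def feasible_r(g_a, g_b):
    """Return the values of r in {0, 1} allowed by the four linear
    inequalities, for fixed 0/1 values of two adjacent g variables."""
    return [r for r in (0, 1)
            if r - g_a - g_b <= 0       # r can be 1 only if some g is 1
            and r + g_a + g_b <= 2      # r must be 0 if both g are 1
            and -r + g_a - g_b <= 0     # r must be 1 if g_a = 1, g_b = 0
            and -r - g_a + g_b <= 0]    # r must be 1 if g_a = 0, g_b = 1

# For every combination, exactly one r is feasible: the XOR of the two g values.
table = {(g_a, g_b): feasible_r(g_a, g_b) for g_a, g_b in product((0, 1), repeat=2)}
```

Because the constraints pin r to a single value, minimizing the sum of r variables counts recombinants without any further integrality tricks on r itself.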
A Partial Order Relationship
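Denote:
$y^{\alpha} = y$ if $\alpha = 1$, and $y^{\alpha} = 1 - y$ if $\alpha = 0$
Inequalities with 2 variables:
$y_i^{\alpha} \le y_j^{\beta}$
[Figure: directed graphs on the variables (nodes 1-11, including a node 3’) depicting the partial order induced by the two-variable inequalities.]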
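Forced Variables
• Rule 1:
$y^{0},\ y^{1} \in S \Rightarrow$ Inconsistency
• Rule 2:
$(y_i^{\alpha} \le y_j^{\beta}) \wedge (y_i^{\alpha} \le y_j^{1-\beta}) \Rightarrow y_i^{\alpha} = 0$
$(y_i^{\alpha} \le y_j^{\beta}) \wedge (y_i^{1-\alpha} \le y_j^{\beta}) \Rightarrow y_j^{\beta} = 1$
• Rule 3:
$y_i^{\alpha} \le y_i^{1-\alpha} \Rightarrow y_i^{\alpha} = 0$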
Lower and Upper Bounds
• Lower bounds
– Linear relaxation.
– Summation of the number of recombinants in each
nuclear family.
– Effective for data with large number of
recombinants.
• Upper bound
– Obtained by block-extension algorithm.
– Effective for data with small number of
recombinants.
Statistical Assessment
• E-M algorithm to estimate haplotype
frequencies for data that consist of
multiple pedigrees.
$\hat{f}(h_i^1) = \frac{\#(h_i^1)}{2N}$
$P(G, H \mid \hat{f}) = \prod_{\text{founder } i} \hat{f}(h_i^1)\, \hat{f}(h_i^2) \ \prod_{\text{non-founder } i} P(h_i \mid h_{f(i)}, h_{m(i)})$
PedPhase software
• Simulated data were generated to compare our
algorithms, as well as MRH, in terms of efficiency
and accuracy.
• Three different pedigree structures.
• Multiallelic and biallelic data.
• Numbers of loci: 10, 25 and 50.
• Number of recombinants: 0-4.
• 100 runs per data set.
Pedigree Structures
Accuracy Results of BE
Algorithm
Efficiency Results
More Results from ILP
Real Data Analysis
• Data set (Gabriel et al.’02)
– 93 members, 12 pedigrees (each with 7-8 members);
– chromosome 3, 4 regions, each region 1-4 blocks.
[Table: Common Haplotypes & Frequencies]
Results From ILP on the Whole
Dataset
[Table residue: 3.82, 4.00, 0.45, 0.034]
What if there are no relatives?
• Rely on linkage disequilibrium
• Assume that population consists of small
number of distinct haplotypes
• Haplotypes tend to be similar
Clark’s Haplotyping Algorithm
• Clark (1990) Mol Biol Evol 7:111-122
• One of the first haplotyping algorithms
– Computationally efficient
– Very fast
• Today, more accurate alternatives are
often available
Clark’s Haplotyping Algorithm
• Find homozygous individuals
– Initialize a list of known haplotypes
• Resolve ambiguous individuals
– If possible, use two haplotypes from list
– Otherwise, use one known haplotype and augment
list
• If unphased individuals remain
– Assign phase randomly to one individual
– Augment haplotype list and continue from previous
step
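The steps above can be sketched in a few lines. This is a simplified illustration rather than Clark's published procedure: it assumes biallelic sites with the 0/1/2 genotype encoding used later in these slides (2 = heterozygous), seeds the list from fully homozygous individuals only, and processes individuals in input order; the names `complement` and `clark` are hypothetical.

```python
import random

def complement(genotype, hap):
    """Return the haplotype that pairs with hap to give genotype, or None."""
    comp = []
    for g, h in zip(genotype, hap):
        if g == 2:
            comp.append(1 - h)      # heterozygous site: the other allele
        elif g == h:
            comp.append(h)          # homozygous site must match
        else:
            return None
    return tuple(comp)

def clark(genotypes, seed=0):
    rng = random.Random(seed)
    known, phase = [], [None] * len(genotypes)
    # Step 1: homozygous individuals initialize the list of known haplotypes.
    for i, g in enumerate(genotypes):
        if 2 not in g:
            phase[i] = (tuple(g), tuple(g))
            if tuple(g) not in known:
                known.append(tuple(g))
    while any(p is None for p in phase):
        progress = False
        # Step 2: resolve ambiguous individuals with known haplotypes.
        for i, g in enumerate(genotypes):
            if phase[i] is not None:
                continue
            options = [(h, complement(g, h)) for h in known]
            options = [(h, c) for h, c in options if c is not None]
            if not options:
                continue
            both_known = [(h, c) for h, c in options if c in known]
            h, c = (both_known or options)[0]   # prefer two listed haplotypes
            phase[i] = (h, c)
            if c not in known:
                known.append(c)                 # augment the list
            progress = True
        # Step 3: if stuck, phase one remaining individual at random.
        if not progress:
            i = phase.index(None)
            h = tuple(rng.randint(0, 1) if g == 2 else g for g in genotypes[i])
            c = complement(genotypes[i], h)
            phase[i] = (h, c)
            for x in (h, c):
                if x not in known:
                    known.append(x)
    return phase, known

genotypes = [(0, 0, 0), (2, 2, 0), (1, 2, 2)]
phase, known = clark(genotypes)
```

In this toy input the second individual is resolved against 000, which adds 110 to the list, and that new haplotype then resolves the third individual. The result depends on processing order, which is one reason the slides note that more accurate alternatives exist.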
Haplotyping via Perfect
Phylogeny - Model,
Algorithms, Empirical studies
Dan Gusfield, Ren Hua Chung
U.C. Davis
Cocoon 2003
The Perfect Phylogeny Model of
Haplotype Evolution
[Figure: a rooted tree over sites 1-5 with the ancestral haplotype 00000 at the root. Site mutations 1-5 label the edges, and the extant haplotypes appear at the leaves. Haplotypes shown: 00010, 10100, 10000, 01010, 01011.]
The Perfect Phylogeny Model
We assume that the evolution of extant
haplotypes can be displayed on a rooted,
directed tree, with the all-0 haplotype at the
root, where each site
changes from 0 to 1 on exactly one edge,
and each extant haplotype is created by
accumulating the changes on a path from the
root to a leaf, where that haplotype is
displayed.
In other words, the extant haplotypes evolved
along a perfect phylogeny with all-0 root.
Perfect Phylogeny Haplotype (PPH)
Given a set of genotypes S, find an explaining set
of haplotypes that fits a perfect phylogeny.
A haplotype pair explains a
genotype if the merge of the
haplotypes creates the
genotype. Example: The
merge of 0 1 and 1 0 explains
2 2.
Genotype matrix S (rows a-c, sites 1-2):
     1  2
a    2  2
b    0  2
c    1  0
The PPH Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
Genotype matrix (sites 1-2):
     1  2
a    2  2
b    0  2
c    1  0
Explaining haplotypes:
     1  2
a    1  0
a    0  1
b    0  0
b    0  1
c    1  0
c    1  0
The Haplotype Phylogeny Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
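Genotype matrix (sites 1-2):
     1  2
a    2  2
b    0  2
c    1  0
Explaining haplotypes:
     1  2
a    1  0
a    0  1
b    0  0
b    0  1
c    1  0
c    1  0
[Tree: root 00, which is also the haplotype b 00. The edge for site 1 leads to 10 (a, c, c); the edge for site 2 leads to 01 (a, b).]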
The Alternative Explanation
Genotype matrix (sites 1-2):
     1  2
a    2  2
b    0  2
c    1  0
Alternative haplotypes:
     1  2
a    1  1
a    0  0
b    0  0
b    0  1
c    1  0
c    1  0
No tree possible for this explanation
Efficient Solutions to the PPH
problem - n genotypes, m sites
• Reduction to a graph realization problem
(GPPH): builds on the Bixby-Wagner or Fujishige
solution to graph realization, O(nm alpha(nm))
time.
• Reduction to graph realization: builds on Tutte’s
graph realization method, O(nm^2) time.
• Direct, from-scratch combinatorial approach: O(nm^2), Bafna et al.
• Berkeley (EHK) approach: specializes the Tutte
solution to the PPH problem, O(nm^2) time.
The DPPH Method
• Bafna et al. O(nm^2) time
• Based on deeper combinatorial
observations about the PPH problem.
• A matrix-centric approach (rather than
tree-centric), although a graph is used in
the algorithm.
First, we need to understand why some sets of haplotypes
have a perfect phylogeny, and some do not.
When does a set of haplotypes
fit a perfect phylogeny?
Arrange the haplotypes in a matrix, two
haplotypes for each individual. Then
(with no duplicate columns), the
haplotypes fit a unique perfect
phylogeny if and only if no two columns
contain all three pairs:
0,1 and 1,0 and 1,1
This is the 3-Gamete Test
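The test is easy to apply mechanically. The sketch below checks the two candidate explanations of the genotypes a = 2 2, b = 0 2, c = 1 0 used in these slides; the function names `merge` and `passes_three_gamete_test` are chosen here for illustration.

```python
from itertools import combinations

def merge(h1, h2):
    """Genotype produced by two haplotypes: equal alleles stay, unequal become 2."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))

def passes_three_gamete_test(haplotypes):
    """True if no pair of columns contains all of 0,1 and 1,0 and 1,1."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True

genotypes = {"a": (2, 2), "b": (0, 2), "c": (1, 0)}
tree_explanation = {"a": ((1, 0), (0, 1)), "b": ((0, 0), (0, 1)), "c": ((1, 0), (1, 0))}
alternative = {"a": ((1, 1), (0, 0)), "b": ((0, 0), (0, 1)), "c": ((1, 0), (1, 0))}

def all_haplotypes(explanation):
    return [h for pair in explanation.values() for h in pair]

tree_ok = passes_three_gamete_test(all_haplotypes(tree_explanation))
alternative_ok = passes_three_gamete_test(all_haplotypes(alternative))
```

Both explanations merge back to the same genotypes, but only the first passes the test; the second contains 1 1 (from a), 0 1 (from b) and 1 0 (from c) in the same pair of columns.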
The Alternative Explanation
Genotype matrix (sites 1-2):
     1  2
a    2  2
b    0  2
c    1  0
Alternative haplotypes:
     1  2
a    1  1
a    0  0
b    0  0
b    0  1
c    1  0
c    1  0
No tree possible for this explanation
The Tree Explanation Again
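Genotype matrix (sites 1-2):
     1  2
a    2  2
b    0  2
c    1  0
Explaining haplotypes:
     1  2
a    1  0
a    0  1
b    0  0
b    0  1
c    1  0
c    1  0
[Tree: root 00, which is also the haplotype b 00. The edge for site 1 leads to 10 (c, c, a); the edge for site 2 leads to 01 (a, b).]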
PPH: The Combinatorial
Problem
Input: A ternary matrix (0,1,2) M with 2N rows
partitioned into N pairs of rows, where the
two rows in each pair are identical.
Def: If a pair of rows (r,r’) in the partition have
entry values of 2 in a column j then positions
(r,j) and (r’,j) are called Mates.
Output: A binary matrix M’ created from M
by replacing each 2 in M with either 0 or 1,
such that
a) A position is assigned 0 if and only if its Mate
is assigned 1.
b) M’ passes the 3-Gamete Test, i.e., does
not contain a 3x2 submatrix (after row and
column permutations) with all three
combinations 0,1; 1,0; and 1,1
Initial Observations
If two columns of M contain the mates
2 0
2 0
then M’ will contain a row with 1 0 and a row with 0 0 in
those columns; symmetrically, the mates 0 2 and 0 2 yield a row with 0 1 and a row with 0 0.
This is a forced expansion.
Initial Observations
Similarly, if two columns of M contain the mates
2 1
2 1
then M’ will contain a row with 1 1 and a row with
0 1 in those columns.
This is a forced expansion.
If forced expansions of two columns create
0 1
1 0
in those columns, then any
2 2
2 2
in those columns must be set to be
0 1
1 0
We say that two columns are forced out-of-phase.
If a forced expansion of two columns
creates 1 1 in those columns, then any
2 2
2 2
in those columns must be set to be
1 1
0 0
We say that two columns are forced in-phase.
Example:
     1  2  3
a    1  2  2
a    1  2  2
b    2  0  2
b    2  0  2
c    1  2  2
c    1  2  2
d    1  2  2
d    1  2  2
e    2  2  0
e    2  2  0
Columns 1 and 2, and 1 and
3 are forced in-phase.
Columns 2 and 3 are forced
out-of-phase.
Forced expansions:
Columns 1, 2:   a 1 0   a 1 1   b 0 0   b 1 0
Columns 1, 3:   a 1 0   a 1 1   e 1 0   e 0 0
Columns 2, 3:   b 0 0   b 0 1   e 1 0   e 0 0
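The forced relationships in this example can be reproduced mechanically. The sketch below is an illustration under stated assumptions, not the DPPH code: each genotype row contributes the haplotype value pairs it forces in a pair of columns unless it is 2 2 there; a forced 1 1 makes the columns in-phase, and forced 0 1 together with 1 0 makes them out-of-phase. The function names and the one-row-per-individual matrix are choices made here.

```python
from itertools import combinations

def forced_pairs(g_i, g_j):
    """Haplotype value pairs forced in two columns by one genotype row."""
    if g_i == 2 and g_j == 2:
        return set()                       # phase is not determined by this row
    vals_i = (0, 1) if g_i == 2 else (g_i,)
    vals_j = (0, 1) if g_j == 2 else (g_j,)
    return {(a, b) for a in vals_i for b in vals_j}

def phase_relations(M):
    """Map each column pair (1-based) to its forced phase relationship."""
    relations = {}
    for i, j in combinations(range(len(M[0])), 2):
        forced = set()
        for row in M:
            forced |= forced_pairs(row[i], row[j])
        in_phase = (1, 1) in forced
        out_of_phase = (0, 1) in forced and (1, 0) in forced
        if in_phase and out_of_phase:
            label = "conflict"             # the 3-Gamete Test already fails
        elif in_phase:
            label = "in-phase"
        elif out_of_phase:
            label = "out-of-phase"
        else:
            label = "undetermined"
        relations[(i + 1, j + 1)] = label
    return relations

# One row per individual a-e (mates are identical, so duplicates are omitted).
M = [(1, 2, 2), (2, 0, 2), (1, 2, 2), (1, 2, 2), (2, 2, 0)]
relations = phase_relations(M)
```

For columns 2 and 3, for instance, individual b forces 0 0 and 0 1 while individual e forces 1 0 and 0 0, so 0 1 and 1 0 are both present and any 2 2 in those columns must be split out-of-phase.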
Overview of Bafna et al.
algorithm
First, represent the forced phase relationships, and
the needed decisions, in a graph G.
[Figure: Graph G on nodes 1-7.]
Each node represents
a column in M, and each
edge indicates that the
pair of columns has
a row with 2’s
in both columns.
The algorithm builds this
graph, and then checks
whether any pair of nodes
is forced in or out of phase.
[Figure: Graph Gc on nodes 1-7.]
Each Red edge indicates
that the columns are
forced in-phase.
Each Blue edge indicates
that the columns are
forced out-of-phase.
Let Gf be the subgraph of Gc
defined by the red and blue
edges.
[Figure: Graph Gf on nodes 1-7.]
Graph Gf has three
connected components.
The Central Theorem
There is a solution to the PPH problem for M if
and only if there is a coloring of the dashed edges of Gc
with the following property:
For any triangle (i,j,k) in Gc, where there is one row
containing 2’s in all three columns i,j and k
(any triangle containing at least one
dashed edge will be of this type), the coloring makes
either 0 or 2 of the edges blue (out-of-phase).
Nice, but how do we find such a coloring?
[Figure: Graph Gf on nodes 1-7.]
Triangle Rule
Theorem 1: If there are any
dashed edges whose ends are
in the same connected
component of Gf, at
least one edge is in a triangle
where the other edges are
not dashed, and in every PPH
solution, it must be colored
so that the triangle has an
even number of Blue (out-of-phase)
edges.
This is an “inferred” coloring.
[Figures: three successive views of the graph on nodes 1-7 as inferred colorings are applied.]
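The parity condition makes each inferred coloring a one-line computation. In the sketch below an edge color is encoded as a boolean (True = Blue, out-of-phase; False = Red, in-phase); this encoding and the function name are assumptions for illustration.

```python
def infer_third_edge(first_is_blue, second_is_blue):
    """Color forced on a dashed edge that closes a triangle whose other two
    edges are already colored: the triangle must hold 0 or 2 Blue edges."""
    # Exactly one Blue edge so far -> the third must be Blue; otherwise Red.
    return first_is_blue != second_is_blue

colorings = {
    (a, b): infer_third_edge(a, b)
    for a in (False, True)
    for b in (False, True)
}
```

Equivalently, with Red = 0 and Blue = 1 the three edge values of such a triangle sum to 0 over Z2, which is why inferred colorings propagate uniquely within a connected component of Gf.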
Corollary
Inside any connected component of Gf, ALL the phase
relationships on edges (columns of M) are uniquely
determined, either as forced relationships based on
pairwise column comparisons,
or by triangle-based inferred colorings.
Hence, the phase relationships of all the columns in
a connected component of Gf are INVARIANT over all
the solutions to the PPH problem.
Comparing the programs - R.H.
Chung
• All three are fast and practical (under one
second) on problem instances of size 50 x
30.
• DPPH is the fastest, followed by HPPH
and GPPH.
• HPPH encounters memory problems with
large input.
sites   individ   GPPH    DPPH     HPPH
30      50        0.65    0.0206   0.0215
300     150       9.3     3.0      4.49
500     250       36      11.5     21.5
2000    1000      2331    640      1866
Times shown are in seconds on an 800 MHz machine.
A Phase-Transition
Problem: as the ratio of sites to genotypes changes,
how does the probability that the PPH solution is
unique change?
For greatest utility, we want genotype data where the
PPH solution is unique.
Intuitively, as the ratio of genotypes to sites increases,
the probability of uniqueness increases.
Extension
• With recombination
• The papers: See
wwwcsif.cs.ucdavis.edu/~gusfield
The E-M Haplotyping Algorithm
• Excoffier and Slatkin (1995) Mol Biol Evol
12:921-927
• Provide a clear outline of how the
algorithm can be applied to genetic data
• Combination of two strategies
– E-M statistical algorithm for missing data
– Counting algorithm for allele frequencies
E-M Algorithm For Haplotyping
1. “Guesstimate” haplotype frequencies
2. Use current frequency estimates to
replace ambiguous genotypes with
fractional counts of phased genotypes
3. Estimate frequency of each haplotype by
counting
4. Repeat steps 2 and 3 until frequencies
are stable
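The four steps can be sketched for biallelic markers as follows. This is a minimal illustration, not the Excoffier and Slatkin implementation: it assumes the 0/1/2 genotype encoding (2 = heterozygous), a uniform starting guess, and Hardy-Weinberg proportions when weighting phased genotypes; the function names are hypothetical.

```python
from itertools import product
from collections import defaultdict

def compatible_pairs(genotype):
    """All unordered haplotype pairs whose merge gives the genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for idx, b in zip(het, bits):
            h1[idx], h2[idx] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, n_iter=100, tol=1e-9):
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pl in pair_lists for pair in pl for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}          # step 1: initial guess
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pl in pair_lists:
            # Step 2: fractional counts of each phased genotype.
            w = [(1 if h1 == h2 else 2) * freq[h1] * freq[h2] for h1, h2 in pl]
            total = sum(w)
            for (h1, h2), wi in zip(pl, w):
                counts[h1] += wi / total
                counts[h2] += wi / total
        # Step 3: re-estimate frequencies by counting.
        new = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        delta = max(abs(new[h] - freq[h]) for h in haps)
        freq = new
        if delta < tol:                                 # step 4: stop when stable
            break
    return freq

# Six unambiguous individuals and two double heterozygotes.
genotypes = [(0, 0)] * 3 + [(1, 1)] * 3 + [(2, 2)] * 2
freq = em_haplotype_frequencies(genotypes)
```

In this toy sample the unambiguous individuals pull the double heterozygotes toward the 00/11 phasing, so the estimated frequencies of 01 and 10 shrink toward zero. The number of compatible pairs doubles with each extra heterozygous marker, which is the cost issue raised on the next slide.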
E-M Algorithm for Haplotyping
• Cost grows rapidly with number of markers
• Typically appropriate for < 25 SNPs
– Fewer microsatellites
• More accurate than Clark’s method
• Fully or partially phased individuals
contribute most of the information
Enhancements to E-M
• List only haplotypes present in sample
– Gradually expand subset of markers under
consideration, eliminating haplotypes with low
estimated frequency from consideration at
each stage
• SNPHAP [Clayton (2001)]
• HAPLOTYPER [Qin et al. (2002)]
Divide-And-Conquer Approximation
• No. of potential haplotypes increases
exponentially
– Actual no. of haplotypes doesn’t
• Approximation
– Successively divide marker set
– Run E-M assuming segments associate randomly
– Proceed, ignoring composites of segments with zero
frequency
• Order: ~ m log m
• Exact E-M is order ~ 2^m
Other Recent Developments …
• Newer methods try to further improve
haplotype estimation by favoring sets of
similar haplotypes
• Stephens et al. (2001) Am J Hum Genet
68:978-89
• Genealogical approach, which implies
haplotypes are similar to each other…
Method based on Gibbs sampler
• MCMC method
– Stochastic, random procedure
– Improves solution gradually
• Given initial set of haplotypes
• Sample haplotypes for one individual at a
time, assuming other haplotypes are true
• Repeat a few million times…
Update Procedure I
• Pick individual U to update at random
• Calculate haplotype frequencies F in all
other individuals
– Since everyone is “phased”, this is done by
counting
• Sample new haplotypes for U from
conditional distribution of U’s haplotypes
given F
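One update of this kind can be sketched as below. It is a simplified illustration, not the Stephens et al. sampler: F is obtained by plain counting, and a small pseudocount (an assumption added here) keeps haplotypes unseen in the other individuals from getting zero probability. Genotypes use the 0/1/2 encoding, and the function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product

def compatible_pairs(genotype):
    """All unordered haplotype pairs whose merge gives the genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for idx, b in zip(het, bits):
            h1[idx], h2[idx] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def gibbs_update(genotypes, phases, rng, pseudocount=0.01):
    """Resample the haplotype pair of one randomly chosen individual U."""
    u = rng.randrange(len(genotypes))
    # F: haplotype counts in all other individuals (everyone is phased).
    counts = Counter(h for k, pair in enumerate(phases) if k != u for h in pair)
    pairs = compatible_pairs(genotypes[u])
    # Weight of each candidate pair is proportional to the product of frequencies.
    weights = [(counts[h1] + pseudocount) * (counts[h2] + pseudocount)
               for h1, h2 in pairs]
    phases[u] = rng.choices(pairs, weights=weights, k=1)[0]
    return u

genotypes = [(0, 0), (1, 1), (2, 2), (2, 2), (0, 2)]
phases = [compatible_pairs(g)[0] for g in genotypes]   # arbitrary initial phasing
rng = random.Random(0)
for _ in range(200):
    gibbs_update(genotypes, phases, rng)
```

Every sampled pair stays compatible with the individual's genotype by construction, so only the phase changes between sweeps. In practice the sampler runs far longer and the results are summarized over many sampled configurations.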
Update Procedure I
• This procedure would produce an estimate
of haplotype frequencies that is equivalent to
the E-M algorithm…
• Stephens et al (2001) suggested an
alternative estimate of F…
Update Procedure II
• Estimate F from the other individuals
• Construct F* to include haplotypes in F
and also other similar haplotypes (possibly differing at
a few sites, due to mutations)
• Update U’s haplotypes conditional on F*