Combinatorial methods in Bioinformatics: the haplotyping problem
Paola Bonizzoni
DISCo
Università di Milano-Bicocca
July 17, 2015

Content

• Motivation: biological terms
• Combinatorial methods in haplotyping
• Haplotyping via perfect phylogeny: the PPH problem
• Inference of incomplete perfect phylogeny: algorithms
• Incomplete pph and missing data
• Other models: open problems

Biological terms: diploid organism

Terms: genotype, haplotype, heterozygous site, homozygous site.
Biallelic site i: |Value(i) ⊆ {A,C,G,T}| ≤ 2

site    maternal    paternal
i       A           G           heterozygous
i+1     A           C           heterozygous
i+2     A           A           homozygous

Motivations

• Human genetic variations are related to diseases (cancers, diabetes, osteoporosis).
• The most common variation is the Single Nucleotide Polymorphism (SNP) on haplotypes in chromosomes.
• The Human Genome Project produces genotype sequences of humans.
• Computational methods to derive haplotypes from genotype data are in demand.
• Ongoing international HapMap project: find haplotype differences on large-scale population data.

Combinatorial methods: graphs, set-cover problems, optimization problems.

Haplotyping: the formal model

• Haplotype: m-vector h = <0, 1, …, 0> over {0,1}^m
• Genotype: m-sequence g = <{0,1}, …, {0,0}, …, {1,1}> over {0,1,*}, written g = <*, …, 0, …, 1> (* stands for {0,1}, 0 for {0,0}, 1 for {1,1})

Def. Haplotypes <h, k> solve genotype g iff, for each site i:
    h(i) ≠ k(i)           if g(i) = *
    h(i) = k(i) = g(i)    otherwise

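The definition can be checked mechanically. The following is a minimal sketch, assuming haplotypes and genotypes are stored as strings over the characters 0, 1 and *; the function name `solves` is chosen here for illustration only.

```python
def solves(h, k, g):
    """Return True if the haplotype pair <h, k> solves genotype g."""
    if not (len(h) == len(k) == len(g)):
        return False
    for hi, ki, gi in zip(h, k, g):
        if gi == "*":
            # Heterozygous site: the two haplotypes must differ.
            if hi == ki:
                return False
        elif not (hi == ki == gi):
            # Homozygous site: both haplotypes must carry the genotype value.
            return False
    return True


print(solves("0110011", "0011011", "0*1*011"))
```

The order of h and k does not matter, since both conditions are symmetric in the two haplotypes.
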
Examples

g = <0,*,1,*,0,1,1>
h = <0,1,1,0,0,1,1>
k = <0,0,1,1,0,1,1>
g is solved by <k, h>.

Clark inference rule

h1 = <0,0,1,1,0,1,1> is known.
g1 = <0,*,1,*,0,1,1> is solved by h1 and a new haplotype h2 = <0,1,1,0,0,1,1>.
g2 = <0,1,*,0,0,1,1> is solved by h2 and a new haplotype h3 = <0,1,0,0,0,1,1>.
g3 = <0,1,0,*,0,1,1> is the next genotype to solve (the slide also shows a variant g3 = <0,0,*,*,1,1,1>).

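One step of the rule, and its iteration, can be sketched as follows. This is an illustrative sketch rather than Clark's original procedure: the names `clark_complement` and `clark_inference` are hypothetical, sequences are assumed to be strings over 0, 1, *, and genotypes are processed greedily in the given order (the result may depend on that order).

```python
def clark_complement(h, g):
    """Return k such that <h, k> solves g, or None if h is not compatible with g."""
    k = []
    for hi, gi in zip(h, g):
        if gi == "*":
            k.append("1" if hi == "0" else "0")  # heterozygous: take the other allele
        elif hi == gi:
            k.append(gi)                          # homozygous: same allele
        else:
            return None                           # h cannot be one strand of g
    return "".join(k)


def clark_inference(known, genotypes):
    """Apply the rule until no genotype can be solved by a known haplotype.

    Returns (haplotypes, unresolved genotypes).
    """
    haplotypes = list(known)
    unresolved = list(genotypes)
    progress = True
    while progress and unresolved:
        progress = False
        for g in list(unresolved):
            for h in haplotypes:
                k = clark_complement(h, g)
                if k is not None:
                    if k not in haplotypes:
                        haplotypes.append(k)
                    unresolved.remove(g)
                    progress = True
                    break
    return haplotypes, unresolved


print(clark_complement("0011011", "0*1*011"))
```

A genotype that is compatible with no inferred haplotype stays in the unresolved list, which is how the rule can fail to solve every genotype.
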
Haplotype inference: the general problem

Problem HI:
Instance: a set G = {g1, …, gm} of genotypes and a set H = {h1, …, hn} of haplotypes.
Solution: a set H’ of haplotypes that solves each genotype g in G, s.t. H ⊆ H’.

H’ derives from an inference RULE.

Types of inference rules

• Clark’s rule: haplotypes solve g by an iterative rule
• Gusfield coalescent model: haplotypes are related to genotypes by a tree model
• Pedigree data: haplotypes are related to genotypes by a directed graph

HI by the perfect phylogeny model

IDEA: genotypes are the mating of haplotypes in a tree.
Given G, find H and T that explain G!

G (genotypes):
g1 = 0,1,*,*,1
g2 = *,0,0,0,1

H (haplotypes, in a tree T with root 00000):
0,1,0,1,1
0,1,1,0,1
0,0,0,0,1
1,0,0,0,1

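Perfect Phylogeny models

• Input data: 0-1 matrix A (columns: characters, rows: species)
• Output data: phylogeny for A

      c1  c2  c3  c4  c5
s1     1   1   0   0   0
s2     0   0   1   0   0
s3     1   1   0   0   1
s4     0   0   1   1   0

[Figure: tree rooted at R. The edge labelled c3 leads to a node with leaf s2 and an edge c4 to leaf s4 (path c3c4). The edge labelled c1, c2 leads to a node with leaf s1 and an edge c5 to leaf s3.]
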
Perfect phylogeny

Def. A pp T for a 0-1 matrix A:
• each row si labels exactly one leaf of T
• each column cj labels exactly one edge of T
• each internal edge is labelled by at least one column cj
• row si gives the 0-1 path from the root to si

[Figure: tree with edges labelled c1, c2; c3; c4; c5 and leaves s1, s2, s3, s4; the path c3c4 is marked with its 0-1 values.]

pp model: another view

L(x), the cluster of x: the set of leaves of T_x (the subtree of T rooted at node x).

[Figure: tree with an internal node x and leaves s2, s4, s1, s3.]

A pp is associated to a tree-family (S, C) with S = {s1, …, sn} and C = {S’ ⊆ S : S’ is a cluster}, s.t. for X, Y in C, if X ∩ Y ≠ ∅ then X ⊆ Y or Y ⊆ X.

pp: another view

A tree-family (S, C) is represented by a 0-1 matrix B:
• each column ci corresponds to a set S’ in C: sj ∈ S’ iff b_ji = 1
• for each set in C there is at least one column

Matrix shown on the slide (rows sj, columns ci):
0 1 0 0 0
0 0 1 0 0
1 1 0 0 1
0 0 1 1 0

Lemma. A 0-1 matrix has a pp iff it represents a tree-family.

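The tree-family condition (two clusters are either disjoint or nested) is easy to test directly. A minimal sketch, assuming the matrix is a list of 0-1 rows; the helper names `column_clusters` and `is_tree_family` are introduced here for illustration, and the quadratic pairwise check is chosen for clarity, not efficiency.

```python
from itertools import combinations


def column_clusters(matrix):
    """Cluster of column i: the set of row indices j with matrix[j][i] == 1."""
    n_cols = len(matrix[0]) if matrix else 0
    return [
        frozenset(j for j, row in enumerate(matrix) if row[i] == 1)
        for i in range(n_cols)
    ]


def is_tree_family(clusters):
    """True if every two clusters are disjoint or one contains the other."""
    for x, y in combinations(clusters, 2):
        if x & y and not (x <= y or y <= x):
            return False
    return True


# The species/characters matrix s1..s4 x c1..c5 from the perfect phylogeny slides.
A = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(is_tree_family(column_clusters(A)))
```
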
Haplotyping by the pp

A 0-1 matrix B represents the phylogenetic tree for a set H of haplotypes:
• row si corresponds to a haplotype
• column ci corresponds to a SNP site
• the 0-1 switch in position i occurs only once in the tree!!

[Figure: tree rooted at 00000 with haplotypes 00000, 01000, 01001, 11000.]

Haplotyping and the pp: observations

• The root of T may not be the haplotype 000000
• 0-1 switch or 1-0 switch (directed case)

[Figure: example trees with 0-1 and 1-0 switches; node labels include 00000, 00011, 01000, 11000, 01100, 01001, 01010, 11001, 11010.]

HI problem in the pp model

• Input data: a 0-1-* matrix B (n × m) of genotypes G
• Output data: a 0-1 matrix B’ (2n × m) of haplotypes s.t.
  (1) each g ∈ G is solved by a pair of rows <h, k> in B’
  (2) B’ has a pp (tree family)

This is a DECISION problem.

[Figure: genotype rows 01*1*001*, 001*11*11, 0000*1*1* and an unknown ("???") haplotype matrix; sample haplotype row 0,1,0,1,1.]

An example

Genotypes (B):
a    * *
b    0 *
c    1 0

Haplotypes (B’):
a    1 0
a’   0 1
b    0 1
b’   0 0
c    1 0
c’   1 0

[Figure: tree with leaves b’, a, c, c’, a’, b.]

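For instances this small the problem can be solved by exhaustive search. The sketch below is not one of the algorithms cited in these slides: it enumerates every way to resolve the * entries and accepts the first haplotype matrix that passes the four-gamete test (no pair of columns shows all of 00, 01, 10, 11), which is the standard condition for an undirected perfect phylogeny. It takes exponential time, and the names `expansions`, `four_gamete_ok` and `pph_brute_force` are hypothetical.

```python
from itertools import combinations, product


def expansions(g):
    """All unordered haplotype pairs <h, k> that solve genotype g (a string over 0, 1, *)."""
    stars = [i for i, c in enumerate(g) if c == "*"]
    if not stars:
        return [(g, g)]
    pairs = []
    # h is fixed to 0 at the first * so that <h, k> and <k, h> are not both listed.
    for bits in product("01", repeat=len(stars) - 1):
        h, k = list(g), list(g)
        for pos, b in zip(stars, ("0",) + bits):
            h[pos] = b
            k[pos] = "1" if b == "0" else "0"
        pairs.append(("".join(h), "".join(k)))
    return pairs


def four_gamete_ok(rows):
    """True if no pair of columns contains all four combinations 00, 01, 10, 11."""
    if not rows:
        return True
    for a, b in combinations(range(len(rows[0])), 2):
        if len({(r[a], r[b]) for r in rows}) == 4:
            return False
    return True


def pph_brute_force(genotypes):
    """Return a list of 2n haplotypes (two per genotype, in order) or None."""
    for choice in product(*(expansions(g) for g in genotypes)):
        rows = [hap for pair in choice for hap in pair]
        if four_gamete_ok(rows):
            return rows
    return None


print(pph_brute_force(["**", "0*", "10"]))
```

On the genotypes a, b, c of the example, the search rejects the resolution of a into 00 and 11 and returns the haplotypes 01 and 10 for a, 00 and 01 for b, and 10 twice for c, the same set as in the haplotype matrix of the slide.
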
The pph problem: solutions

• An undirected algorithm: Gusfield, Recomb 2002
• An O(nm^2) algorithm: Karp et al., Recomb 2003
• A linear-time O(nm) algorithm ?? (optimal algorithm)

A related problem: the incomplete directed pp (IDP)
• Inferring a pp from a 0-1-* matrix
• O(nm + k log^2(n + m)) algorithm: I. Pe’er, T. Pupko, R. Shamir, R. Sharan, SIAM 2004

IDP problem

Instance: a 0-1-? matrix A
Solution: resolve each ? into 0 or 1 to obtain a matrix A’ and a pp for A’, or say “no pp exists”

[Figure: a 0-1-? matrix with columns 1-5, its completion, and the resulting tree with edges C1, C5, C2, C4, C3 and leaves S1, S2, S3.]

OPEN PROBLEM: find an optimal algorithm ??

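To make the problem statement concrete, here is an exhaustive sketch that tries every completion of the ? entries. It is exponential in the number of ? entries and is not the algorithm of Pe’er et al.; it assumes the directed case with an all-zero root, where a completed matrix has a pp iff no pair of columns contains the three rows 10, 11, 01. The function names and the small input matrix are made up for illustration.

```python
from itertools import combinations, product

FORBIDDEN = {("1", "0"), ("1", "1"), ("0", "1")}


def has_directed_pp(rows):
    """Directed case (all-zero root): no column pair shows 10, 11 and 01 together."""
    n_cols = len(rows[0]) if rows else 0
    for a, b in combinations(range(n_cols), 2):
        if FORBIDDEN <= {(r[a], r[b]) for r in rows}:
            return False
    return True


def idp_brute_force(rows):
    """rows: strings over 0, 1, ?. Return a completed matrix with a pp, or None."""
    holes = [(i, j) for i, r in enumerate(rows) for j, c in enumerate(r) if c == "?"]
    for bits in product("01", repeat=len(holes)):
        filled = [list(r) for r in rows]
        for (i, j), b in zip(holes, bits):
            filled[i][j] = b
        if has_directed_pp(filled):
            return ["".join(r) for r in filled]
    return None


print(idp_brute_force(["1?", "11", "?1"]))
```
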
Decision algorithms for incomplete pp

Based on:
• Characterization of a 0-1 matrix A that has a pp
• Bipartite graph G(A) = (S, C, E) with E = {(si, cj) : b_ij = 1}
• Tree family: a forbidden subgraph and a forbidden submatrix give a "no" certificate

Forbidden submatrix on two columns c, c’:
      c   c’
s1    1   0
s2    1   1
s3    0   1

[Figure: two overlapping sets X and Y with regions labelled 10, 11, 01, and 00 outside both.]

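A "no" certificate of this kind can be extracted with a direct scan over column pairs. A minimal sketch under the directed (all-zero root) assumption, with the matrix given as a list of 0-1 rows; the function name `forbidden_submatrix` and the returned tuple layout are choices made here, not part of the cited algorithms.

```python
from itertools import combinations


def forbidden_submatrix(matrix):
    """Return (c, c2, (row_10, row_11, row_01)) for the first forbidden pair, or None."""
    n_cols = len(matrix[0]) if matrix else 0
    for c, c2 in combinations(range(n_cols), 2):
        witness = {}
        for i, row in enumerate(matrix):
            # Remember the first row showing each value pair in columns c, c2.
            witness.setdefault((row[c], row[c2]), i)
        if all(p in witness for p in [(1, 0), (1, 1), (0, 1)]):
            return c, c2, (witness[(1, 0)], witness[(1, 1)], witness[(0, 1)])
    return None


print(forbidden_submatrix([[1, 0], [1, 1], [0, 1]]))
```

Returning the two columns and three rows, rather than a plain yes/no answer, lets a caller verify the negative answer independently.
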
Test: does a 0-1 matrix A have a pp?
O(nm) algorithm (Gusfield 1991)

Steps:
1. Given A, order the columns {c1, …, cm} as decreasing binary numbers, obtaining A’.
2. For each (i, j) with A’[i,j] = 1, let L(i,j) = max{l < j : A’[i,l] = 1}, with L(i,j) = 0 if there is no such l.
3. Let index(j) = max{L(i,j) : A’[i,j] = 1}.
4. Then apply the theorem.

TH. A’ has a pp iff L(i,j) = index(j) for each (i,j) s.t. A’[i,j] = 1.

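The four steps translate almost line by line into code. The sketch below makes a few assumptions that the slide does not state: duplicate columns are merged before sorting, column indices are 1-based with 0 meaning "no earlier 1 in this row", and Python's built-in sort replaces the radix sort needed for the O(nm) bound. The function name `has_pp` is hypothetical.

```python
def has_pp(matrix):
    """Test a 0-1 matrix (list of rows) with the L(i,j) / index(j) criterion."""
    if not matrix:
        return True
    n = len(matrix)
    # Step 1: columns as tuples, duplicates merged, sorted as decreasing binary
    # numbers (first row = most significant bit; tuples compare lexicographically).
    columns = {tuple(row[j] for row in matrix) for j in range(len(matrix[0]))}
    ordered = sorted(columns, reverse=True)
    m = len(ordered)
    # Steps 2 and 3: L[i][j] and index[j], with 1-based column positions.
    L = [[None] * (m + 1) for _ in range(n)]
    index = [0] * (m + 1)
    for i in range(n):
        last = 0  # position of the previous 1 in row i, 0 if none
        for j in range(1, m + 1):
            if ordered[j - 1][i] == 1:
                L[i][j] = last
                index[j] = max(index[j], last)
                last = j
    # Step 4: the theorem.
    return all(
        L[i][j] == index[j]
        for i in range(n)
        for j in range(1, m + 1)
        if ordered[j - 1][i] == 1
    )


A = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(has_pp(A))
```

On the forbidden submatrix with rows 10, 11, 01 the sorted columns give index(2) = 1 while the third row has L(3,2) = 0, so the test fails as expected.
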
Idea:

The IDP algorithm

[Figure: columns c and C’ with rows s1, s2, s3.]

Other HI problems via the pp model

Incomplete 0-1-*-? matrix because of missing data:
• haplotypes pp (Ihpp): haplotype rows
• genotype pp (Igpp): genotype rows

Algorithms:
• Ihpp = IDP given a row as a root (polynomial time); NP-complete otherwise
• Igpp has a polynomial solution under the rich data hypothesis (Karp et al., Recomb 2004, ICALP 2004); NP-complete otherwise

HI problem and other models

Haplotype inference in pedigree data under the recombination model

[Figure: maternal and paternal haplotypes shown as 0-1 columns, and the child haplotypes obtained from them by recombination.]

Pedigree graph

[Figure: pedigree graphs. Labels: father, mother, child; nuclear family; single-mating pedigree tree; pedigree graph with a mating loop.]

Haplotype inference in pedigree

[Figure: pedigree whose members are labelled with two-site genotypes (00, 10, 01, 11) and with phased haplotype pairs written paternal|maternal, e.g. 0|1, 1|1, 1|0.]

Problems:

• MPT-MRHI (pedigree tree, multi-mating, minimum recombination HI)
• SPT-MRHI (pedigree tree, single-mating, minimum recombination HI)

NP-complete even if the graph is acyclic, but with an unbounded number of children…
OPEN

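The quantity being minimized can be illustrated on a single parent-child pair. The sketch below is a strong simplification of MRHI: it assumes the parent's two haplotypes and the haplotype passed to the child are already known, and counts the fewest switches between the two parental haplotypes that reproduce the child's haplotype. In the actual problems the haplotypes are unknown and the count is taken over the whole pedigree. The function name is hypothetical.

```python
def min_recombinations(parent_a, parent_b, child):
    """Fewest switches between parent_a and parent_b that spell out child.

    Returns None if some site of child matches neither parental haplotype.
    """
    INF = float("inf")
    cost_a = cost_b = 0  # best cost so far when currently copying from a / from b
    for a, b, c in zip(parent_a, parent_b, child):
        new_a = min(cost_a, cost_b + 1) if a == c else INF
        new_b = min(cost_b, cost_a + 1) if b == c else INF
        cost_a, cost_b = new_a, new_b
    best = min(cost_a, cost_b)
    return None if best == INF else best


print(min_recombinations("0100", "0011", "0111"))
```
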
Conclusions
References