Scalable Algorithms for Analysis of Genomic Diversity Data Bogdan Paşaniuc Department of Computer Science & Engineering University of Connecticut.
Download ReportTranscript Scalable Algorithms for Analysis of Genomic Diversity Data Bogdan Paşaniuc Department of Computer Science & Engineering University of Connecticut.
Scalable Algorithms for Analysis of Genomic Diversity Data
Bogdan Paşaniuc
Department of Computer Science & Engineering University of Connecticut
Single Nucleotide Polymorphisms
Main form of variation between individual genomes:
single nucleotide polymorphisms (SNPs)
… ataggtcc
C
tatttcgcgc
C
gtatacacggg
A
ctata … … ataggtcc
G
tatttcgcgc
C
gtatacacggg
T
ctata … … ataggtcc
C
tatttcgcgc
C
gtatacacggg
T
ctata … High density in the human genome: 3 10 9 base pairs Vast majority bi-allelic 0/1 encoding 1x10 7 out of
Haplotypes and Genotypes
Haplotype:
description of SNP alleles on a chromosome 0/1 vector: 0 for major allele, 1 for minor
Diploids:
two homologous copies of each autosomal chromosome One inherited from mother and one from father
Genotype:
description of alleles on both chromosomes 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles 0 2 1 2 00 2 10 0 1 1 0 00 1 10 0 0 1 1 00 0 10 + genotype two haplotypes per individual
Introduction
Haplotype data function exact DNA sequence Haplotypes increased power of association Directly determining haplotype data is expensive and time consuming Cost effective high-throughput technologies to determine genotype data Need for computational methods for inferring haplotypes from genotype data: genotype phasing problem
Outline
Background on genomic diversity
The genotype phasing problem
Hidden Markov Model of Haplotype Diversity Genotype Imputation DNA Barcoding Conclusions
Genotype Phasing
For a genotype with k 2’s there are 2 k-1 pairs of haplotypes explaining it possible h1:0010 1 1 1 h2:0010 0 1 0 g: 0010 2 1 2 ?
h3:0010 0 1 1 h4:0010 1 1 0 Computational approaches to genotype phasing Statistical methods: PHASE, Phamily, PL, GERBIL … Combinatorial methods: Parsimony, HAP, 2SNP,
ENT
…
Minimum Entropy Genotype Phasing
Phasing – function f that assigns to each genotype g a pair of haplotypes (h,h’) that explains g Coverage of h in f – number of times h appears in the image of f Entropy of a phasing:
Entropy
(
f
)
h
: cov(
h
,
f
) 0 cov(
h
, 2 |
G
|
f
) log( cov(
h
, 2 |
G
|
f
) )
Minimum Entropy Genotype Phasing [Halperin&Karp 04]:
Given a set of genotypes, find a phasing with minimum entropy
ENT Algorithm
Initialization
Start with random phasing
Iterative improvement step
While there exists a genotype whose re-phasing decreases the entropy, find the genotype that yields the highest decrease in entropy and re-phase it Min Entropy Objective is uninformative for long genotypes each haplotype compatible with 1 genotype all haplotypes have coverage of 1 entropy of all phasings = -log(1/2G)
Overlapping Window approach
Entropy is computed over short windows of size l+f l “locked” SNPs previously phased f “free” SNPs are currently phased … 4 3 2 1 … g 1 … g n locked free Only phasings consistent with the l locked SNPs are considered
Effect of Window Size
Experimental setup (1)
International HapMap Project, Phase II datasets 3.7 million SNP loci 3 populations: CEU, YRI: 30 trios JPT+CHB: 90 unrelated individuals Reference haplotypes obtained using PHASE Accuracy
Relative Genotype Error (RGE):
method percentage of missing genotypes inferred differently as reference
Relative Switching Error (RSE):
into the reference haplotype pairs number of switches needed to convert inferred haplotype pairs
Experimental setup (2)
Compared algorithms
ENT 2SNP
[Brinza&Zelikovsky 05]
Pure Parsimony Trio Phasing (ILP)
al. 05]
PHASE
[Stephens et al 01]
HAP
[Halperin&Eskin 04]
FastPhase
[Scheet&Stephens 06] [Brinza et
Results on HapMap Phase II Panels
Averages over the 22 chromosomes Runtime: ENT few hours PHASE months of CPU time on cluster of 238 nodes
Results on [Orzack et al 03] dataset
[Orzack et al. 03] 80 unrelated genotypes over 9 SNPs Haplotypes determined experimentally Ranking of algorithms remains the same Slight underestimation of true error rate
Effect of pedigree information
Outline
Background on genomic diversity The genotype phasing problem
Hidden Markov Model of Haplotype Diversity
Genotype Imputation DNA Barcoding Conclusions
Founder Haplotypes
Haplotypes in the current population arose from small number of founder haplotypes by mutation and recombination events Obtained using HaploVisual www.cs.helsinki.fi/u/prastas/haplovisual/
HMM Model
Similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06] Models the ancestral haplotype population Paths with high transition probability haplotypes “founder” Transitions from one founder to other founder recombination events Emissions mutation events
HMM Training
Previous works use EM training of HMM based on unrelated genotype data 1.
2-step procedure: Infer haplotypes using ENT Uses all available pedigree information 2.
Baum-Welch training on inferred haplotypes Maximizes the likelihood of the haplotypes
Maximum Probability Genotype Phasing
Phase G as pair (h 1 ,h 2 ) = argmax P(h Maximum phasing probability: 1 )P(h 2 )
P
(
G
) MAX
P
(
h
1 )
P
(
h
2 ) How hard is to compute maximum phasing probability in the HMM?
Conjectured to be NP-hard [Rastas et al 07]
Theorem
Cannot approximate P(G) within O(n 1/2 ), unless ZPP=NP, where n is the number of SNP loci
Complexity of Computing Maximum Phasing Probability
Reduction from Max Clique Transitions 1,1/2 Initial transition 2 deg(v)+1(2) / α All haps prob 1/ α
Complexity of Computing Maximum Phasing Probability
H representing clique of size k will be emitted along k paths P(H) = k/α By construction H’ (complement of H) can be emitted along second block G = 22…22 P(G)=max(P(H)) 2 G has a clique of size k or more iff P(G) ≥ (k/α) 2 Maximum probability genotype phasing is NP hard
Heuristic Decoding Algorithms
Viterbi Decoding Maximum probability of emitting a haplotype pair that explain G along two HMM paths Efficiently computed using Viterbi’s algorithm Posterior Decoding For each SNP choose the states that are most likely at that locus given the genotype G Find most likely emissions at each SNP to explain G Efficiently computed using forward and backward algorithm Sampling from the HMM posterior distribution generate pairs of haplotypes that explain G conditional on the haplotype distribution represented by the HMM Combine the sample into a single phasing
Greedy Likelihood Decoding
Uses forward values computed by forward algorithm f h (i,q) = the total probability of emitting the first i alleles of the haplotype h and ending up at state q at level i. P(H|M)= ∑ f h (n,q) Constructs (h, h’) with (x,y) at SNP i, s.t. the first i, is maximized probability of the phasing up to locus i, given the already determined phasing for the P(
h
[ 1 ...
i
1 ]
x
|
h
[ 1 ...
i
1 ] ) P(
h
' [ 1 ...
i
1 ]
y
|
h
' [ 1 ...
i
1 ] )
q
Q i f h
(
i
,
q
)
q
Q i f h
' (
i
,
q
) 2 variants: left-to-right or right-to-left
Combined Greedy Likelihood decoding
h h’ Left to right phasing Right to left phasing h h’ SNP i Combined phasing P(Comb. phasing at SNP i) = ∑ f h (i,q) b h (i,q) x ∑ f h’ (i,q) b h’ (i,q) SNP i that gives best improvement found in O(Kn) time given forward and backward values for the 4 haplotypes
Tweaking a Phasing by Local Switching
SNP i New phasing obtained by switching at SNP i P (new phasing) = ∑ f h (i,q) b h (i,q) x ∑ f h’ (i,q) b h’ (i,q) SNP i that gives best improvement found in O(Kn) time given forward and backward values for the 2 haplotypes Iterative 1-OPT procedure While there exists a SNP that improves the likelihood of the phasing obtained by switching at that SNP, find the SNP that yields the highest increase and perform switching
Experimental Setup
ADHD dataset Chromosome X genotype data from the Genetic Association Information Network (GAIN) study on Attention Deficit Hyperactivity Disorder (ADHD) 958 parent-child trios from the International Multi-site ADHD Genetics (IMAGE) project Phased the children as unrelated on a 50 SNP window
Decoding alg.
Viterbi 11.814
Posterior HMM Sampling 26.736
15.323
Greedy left to right 12.154
Greedy right to left 13.283
Greedy combined 11.838
Random phasing 50.559
Tweaking
11.571
11.940
11.826
11.693
12.057
11.510
14.764
Method
ENT fastPHASE 13.513
12.035
PHASE v 2.1 10.393
2SNP 14.497
BEAGLE r=1 11.862
BEAGLE r=4 10.442
Tweaking
11.705
11.231
11.219
11.729
11.705
11.304
Outline
Background on genomic diversity The genotype phasing problem Hidden Markov Model of Haplotype Diversity
Genotype Imputation
DNA Barcoding Conclusions
Genome-wide case-control association studies
Preferred method for finding the genetic basis of human diseases 1.
2.
Large number of markers (SNPs) typed in cases and controls Statistical test of association locus disease-correlated Disease causal SNPs unlikely to be typed directly Limited coverage of current genotyping platforms Vast number of SNPs present across the human genome
Genotype Imputation
Imputation of genotypes at un-typed SNP loci Powerful technique for increasing the power of association studies Typed markers in conjunction with catalogs of SNP variation (e.g. HapMap) predictors for SNP not present on the array Challenge: Optimally combining the multi-locus information from current + multi-locus variation from HapMap
HMM Based Genotype Imputation
1.
• Integrate the HapMap variation information into the HMM Train HMM using the haplotypes from the panel related to the studied population (e.g. CEU panel: Utah residents with ancestry from northern and western Europe) 2.
• Compute probabilities of missing genotypes given the typed genotype data
P
(
g i
x
|
G
,
M
)
P
(
G
,
g i
x
|
P
(
G
|
M
)
M
) g i is imputed as x, where
x
argmax
x
{ 0 , 1 , 2 }
P
(
G
,
g i
x
|
M
)
Related Problems
Missing Data Recovery Fill in the genotypes uncalled by the genotyping algorithm Genotype Error Detection and Correction If g i is present, then the increase in likelihood obtained by replacing g i with x is:
P
(
G g i
x
|
M
)
P
(
G
|
M
)
Likelihood Computation
P(G|M) = probability with which M emits any two haplotypes that explain G along any pair of paths.
Computed in O(nK 3 ) by a 2-path extension of the forward algorithm followed by a factor K speed-up [Rastas07]
f
(
j
;
q
1 ,
q
2 )
E
(
j
;
q
1 ,
q
2 ) (
q
' 1 ,
q
' 2 )
Q f j
2 1 (
j
1 ;
q
1 ' ,
q
2 ' ) (
q
1 ' ,
q
1 ) (
q
2 ' ,
q
2 )
Experimental Setup
WTCCC Dataset Genotype data of the 1958 birth cohort from the The Welcome Trust Consortium genome-association study 1,444 individuals from this cohort were typed using both the Affymetrix 500k platform and a custom Illumina 15k platform Affymetrix data + CEU HapMap haplotypes used to impute genotypes at the SNP loci present of the Illumina chip and not on the Affymetrix chip The actual Illumina genotypes were then used to estimate the imputation accuracy
Results
Estimates of the allele 0 frequencies based on Imputation vs. Illumina 15k
Results
Accuracy and missing data rate for imputed genotypes for different thresholds. Dashed line = missing data rate Solid line = discordance rate
Effect of Errors and Missing Data
Added additional 1% genotyping errors and 1% missing genotypes
EDC+MDR+IMP MDR+IMP IMP Error Detection TP Rate(%) FP Rate(%)
69.46
0.20
-
Error Correction Missing Data Recovery Imputation Accuracy(%) Error Rate(%) Error Rate(%)
97.16
7.62
7.72
6.49
6.63
6.64
TP Rate = correctly flagged errors out of total errors inserted FP Rate = incorrectly flagged genotype out of total correct genotypes Error Correction Accuracy = correctly recovered out of flagged ones
Outline
Background on genomic diversity The genotype phasing problem Hidden Markov Model of Haplotype Diversity Genotype Imputation
DNA Barcoding
Conclusions
DNA barcoding
Recently(2003) proposed by taxonomists as a tool for rapid species identification Use short DNA region as “fingerprint” for species Region of choice: cytochrome c oxidase subunit 1 mitochondrial gene ("COI", 648 base pairs long).
Key assumption:
Existence of “barcoding gap” Inter-species variability >> than intra-species variability
BOLD: The Barcode of Life Data Systems [Ratnasingham&Hebert07]
http://www.barcodinglife.org
Currently: 38,539 species, 388,582 barcodes
DNA barcoding challenges
Efficient algorithms for species identification Millions of species Meaningful confidence measures BOLD identification system showed to have unclear confidence measures [Ekrem et al.07]: New species discovery Sample size optimization #barcodes per species required Barcode length Barcode quality Number of regions required
Species identification problem
Given repository containing barcodes from known species and a new barcode find its species Several methods proposed for assigning specimens to species TaxI (Steinke et al.05), Likelihood ratio test (Matz&Nielsen06), BOLD-IDS(Ratnasingham&Hebert 07)…
No direct comparisons on standardized benchmarks
This work: Direct comparison of methods from three main classes Distance-based, tree-based, and statistical model-based Explore the effect of repository size #barcodes/species, #species
Methods
Distance-based Hamming distance, Aminoacid Similarity, Convex Score similarity, Tri-nucleotide frequency distance, Combined method Tree-based Exemplar NJ [Meyer&Paulay05] Profile NJ [Muller et al 04]
Phylogenetic transversal
Statistical model-based Likelihood ratio test [Matz&Nielsen06] PWMs
Inhomogeneous Markov Chains
Inhomogeneous Markov Chain (IMC)
A
t
1 A
t
2 A
t
3 A start C C C C … T T T T G G G G locus 1 locus 2 locus 3 locus 4 Takes into account dependencies between
consecutive
loci
n
1
P
(
x
|
M
) (
x
1 )
i
1
t
i
(
x
i
,
x
i
1 )
Comparison of representative methods
MIN-HD IMC Phylo ACG Birds Bats Guyana Fish Australia Cowries
98.81% 97.59% 100.00% 99.30% 88.80% 95.27% 97.23% 93.29% 92.33% 100.00% 98.55% 99.58% 99.30% 89.83% 81.00% Leave one out experiment Hesperidia of the ACG 1 [Hajibabaei M. et al, 05]: 4267 barcodes, 561 species Birds of North America [Kerr K.C.R. et al, 07]: 2589 barcodes, 656 species Bats of Guyana [Clare E.L. et al, 06]: 840 barcodes, 96 species Fishes of Australia Container Part [Ward et. al, 05]: 754 barcodes, 211 species Cowries [Meyer and Paulay, 05]: 2036 barcodes, 263 species
1 0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0 0
Accuracy vs Species size
150 0.4
0.3
0.2
0.1
0 1 0.9
0.8
0.7
0.6
0.5
0 50 MIN-HD 100 1 0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0 0 50 IMC 100 50 Phylo 100 150 150
Accuracy vs. #Species
0.5
0.4
0.3
0.2
0.1
0.0
1.0
0.9
0.8
0.7
0.6
300 600 900 1200 1500 MIN-HD IMC Phylo*
Conclusions
Highly scalable method for genotype phasing Several orders of magnitude faster than current methods Phasing accuracy close to the best methods Exploits all pedigree information available HMM model of haplotype diversity Hardness result for genotype phasing Improved decoding algorithms for phasing Imputation of genotypes at un-typed SNPs DNA-barcoding Introduced new methods for species identification Comprehensive comparison to existing methods
Acknowledgments
Prof. Ion Mandoiu Profs. Sanguthevar Rajasekaran, Alex Russell Sasha Gusev (Entropy phasing, DNA barcoding) Justin Kennedy (HMM Imputation and Error detection) James Lindsay, Sotiris Kentros (DNA barcoding)
References
Genotype phasing: B. Pasaniuc and I.I. Mandoiu. Highly scalable genotype phasing by entropy minimization.
In Proc. 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society
, pages 3482-3486, 2006.
A. Gusev, I.I. Mandoiu, and B. Pasaniuc. Highly scalable genotype phasing by entropy minimization.
IEEE/ACM Trans. on Computational Biology and Bioinformatics 5, pp. 252 261, 2008
.
HMM model, genotype imputation and error detection: J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype error detection using hidden Markov models of haplotype diversity.
In Proc. 7th Workshop on Algorithms in Bioinformatics(WABI07) LNBI, pp 73-84,
2007.
J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. GEDI: Genotype Error Detection and Imputation using Hidden Markov Models of Haplotype Diversity.
(in preparation)
.
DNA-barcoding: B. Pasaniuc, S. Kentros and I.I. Mandoiu. DNA Barcode Data Analysis: Boosting Assignment Accuracy by Combining Distance- and Character-Based Classifiers
Biodiversity Data Workshop
, 2006.
, The DNA Barcode Data Analysis Initiative (DBDAI): Developing Tools for a New Generation of
B. Pasaniuc, S. Kentros and I.I. Mandoiu. Model-based species identification using DNA barcodes,
39th Symposium on the Interface: Computing Science and Statistics
, 2007. B. Pasaniuc, A. Gusev, S. Kentros, J. Lindsay and I.I. Mandoiu. A Comparison of Algorithms for Species Identification Based on DNA Barcodes.
2nd International Barcode of Life Conference
, Academia Sinica, Taipei, Taiwan, Sept. 17-21, 2007