Genotype and Haplotype Reconstruction from LowCoverage Short Sequencing Reads Ion Mandoiu Computer Science and Engineering Department University of Connecticut Joint work with S.
Download ReportTranscript Genotype and Haplotype Reconstruction from LowCoverage Short Sequencing Reads Ion Mandoiu Computer Science and Engineering Department University of Connecticut Joint work with S.
Genotype and Haplotype Reconstruction from LowCoverage Short Sequencing Reads Ion Mandoiu Computer Science and Engineering Department University of Connecticut Joint work with S. Dinakar, J. Duitama, Y. Hernández, J. Kennedy, and Y. Wu Outline Introduction Single SNP Genotype Calling Multilocus Genotyping Problem Experimental Results Conclusion Ultra-high throughput DNA sequencing Recent massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to classic Sanger sequencing Roche/454 FLX Titanium 400bp reads 400-600Mb/10h run ABI SOLiD 3 35-50bp reads 5-7.5Gb/3.5-7 day run Illumina Genome Analyzer II 35-75bp reads 2-3Gb/2 day run Helicos HeliScope 25-55bp reads ~2.5Gb/day UHTS enables personal genomics $100,000,000 C.Venter [email protected] $10,000,000 Cost $1,000,000 $100,000 J. Watson [email protected] NA18507 Illumina@36x SOLiD@12x $10,000 $1,000 $100 days weeks months Sequencing Time years Challenges for medical applications of sequencing Sequencing can potentially provide all genetic variations (SNPs, CNVs, genome rearrangements) at single-base resolution… However, medical use requires determination of both alleles (genotype) at variable loci Accurate genotype calling is limited by coverage depth due to random nature of shotgun sequencing For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), comparison with SNP genotyping chips has shown only ~75% accuracy for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08] Allele coverage for heterozygous SNPs (Watson 454 @ 5.85x avg. coverage) 6 5 Variant allele coverage 4 3 2 1 0 -1 -1 0 1 2 3 Reference allele coverage 4 5 6 Allele coverage for heterozygous SNPs (Watson 454 @ 2.93x avg. coverage) 6 5 Variant allele coverage 4 3 2 1 0 -1 -1 0 1 2 3 Reference allele coverage 4 5 6 Allele coverage for heterozygous SNPs (Watson 454 @ 1.46x avg. coverage) 6 5 Variant allele coverage 4 3 2 1 0 -1 -1 0 1 2 3 Reference allele coverage 4 5 6 Prior work Most prior genotype calling methods are based on allele coverage [Levy et al 07] and [Wheeler et al 08] require that each allele be covered by at least 2 reads in order to be called Combined with hypothesis testing based on the binomial distribution when calling hets Binomial probability for the observed number of alleles must be at least 0.01 [Wendl&Wilson 08] generalize coverage methods to allow an arbitrary minimum allele coverage k Prior work (contd.) MAQ [Li,Ruan&Durbin 08] Widely used read mapping program Single SNP genotype calling incorporating read mapping confidence and quality scores Mostly tuned for de novo SNP discovery… What coverage is required? [Wendl&Wilson 08] estimate that 21x coverage will be required for sequencing of normal tissue samples based on idealized theory that “neglects any heuristic inputs” Do heuristic inputs help? We propose methods incorporating additional sources of information extracted from a reference panel such as Hapmap: Allele/genotype frequencies Linkage disequilibrium Experimental results show significantly improved genotyping accuracy Outline Introduction Single SNP Genotype Calling Multilocus Genotyping Problem Experimental Results Conclusion Basic assumptions Known SNP positions Biallelic SNPs 0 = major allele, 1 = minor allele SNP genotypes: 0/2 = homozygous major/minor, 1=heterozygous Incorporating base call and read mapping uncertainty ri = set of mapped reads covering SNP locus i For each read r in ri r(i) = the allele observed at locus i qr ( i ) /10 r (i ) 10 = probability that r(i) is incorrect, where qr(i) is the phred quality score of r(i) mr = mapping confidence of r 012100120 Mapped reads with allele 0 Inferred genotypes Mapped reads with allele 1 Sequencing errors Incorporating base call and read mapping uncertainty ri = set of mapped reads covering SNP locus i For each read r in ri r(i) = the allele observed at locus i qr ( i ) /10 r (i ) 10 = probability that r(i) is incorrect, where qr(i) is the phred quality score of r(i) mr = mapping confidence of r P(ri | Gi 0) (1 r (i ) ) rri r ( i ) 0 mr rri r ( i ) 1 mr r (i ) P(ri | Gi 2) 1 r ri P (ri | Gi 1) 2 (1 r (i ) ) rri r ( i ) 1 mr mr m r rri r ( i ) 0 r (i ) Single SNP genotype calling Applying Bayes’ formula: P( gi ) P(ri | Gi gi ) P(Gi gi | ri ) g{0,1,2} P(Gi gi )P(ri | Gi g) Where P(Gi gi ) are genotype frequencies inferred from a representative panel Outline Introduction Single SNP Genotype Calling Multilocus Genotyping Problem Experimental Results Conclusion Haplotype structure in human populations HMM model of haplotype frequencies Similar models proposed in [Schwartz 04, Rastas et al. 05, Kennedy et al. 07, Kimmel&Shamir 05, Scheet&Stephens 06] Graphical Model Representation F2 H1 H2 … Fn Hn Random variables F1 Fi = founder haplotype at locus i Hi = observed allele at locus i For fully specified model and given haplotype h, P(H=h|M) can be computed in O(nK2) using forward algorithm, where n=#SNPs, K=#founders HF-HMM for multilocus genotype inference F1 F2 H1 H2 F'1 F'2 H'1 H'2 … Fn Hn … F'n H'n P(f1), P(f’1), P(fi+1|fi), G1 P(f’i+1|f’i), P(hi|fi), G2 P(h’i|f’i) trained using Gn Baum-Welch algorithm on haplotypes inferred from the populations of origin for mother/father R1,1 … R1,c1 R2,1 … R2,c2 Rn,1 … Rn,cn HF-HMM for multilocus genotype inference F1 F2 H1 H2 F'1 F'2 H'1 H'2 G1 R1,1 … … Fn Hn … F'n H'n G2 Gn P(gi|hi,h’i) set to 1 if h+h’i=gi and to 0 R2,1 … R2,c2 Rn,1 … Rn,cn otherwise R1,c1 HF-HMM for multilocus genotype inference F1 F2 H1 H2 F'1 F'2 … Fn Hn … F'n P(Ri, j r | Gi gi ) definedas in single SNP case H'1 H'2 G1 R1,1 … H'n G2 R1,c1 R2,1 … Gn R2,c2 Rn,1 … Rn,cn Multilocus genotyping problem GIVEN: • Shotgun read sets r=(r1, r2, … , rn) • Trained HMM models representing LD in populations of origin for mother/father • Quality scores & read mapping confidence values FIND: • Multilocus genotype g*=(g*1,g*2,…,g*n) with maximum posterior probability, i.e., g*=argmaxg P(g | r) Computational complexity Theorem: maxgP(g | r) cannot be approximated within O(n1 ) unless ZPP=NP Idea: reduction from the clique problem Posterior decoding algorithm 1. For each i = 1..n, compute gi * argmaxgi P( gi | r) argmaxgi P( gi , r) 2. Return g* ( g1*,...,g n *) Forward-backward computation P( gi , r ) P(ri | gi ) f K … i 1 i i i f 1 f , f f , f f , f ( gi ) K ' ' i i fi i ' i ' i i i … hi … f’i … h’i gi r1,1 … r1,c 1 ri,1 … ri,c i Rn,1 … Rn,cn Forward-backward computation P( gi , r ) P(ri | gi ) f K … i 1 i i i f 1 f , f f , f f , f ( gi ) K ' ' i i fi i ' i ' i i i … hi … f’i … h’i gi r1,1 … r1,c 1 ri,1 … ri,c i Rn,1 … Rn,cn Forward-backward computation P( gi , r ) P(ri | gi ) f K … i 1 i i i f 1 f , f f , f f , f ( gi ) K ' ' i i fi i ' i ' i i i … hi … f’i … h’i gi r1,1 … r1,c 1 ri,1 … ri,c i Rn,1 … Rn,cn Forward-backward computation P( gi , r ) P(ri | gi ) f K … i 1 i i i f 1 f , f f , f f , f ( gi ) K ' ' i i fi i ' i ' i i i … hi … f’i … h’i gi r1,1 … r1,c 1 ri,1 … ri,c i Rn,1 … Rn,cn Forward-backward computation P( gi , r ) P(ri | gi ) f K … i 1 i i i f 1 f , f f , f f , f ( gi ) K ' ' i i fi i ' i ' i i i … hi … f’i … h’i gi r1,1 … r1,c 1 ri,1 … ri,c i Rn,1 … Rn,cn Runtime Direct recurrences for computing forward probabilities: 1f , f P ( f1 ) P ( f1' ) ' i i i f i , f i' K K f i1 1 f i'1 1 i 1 f i1 , f i'1 P( f i | f i 1 )P( f i | f ) ' ' i 1 i 1 f i1 , f i'1 ( gi 1 ) Runtime reduced to O(nK3) by reusing common terms: if , f ' i i where i f i1 , f i' K f i'1 1 ' ' i 1 ' ' i 1 P ( f | f ) ( gi 1 ) f ,f i i 1 f ,f f i'1 1 i i 1 , f i K if , f ( gi ) i P( f i | f i 1 ) if i 1 ' i 1 i 1 ' i 1 ' ' P ( h | f ) P ( h | f i i i i )P(ri | Gi hi hi ) hi ,hi' {0,1} Outline Introduction Single SNP Genotype Calling Multilocus Genotyping Problem Experimental Results Conclusion Pipeline for LD-Based Genotype Calling Mapped reads & confidence values Read sequences Reference genome sequence >gnl|ti|1779718824 name:EI1W3PE02ILQXT TCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTT >gnl|ti|1779718824 name:EI1W3PE02ILQXT GTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATT TCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTT CTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGT GTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATT >gnl|ti|1779718824 name:EI1W3PE02ILQXT TAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC CTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGT TCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTT >gnl|ti|1779718825 name:EI1W3PE02GTXK0 TAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC GTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATT TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTT >gnl|ti|1779718825 name:EI1W3PE02GTXK0 CTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGT TAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGC TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTT TAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC AGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCA TAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGC >gnl|ti|1779718825 name:EI1W3PE02GTXK0 GGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA AGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCA TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTT GGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA TAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGC AGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCA GGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA >gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assembly GAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAG CTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTG >gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference GATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGT assembly AATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCC GAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAG CTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATAT >gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference CTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTG TTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGA assembly GATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGT AATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAG GAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAG AATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCC TCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGA CTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTG CTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATAT ATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC GATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGT TTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGA Quality scores AATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCC AATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAG CTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATAT TCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGA TTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGA ATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC AATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAG TCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGA ATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC >gnl|ti|1779718824 name:EI1W3PE02ILQXT 28 28 28 28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21 7 27 42 35 21 6 28name:EI1W3PE02ILQXT 43 36 22 10 27 42 35 20 6 28 43 36 22 9 >gnl|ti|1779718824 2840 4434 3614 2444 1436423 2813 28 28 2742 2835 2621 26735 26 28 2828 2843 2836 2622 28928 2 27 >gnl|ti|1779718824 3 name:EI1W3PE02ILQXT 28 27 4240 3534 21186 28 4328 3628 2227 1033 2724 4226 3528 2028628 28 40 43 33 36 14 22 28 9 36 27 28 2828 2843 28 26 28 22827 35 21 26 26 37928 29 28 28 28 28 27 2813 2828 3742 28 27 27735 2826 36 28 37 36 22 2840 4434 3614 2444 1436 423 28 28 27 28 26 26 27 4240 3534 21 6 28 43 36 22 10 2724 4228 3528 2027 2828 4337 3629 2236 9 27 28 28 28 27 28 28 18 3 28 28 28 27 28 33 24 26 28 28628 28 40 33 14 28 36 27 27 28 27 28 4326 3626 22 9 28 44 36 24 14 4 28 28 28 27 28 26 26 35 26 28 33 23 28 33 23 28 36 27 33 23 28 35 25 28 28 36 37 29 28 28 28 28 27 28 28 28 37 28 27 27 28 36 2827 3736 27 40 3428 1828328 28 28 28 27 33 24 26 28 28 28 40 33 14 28 36 27 28 28 24 28 37 29 28 19 28 26 37 29 26 39 33 13 37 28 27 28 28 28 24 28 28 27 28 28 37 29 36 27 27 28 2728 28 26 2628 3733 29 28 28 28 28 27 28 28 28 37 28 27 27 28 36 28 37 28 21 24 28 27 41 34 15 28 36 27 26 28 24 35 27 28 23 28 33 23 28 36 27 33 23 28 35 25 28 28 36 27 3640 2734 15 28 2828 2828 2728 2824 2828 2837 2429 2828 2819 2728 2826 2837 3729 2926 3639 2733 2713 2837 2728 28 28 3328 2321 2824 3328 2327 2841 3634 2715 3328 2336 2827 3526 2528 2824 2835 3627 2728 3640 2734 15 28 28 28 24 28 37 29 28 19 28 26 37 29 26 39 33 13 37 28 28 28 21 24 28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15 SNP genotype calls Hapmap haplotypes 90 209342 16 F 0 0 2110001?0100210010011002122201210211?1221220212000 90 209342 16 F 0180 F 15 16 21100012010021001001100?100201?10111110111?0212000 2110001?0100210010011002122201210211?1221220212000 15 16 M00 18 F 15 90 20934221120010012001201001120010110101011111011110212000 16 F 021100012010021001001100?100201?10111110111?0212000 0 7M 00 15 M 00 2110001?0100210010011002122201210211?1221220212000 2110001001000200122110001111011100111?121210222000 21120010012001201001120010110101011111011110212000 18 F 15 168 F 0 0 7M 00 21100012010021001001100?100201?10111110111?0212000 011202100120022012211200101101210211122111?0120000 15 M2110001001000200122110001111011100111?121210222000 00 8 F 0 12 0 F 9 10 21120010012001201001120010110101011111011110212000 21100010010002001221100010110111001112121210220000 7 M 0011202100120022012211200101101210211122111?0120000 0 M 00 12 F 99 10 2110001001000200122110001111011100111?121210222000 011?001?012002201221120010?10121021112211110120000 8 F 0 21100010010002001221100010110111001112121210220000 0 11 M 7 8 9 M 0 0 011202100120022012211200101101210211122111?0120000 21100210010002001221100012110111001112121210222000 011?001?012002201221120010?10121021112211110120000 12 F 9 10 11 M 7 8 21100010010002001221100010110111001112121210220000 9 M 021100210010002001221100012110111001112121210222000 0 011?001?012002201221120010?10121021112211110120000 11 M 7 8 21100210010002001221100012110111001112121210222000 … … … … … … … rs12095710 T T 9.988139e-01 rs12127179 C T 9.986735e-01 rs11800791 G G 9.977713e-01 rs11578310 G G 9.980062e-01 rs1287622 G G 8.644588e-01 rs11804808 C C 9.977779e-01 rs17471528 A G 5.236099e-01 rs11804835 C C 9.977759e-01 rs11804836 C C 9.977925e-01 rs1287623 G G 9.646510e-01 rs13374307 G G 9.989084e-01 rs12122008 G G 5.121655e-01 rs17431341 A C 5.290652e-01 rs881635 G G 9.978737e-01 rs9700130 A A 9.989940e-01 rs11121600 A A 6.160199e-01 rs12121542 A A 5.555713e-01 rs11121605 T T 8.387705e-01 rs12563779 G G 9.982776e-01 rs11121607 C G 5.639239e-01 rs11121608 G T 5.452936e-01 rs12029742 G G 9.973527e-01 rs562118 C C 9.738776e-01 rs12133533 A C 9.956655e-01 rs11121648 G G 9.077355e-01 rs9662691 C C 9.988648e-01 rs11805141 C C 9.928786e-01 rs1287635 C C 6.113270e-01 Datasets Watson Sequencing data: 74.4 million 454 reads (of 106.5 million reads used in [Wheeler et al 08]) Reference panel: CEU genotypes from Hapmap r23a phased using the ENT algorithm [Gusev et al. 08] Ground truth: duplicate Affymetrix 500k SNP genotypes Read length distribution 2000000 1500000 1000000 500000 0 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 Datasets (contd.) NA18507 (Illumina & SOLiD) Sequencing data: 525 million Illumina reads (36bp, paired) and 764 million SOLiD reads (24 - 44bp, unpaired) Reference panel: YRI haplotypes from Hapmap r22 Ground truth: Hapmap r22 genotypes excluding NA18507 haplotypes Mapping Procedure 454 reads mapped on human genome build 36.3 using the NUCMER tool of the MUMmer package [Kurtz et al 04] with default parameters Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels) Reads meeting above conditions at multiple genome positions (likely coming from genomic repeats) were discarded Illumina and SOLiD reads mapped using MAQ [Li,Ruan&Durbin 08] with default parameters For reads mapped at multiple positions MAQ returns best position (breaking ties arbitrarily) together with mapping confidence We filtered bad alignments and discarded paired end reads that are not mapped in pairs using the “submap -p” command Mapping statistics Dataset Raw reads Raw sequence Mapped reads Test SNPs Avg. mapped SNP cov. Watson 74.2M 19.7Gb 49.8M (67%) 443K 5.85x 2.85M 6.10x 2.85M 3.21x NA18507 Illumina 525M 18.9Gb 397M (78%) NA18507 SOLiD 764M 21.15Gb 324M (42%) Concordance vs. avg. coverage (Watson 454 reads) 100 90 % Concordance 80 70 60 50 Binomial (Homo) HMM-Posterior (Homo) Binomial (Het) 40 30 HMM-Posterior (Het) 20 10 0 0 1 2 3 Avg. Coverage 4 5 6 Tradeoff with call rate (5.85x Watson 454 reads, homo SNPs) 100 % concordance 99.5 99 98.5 98 97.5 97 0 10 20 30 40 % uncalled 1SNP-Posterior Binomial0.01 HMM-Posterior 50 Tradeoff with call rate (5.85x Watson 454 reads, het SNPs) 100 98 % concordance 96 94 92 90 88 86 84 82 80 0 5 10 15 20 25 30 35 40 % uncalled 1SNP-Posterior Binomial0.01 HMM-Posterior 45 50 Concordance vs. avg. coverage for NA18507 (Illumina & SOLiD reads) 100 90 % Concordance 80 70 60 Binomial (Homo) Illumina HMM-Posterior (Homo) Illumina 50 Binomial (Het) Illumina 40 HMM-Posterior (Het) Illumina 30 Binomial (Homo) SOLiD HMM-Posterior (Homo) SOLiD 20 Binomial (Het) SOLiD 10 HMM-Posterior (Het) SOLiD 0 0 1 2 3 Avg. Coverage 4 5 6 100% 100% 99% 90% 98% 80% 70% 97% 60% 96% 50% 95% 40% 94% 30% 93% 20% 92% 10% 91% 0% -4.5 -4 -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 log(cM/Mb) Concordance (homo) Concordance (het) % of homo % of het % Hapmap SNPs % Concordance Effect of local recombination rate (NA18507 Illumina) 20% 90% 18% 80% 16% 70% 14% 60% 12% 50% 10% 40% 8% 30% 6% 20% 4% 10% 2% 0% 0% SNP coverage Concordance (homo) % of homo Concordance (het) % of het % Hapmap SNPs 100% 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 % Concordance Effect of SNP coverage (NA18507 Illumina) Conclusions Posterior decoding algorithm has scalable running time and yields significant improvements in genotyping calling accuracy Improvement depends on the coverage depth (higher at lower coverage), e.g., accuracy achieved by previously proposed binomial test at 5-6x average coverage is achieved by HMM-based posterior decoding algorithm using less than 1/4 of the reads Open source code available at http://dna.engr.uconn.edu/software/GeneSeq/ LD-based genotype calling increasingly attractive as reference panels improve (denser, more samples, more populations) Allows sequencing larger populations for the same cost Ongoing work Haplotype reconstruction Extension to population sequencing data Promising preliminary results using Viterbi-like algorithm based on HF-HMM Removes need for reference panels! Integrated read mapping, SNP identification, and haplotype reconstruction EM algorithm that iteratively refines two full haplotype sequences and read mapping probabilities Integrates read data with LD info available for known SNPs Takes advantage of reads overlapping multiple SNP loci Allows reconstruction of complete sequences for CNVs Reconstruction of complex haplotype spectra mRNA isoforms, quasispecies Acknowledgments Work supported in part by NSF awards IIS-0546457 and DBI-0543365 to IM and IIS-0803440 to YW. SD and YH performed this research as part of the Summer REU program “Bio-Grid Initiatives for Interdisciplinary Research and Education" funded by NSF award CCF-0755373.