Genotype and Haplotype Reconstruction from LowCoverage Short Sequencing Reads Ion Mandoiu Computer Science and Engineering Department University of Connecticut Joint work with S.

Download Report

Transcript Genotype and Haplotype Reconstruction from LowCoverage Short Sequencing Reads Ion Mandoiu Computer Science and Engineering Department University of Connecticut Joint work with S.

Genotype and Haplotype
Reconstruction from LowCoverage Short Sequencing Reads
Ion Mandoiu
Computer Science and Engineering Department
University of Connecticut
Joint work with S. Dinakar, J. Duitama, Y. Hernández, J. Kennedy,
and Y. Wu
Outline
 Introduction
 Single SNP Genotype Calling
 Multilocus Genotyping Problem
 Experimental Results
 Conclusion
Ultra-high throughput DNA sequencing

Recent massively parallel sequencing technologies
deliver orders of magnitude higher throughput
compared to classic Sanger sequencing
Roche/454 FLX Titanium
400bp reads
400-600Mb/10h run
ABI SOLiD 3
35-50bp reads
5-7.5Gb/3.5-7 day run
Illumina Genome Analyzer II
35-75bp reads
2-3Gb/2 day run
Helicos HeliScope
25-55bp reads
~2.5Gb/day
UHTS enables personal genomics
$100,000,000
C.Venter
[email protected]
$10,000,000
Cost
$1,000,000
$100,000
J. Watson
[email protected]
NA18507
Illumina@36x
SOLiD@12x
$10,000
$1,000
$100
days
weeks
months
Sequencing Time
years
Challenges for medical applications of
sequencing

Sequencing can potentially provide all genetic variations
(SNPs, CNVs, genome rearrangements) at single-base
resolution…



However, medical use requires determination of both alleles
(genotype) at variable loci
Accurate genotype calling is limited by coverage depth due to
random nature of shotgun sequencing
For the Venter and Watson genomes (both sequenced at ~7.5x
average coverage), comparison with SNP genotyping chips has
shown only ~75% accuracy for sequencing based calls of
heterozygous SNPs [Levy et al 07, Wheeler et al 08]
Allele coverage for heterozygous SNPs
(Watson 454 @ 5.85x avg. coverage)
6
5
Variant allele coverage
4
3
2
1
0
-1
-1
0
1
2
3
Reference allele coverage
4
5
6
Allele coverage for heterozygous SNPs
(Watson 454 @ 2.93x avg. coverage)
6
5
Variant allele coverage
4
3
2
1
0
-1
-1
0
1
2
3
Reference allele coverage
4
5
6
Allele coverage for heterozygous SNPs
(Watson 454 @ 1.46x avg. coverage)
6
5
Variant allele coverage
4
3
2
1
0
-1
-1
0
1
2
3
Reference allele coverage
4
5
6
Prior work

Most prior genotype calling methods are based on allele
coverage


[Levy et al 07] and [Wheeler et al 08] require that each allele be
covered by at least 2 reads in order to be called
Combined with hypothesis testing based on the binomial
distribution when calling hets


Binomial probability for the observed number of alleles must be at
least 0.01
[Wendl&Wilson 08] generalize coverage methods to allow an
arbitrary minimum allele coverage k
Prior work (contd.)

MAQ [Li,Ruan&Durbin 08]



Widely used read mapping program
Single SNP genotype calling incorporating read
mapping confidence and quality scores
Mostly tuned for de novo SNP discovery…
What coverage is required?

[Wendl&Wilson 08] estimate that 21x coverage will be
required for sequencing of normal tissue samples based
on idealized theory that “neglects any heuristic inputs”
Do heuristic inputs help?


We propose methods incorporating additional sources of
information extracted from a reference panel such as
Hapmap:

Allele/genotype frequencies

Linkage disequilibrium
Experimental results show significantly improved
genotyping accuracy
Outline
 Introduction
 Single SNP Genotype Calling
 Multilocus Genotyping Problem
 Experimental Results
 Conclusion
Basic assumptions

Known SNP positions

Biallelic SNPs

0 = major allele, 1 = minor allele

SNP genotypes: 0/2 = homozygous major/minor, 1=heterozygous
Incorporating base call and read
mapping uncertainty

ri = set of mapped reads covering SNP locus i

For each read r in ri



r(i) = the allele observed at locus i
 qr ( i ) /10
 r (i )  10
= probability that r(i) is incorrect, where qr(i) is the
phred quality score of r(i)
mr = mapping confidence of r
012100120
Mapped reads with allele 0
Inferred genotypes
Mapped reads with allele 1
Sequencing errors
Incorporating base call and read
mapping uncertainty

ri = set of mapped reads covering SNP locus i

For each read r in ri



r(i) = the allele observed at locus i
 qr ( i ) /10
 r (i )  10
= probability that r(i) is incorrect, where qr(i) is the
phred quality score of r(i)
mr = mapping confidence of r
P(ri | Gi  0) 
 (1   r (i ) )
rri
r ( i ) 0
mr

rri
r ( i ) 1
mr
r (i )
P(ri | Gi  2) 

 1  r ri
P (ri | Gi  1)   
2
 (1   r (i ) )
rri
r ( i ) 1
mr
mr
m


r
rri
r ( i ) 0
r (i )
Single SNP genotype calling

Applying Bayes’ formula:
P( gi ) P(ri | Gi  gi )
P(Gi  gi | ri ) 
g{0,1,2} P(Gi  gi )P(ri | Gi  g)

Where P(Gi  gi ) are genotype frequencies inferred from a
representative panel
Outline
 Introduction
 Single SNP Genotype Calling
 Multilocus Genotyping Problem
 Experimental Results
 Conclusion
Haplotype structure in human populations
HMM model of haplotype frequencies

Similar models proposed in [Schwartz 04, Rastas et
al. 05, Kennedy et al. 07, Kimmel&Shamir 05,
Scheet&Stephens 06]
Graphical Model Representation

F2
H1
H2
…
Fn
Hn
Random variables



F1
Fi = founder haplotype at locus i
Hi = observed allele at locus i
For fully specified model and given haplotype h,
P(H=h|M) can be computed in O(nK2) using forward
algorithm, where n=#SNPs, K=#founders
HF-HMM for multilocus genotype inference
F1
F2
H1
H2
F'1
F'2
H'1
H'2
…
Fn
Hn
…
F'n
H'n
P(f1), P(f’1), P(fi+1|fi),
G1 P(f’i+1|f’i), P(hi|fi),
G2 P(h’i|f’i) trained using
Gn
Baum-Welch algorithm on haplotypes inferred from the populations
of origin for mother/father
R1,1
…
R1,c1
R2,1
…
R2,c2
Rn,1
…
Rn,cn
HF-HMM for multilocus genotype inference
F1
F2
H1
H2
F'1
F'2
H'1
H'2
G1
R1,1
…
…
Fn
Hn
…
F'n
H'n
G2
Gn
P(gi|hi,h’i) set to 1 if h+h’i=gi and to 0
R2,1
…
R2,c2
Rn,1
…
Rn,cn
otherwise
R1,c1
HF-HMM for multilocus genotype inference
F1
F2
H1
H2
F'1
F'2
…
Fn
Hn
…
F'n
P(Ri, j  r | Gi  gi ) definedas in single SNP case
H'1
H'2
G1
R1,1
…
H'n
G2
R1,c1
R2,1
…
Gn
R2,c2
Rn,1
…
Rn,cn
Multilocus genotyping problem
GIVEN:
• Shotgun read sets r=(r1, r2, … , rn)
• Trained HMM models representing LD in populations of origin for
mother/father
• Quality scores & read mapping confidence values
FIND:
• Multilocus genotype g*=(g*1,g*2,…,g*n) with maximum
posterior probability, i.e., g*=argmaxg P(g | r)
Computational complexity
Theorem: maxgP(g | r) cannot be approximated within O(n1 )
unless ZPP=NP
Idea: reduction from the clique problem
Posterior decoding algorithm
1. For each i = 1..n, compute
gi *  argmaxgi P( gi | r)  argmaxgi P( gi , r)
2. Return g*  ( g1*,...,g n *)
Forward-backward computation
P( gi , r )  P(ri | gi ) f
K
…
i 1
i
i
i



 f 1 f , f f , f f , f ( gi )
K
'
'
i
i
fi
i
'
i
'
i
i
i
…
hi
…
f’i
…
h’i
gi
r1,1
…
r1,c 1
ri,1
…
ri,c
i
Rn,1
…
Rn,cn
Forward-backward computation
P( gi , r )  P(ri | gi ) f
K
…
i 1
i
i
i



 f 1 f , f f , f f , f ( gi )
K
'
'
i
i
fi
i
'
i
'
i
i
i
…
hi
…
f’i
…
h’i
gi
r1,1
…
r1,c 1
ri,1
…
ri,c
i
Rn,1
…
Rn,cn
Forward-backward computation
P( gi , r )  P(ri | gi ) f
K
…
i 1
i
i
i



 f 1 f , f f , f f , f ( gi )
K
'
'
i
i
fi
i
'
i
'
i
i
i
…
hi
…
f’i
…
h’i
gi
r1,1
…
r1,c 1
ri,1
…
ri,c
i
Rn,1
…
Rn,cn
Forward-backward computation
P( gi , r )  P(ri | gi ) f
K
…
i 1
i
i
i



 f 1 f , f f , f f , f ( gi )
K
'
'
i
i
fi
i
'
i
'
i
i
i
…
hi
…
f’i
…
h’i
gi
r1,1
…
r1,c 1
ri,1
…
ri,c
i
Rn,1
…
Rn,cn
Forward-backward computation
P( gi , r )  P(ri | gi ) f
K
…
i 1
i
i
i



 f 1 f , f f , f f , f ( gi )
K
'
'
i
i
fi
i
'
i
'
i
i
i
…
hi
…
f’i
…
h’i
gi
r1,1
…
r1,c 1
ri,1
…
ri,c
i
Rn,1
…
Rn,cn
Runtime

Direct recurrences for computing forward probabilities:
 1f , f  P ( f1 ) P ( f1' )
'
i


i
i
f i , f i'

K
K
 
f i1 1 f i'1 1
i 1
f i1 , f i'1
P( f i | f i 1 )P( f i | f )
'
'
i 1
i 1
f i1 , f i'1
( gi 1 )
Runtime reduced to O(nK3) by reusing common terms:
 if , f 
'
i
i
where

i
f i1 , f i'

K

f i'1 1
'
'
i 1
'
'
i 1

P
(
f
|
f
)

( gi 1 )
 f ,f
i
i 1
f ,f
f i'1 1
i
i 1 , f i
K
 if , f ( gi ) 
i
P( f i | f i 1 )  if
i 1
'
i 1
i 1
'
i 1
'
'
P
(
h
|
f
)
P
(
h
|
f
 i i i i )P(ri | Gi  hi  hi )
hi ,hi' {0,1}
Outline
 Introduction
 Single SNP Genotype Calling
 Multilocus Genotyping Problem
 Experimental Results
 Conclusion
Pipeline for LD-Based Genotype Calling
Mapped reads &
confidence values
Read sequences
Reference genome
sequence
>gnl|ti|1779718824 name:EI1W3PE02ILQXT
TCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTT
>gnl|ti|1779718824
name:EI1W3PE02ILQXT
GTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATT
TCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTT
CTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGT
GTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATT
>gnl|ti|1779718824
name:EI1W3PE02ILQXT
TAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC
CTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGT
TCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTT
>gnl|ti|1779718825 name:EI1W3PE02GTXK0
TAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC
GTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATT
TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTT
>gnl|ti|1779718825
name:EI1W3PE02GTXK0
CTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGT
TAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGC
TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTT
TAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC
AGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCA
TAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGC
>gnl|ti|1779718825
name:EI1W3PE02GTXK0
GGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA
AGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCA
TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTT
GGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA
TAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGC
AGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCA
GGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA
>gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference
assembly
GAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAG
CTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTG
>gi|88943037|ref|NT_113796.1|Hs1_111515
Homo sapiens chromosome 1 genomic contig, reference
GATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGT
assembly
AATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCC
GAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAG
CTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATAT
>gi|88943037|ref|NT_113796.1|Hs1_111515
Homo sapiens chromosome 1 genomic contig, reference
CTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTG
TTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGA
assembly
GATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGT
AATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAG
GAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAG
AATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCC
TCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGA
CTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTG
CTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATAT
ATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC
GATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGT
TTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGA
Quality scores
AATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCC
AATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAG
CTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATAT
TCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGA
TTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGA
ATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC
AATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAG
TCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGA
ATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC
>gnl|ti|1779718824 name:EI1W3PE02ILQXT
28 28 28 28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21 7
27 42 35 21 6 28name:EI1W3PE02ILQXT
43 36 22 10 27 42 35 20 6 28 43 36 22 9
>gnl|ti|1779718824
2840
4434
3614
2444
1436423
2813
28 28
2742
2835
2621
26735 26
28 2828
2843
2836
2622
28928
2 27
>gnl|ti|1779718824
3 name:EI1W3PE02ILQXT
28
27 4240
3534
21186 28
4328
3628
2227
1033
2724
4226
3528
2028628
28 40
43 33
36 14
22 28
9 36 27
28 2828
2843
28
26
28
22827
35
21
26
26
37928
29
28
28
28
28
27
2813
2828
3742
28
27
27735
2826
36 28 37
36
22
2840
4434
3614
2444
1436
423
28
28
27
28
26
26
27 4240
3534
21
6
28
43
36
22
10
2724
4228
3528
2027
2828
4337
3629
2236
9 27
28
28
28
27
28
28
18 3 28 28 28 27 28
33
24
26
28
28628
28
40
33
14
28
36 27
27 28 27
28 4326
3626
22
9
28
44
36
24
14
4
28
28
28
27
28
26
26
35
26
28
33
23
28
33
23
28
36
27
33
23
28
35
25
28
28
36
37 29 28 28 28 28 27 28 28 28 37 28 27 27 28 36 2827
3736 27
40 3428
1828328
28
28
28
27
33
24
26
28
28
28
40
33
14
28
36
27
28
28
24
28
37
29
28
19
28
26
37
29
26
39
33
13
37
28 27 28 28 28 24 28 28 27 28 28 37 29 36 27 27 28 2728 28
26 2628
3733
29
28
28
28
28
27
28
28
28
37
28
27
27
28
36
28
37
28
21
24
28
27
41
34
15
28
36
27
26
28
24
35
27
28
23
28
33
23
28
36
27
33
23
28
35
25
28
28
36
27
3640
2734 15
28 2828
2828
2728
2824
2828
2837
2429
2828
2819
2728
2826
2837
3729
2926
3639
2733
2713
2837
2728 28
28 3328
2321
2824
3328
2327
2841
3634
2715
3328
2336
2827
3526
2528
2824
2835
3627
2728
3640
2734 15
28 28 28 24 28 37 29 28 19 28 26 37 29 26 39 33 13 37 28 28
28 21 24 28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15
SNP genotype calls
Hapmap haplotypes
90 209342
16 F 0 0
2110001?0100210010011002122201210211?1221220212000
90 209342
16 F 0180 F 15 16
21100012010021001001100?100201?10111110111?0212000
2110001?0100210010011002122201210211?1221220212000
15 16
M00
18
F
15
90 20934221120010012001201001120010110101011111011110212000
16 F 021100012010021001001100?100201?10111110111?0212000
0 7M 00
15
M
00
2110001?0100210010011002122201210211?1221220212000
2110001001000200122110001111011100111?121210222000
21120010012001201001120010110101011111011110212000
18 F 15
168 F 0 0
7M 00
21100012010021001001100?100201?10111110111?0212000
011202100120022012211200101101210211122111?0120000
15 M2110001001000200122110001111011100111?121210222000
00
8 F 0 12
0 F 9 10
21120010012001201001120010110101011111011110212000
21100010010002001221100010110111001112121210220000
7 M 0011202100120022012211200101101210211122111?0120000
0
M 00
12 F 99 10
2110001001000200122110001111011100111?121210222000
011?001?012002201221120010?10121021112211110120000
8 F 0 21100010010002001221100010110111001112121210220000
0 11 M 7 8
9
M
0
0
011202100120022012211200101101210211122111?0120000
21100210010002001221100012110111001112121210222000
011?001?012002201221120010?10121021112211110120000
12 F 9 10
11 M 7 8
21100010010002001221100010110111001112121210220000
9 M 021100210010002001221100012110111001112121210222000
0
011?001?012002201221120010?10121021112211110120000
11 M 7 8
21100210010002001221100012110111001112121210222000
…
…
…
…
…
…
…
rs12095710 T T 9.988139e-01
rs12127179 C T 9.986735e-01
rs11800791 G G 9.977713e-01
rs11578310 G G 9.980062e-01
rs1287622 G G 8.644588e-01
rs11804808 C C 9.977779e-01
rs17471528 A G 5.236099e-01
rs11804835 C C 9.977759e-01
rs11804836 C C 9.977925e-01
rs1287623 G G 9.646510e-01
rs13374307 G G 9.989084e-01
rs12122008 G G 5.121655e-01
rs17431341 A C 5.290652e-01
rs881635 G G 9.978737e-01
rs9700130 A A 9.989940e-01
rs11121600 A A 6.160199e-01
rs12121542 A A 5.555713e-01
rs11121605 T T 8.387705e-01
rs12563779 G G 9.982776e-01
rs11121607 C G 5.639239e-01
rs11121608 G T 5.452936e-01
rs12029742 G G 9.973527e-01
rs562118 C C 9.738776e-01
rs12133533 A C 9.956655e-01
rs11121648 G G 9.077355e-01
rs9662691 C C 9.988648e-01
rs11805141 C C 9.928786e-01
rs1287635 C C 6.113270e-01
Datasets

Watson



Sequencing data: 74.4 million 454 reads (of 106.5
million reads used in [Wheeler et al 08])
Reference panel: CEU genotypes from Hapmap r23a
phased using the ENT algorithm [Gusev et al. 08]
Ground truth: duplicate Affymetrix 500k SNP genotypes
Read length distribution
2000000
1500000
1000000
500000
0
0
100
200
300
400
500
600
700
800
900
1000 1100 1200
Datasets (contd.)

NA18507 (Illumina & SOLiD)

Sequencing data: 525 million Illumina reads (36bp,
paired) and 764 million SOLiD reads (24 - 44bp,
unpaired)
Reference panel: YRI haplotypes from Hapmap r22

Ground truth: Hapmap r22 genotypes

excluding NA18507 haplotypes
Mapping Procedure

454 reads mapped on human genome build 36.3 using
the NUCMER tool of the MUMmer package [Kurtz et al
04] with default parameters



Additional filtering: at least 90% of the read length matched to
the genome, no more than 10 errors (mismatches or indels)
Reads meeting above conditions at multiple genome positions
(likely coming from genomic repeats) were discarded
Illumina and SOLiD reads mapped using MAQ
[Li,Ruan&Durbin 08] with default parameters


For reads mapped at multiple positions MAQ returns best position
(breaking ties arbitrarily) together with mapping confidence
We filtered bad alignments and discarded paired end reads that
are not mapped in pairs using the “submap -p” command
Mapping statistics
Dataset
Raw
reads
Raw
sequence
Mapped
reads
Test
SNPs
Avg. mapped
SNP cov.
Watson
74.2M
19.7Gb
49.8M
(67%)
443K
5.85x
2.85M
6.10x
2.85M
3.21x
NA18507
Illumina
525M
18.9Gb
397M
(78%)
NA18507
SOLiD
764M
21.15Gb
324M
(42%)
Concordance vs. avg. coverage
(Watson 454 reads)
100
90
% Concordance
80
70
60
50
Binomial (Homo)
HMM-Posterior (Homo)
Binomial (Het)
40
30
HMM-Posterior (Het)
20
10
0
0
1
2
3
Avg. Coverage
4
5
6
Tradeoff with call rate
(5.85x Watson 454 reads, homo SNPs)
100
% concordance
99.5
99
98.5
98
97.5
97
0
10
20
30
40
% uncalled
1SNP-Posterior
Binomial0.01
HMM-Posterior
50
Tradeoff with call rate
(5.85x Watson 454 reads, het SNPs)
100
98
% concordance
96
94
92
90
88
86
84
82
80
0
5
10
15
20
25
30
35
40
% uncalled
1SNP-Posterior
Binomial0.01
HMM-Posterior
45
50
Concordance vs. avg. coverage for
NA18507 (Illumina & SOLiD reads)
100
90
% Concordance
80
70
60
Binomial (Homo) Illumina
HMM-Posterior (Homo) Illumina
50
Binomial (Het) Illumina
40
HMM-Posterior (Het) Illumina
30
Binomial (Homo) SOLiD
HMM-Posterior (Homo) SOLiD
20
Binomial (Het) SOLiD
10
HMM-Posterior (Het) SOLiD
0
0
1
2
3
Avg. Coverage
4
5
6
100%
100%
99%
90%
98%
80%
70%
97%
60%
96%
50%
95%
40%
94%
30%
93%
20%
92%
10%
91%
0%
-4.5 -4 -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5
log(cM/Mb)
Concordance (homo)
Concordance (het)
% of homo
% of het
% Hapmap SNPs
% Concordance
Effect of local recombination rate
(NA18507 Illumina)
20%
90%
18%
80%
16%
70%
14%
60%
12%
50%
10%
40%
8%
30%
6%
20%
4%
10%
2%
0%
0%
SNP coverage
Concordance (homo)
% of homo
Concordance (het)
% of het
% Hapmap SNPs
100%
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
% Concordance
Effect of SNP coverage
(NA18507 Illumina)
Conclusions

Posterior decoding algorithm has scalable running time and
yields significant improvements in genotyping calling
accuracy



Improvement depends on the coverage depth (higher at lower
coverage), e.g., accuracy achieved by previously proposed binomial
test at 5-6x average coverage is achieved by HMM-based posterior
decoding algorithm using less than 1/4 of the reads
Open source code available at
http://dna.engr.uconn.edu/software/GeneSeq/
LD-based genotype calling increasingly attractive as
reference panels improve (denser, more samples, more
populations)

Allows sequencing larger populations for the same cost
Ongoing work

Haplotype reconstruction


Extension to population sequencing data



Promising preliminary results using Viterbi-like algorithm based on
HF-HMM
Removes need for reference panels!
Integrated read mapping, SNP identification, and haplotype
reconstruction

EM algorithm that iteratively refines two full haplotype sequences and
read mapping probabilities

Integrates read data with LD info available for known SNPs

Takes advantage of reads overlapping multiple SNP loci

Allows reconstruction of complete sequences for CNVs
Reconstruction of complex haplotype spectra

mRNA isoforms, quasispecies
Acknowledgments

Work supported in part by NSF awards IIS-0546457 and
DBI-0543365 to IM and IIS-0803440 to YW. SD and YH
performed this research as part of the Summer REU
program “Bio-Grid Initiatives for Interdisciplinary Research
and Education" funded by NSF award CCF-0755373.