Improving the Accuracy
of Genome Assemblies
July 17th 2012
Roy Ronen*,1, Christina Boucher*,1, Hamidreza Chitsaz2 and Pavel Pevzner1
1. University of California, San Diego
2. Wayne State University, Michigan
* Contributed equally to this work
≈ $ thousands, ≈ several weeks, ≈ two people
≈ $ billions, ≈ several years, ≈ hundreds of people
High Throughput Sequencing Assemblies
Draft Genome from HTS
Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → Analysis, Analysis, Analysis
HTS assemblies (contigs) still contain an abundance of errors:
• 20-30 substitution errors per 100 kbp with SOAPdenovo.
• 5-20 substitution errors per 100 kbp with Velvet.
• Small (<50 bp) indel errors.
• Misassemblies, large indels, etc.
Errors in the assembled contigs will profoundly affect any downstream analysis.
Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → SEQuel → Refined Contigs → Analysis, Analysis, Analysis
De Bruijn Graph for Fragment Assembly
De Bruijn Graph (Pevzner, Tang, Waterman 2001)
[Figure, built up over six slides: the 3-mers GCC, CCA, CCT, CAT, CTA, CTT, ATT, TAT, TTT and TTA drawn as nodes; fewer repeated copies of each 3-mer appear on each successive slide.]
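A minimal sketch of how a de Bruijn graph over 3-mers could be built, added here because the figure itself did not survive extraction. The three reads are an assumption, chosen only because together they contain exactly the ten 3-mers named on the slides; the function names are hypothetical. Each distinct k-mer becomes one node, and an edge joins two k-mers that are adjacent in some read.

```python
def kmers(seq, k):
    """Return all k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(reads, k):
    """Map each distinct k-mer (node) to the set of k-mers that follow it in a read."""
    graph = {}
    for read in reads:
        ks = kmers(read, k)
        for node in ks:
            # Identical k-mers from different reads collapse into one node.
            graph.setdefault(node, set())
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)
    return graph


# Assumed reads (not from the slides); they yield the ten 3-mers shown there.
reads = ["GCCATTA", "GCCTATT", "GCCTTTA"]
graph = de_bruijn(reads, 3)

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

With these reads, GCC gets two successors (CCA and CCT), which is the kind of branching the merged graph makes visible.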
Challenges
[Figure: two genome segments, GCCTAGGAC and CACTTGGCA, each covered by three identical reads and broken into 3-mers: GCC, CCT, CTA, TAG, AGG, GGA, GAC and CAC, ACT, CTT, TTG, TGG, GGC, GCA.]
Genome: ..............GCCTAGGAC.............CACTTGGCA..............
Sequencing errors cause bulges in the de Bruijn graph
[Figure, built up over three slides. The 3-mers of the reads are drawn as nodes (GCC, CCT, CTA, TAG, AGG, GGA, GAC, CAC, ACT, CTT, TTG, TGG, GGC, GCA), with edges labeled by multiplicities from 1 to 4. In the last slide the nodes CTA, TAG and AGG no longer appear, and the sequences ......GCCTTGGAC...... and ......CACTTGGCA...... are shown above the reads.]
Reads: GCCTTGGAC, GCCTAGGAC, GCCTAGGAC, CACTTGGCA, CACTTGGCA, CACTTGGCA
Genome: ..............GCCTAGGAC.............CACTTGGCA..............
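The edge multiplicities in the bulge example can be recomputed from the six reads listed on the slide. The sketch below only counts how often each pair of adjacent 3-mers occurs; it is not a bulge-removal procedure, and the function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return all k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def edge_multiplicities(reads, k):
    """Count how many times each pair of adjacent k-mers occurs across the reads."""
    counts = Counter()
    for read in reads:
        ks = kmers(read, k)
        counts.update(zip(ks, ks[1:]))
    return counts


# Reads as listed on the slide: one read with a substitution (A -> T),
# two correct copies of the same region, three reads from the second region.
reads = ["GCCTTGGAC"] + ["GCCTAGGAC"] * 2 + ["CACTTGGCA"] * 3
edges = edge_multiplicities(reads, 3)

# The two branches between CCT and GGA that form the bulge.
correct_branch = [("CCT", "CTA"), ("CTA", "TAG"), ("TAG", "AGG"), ("AGG", "GGA")]
error_branch = [("CCT", "CTT"), ("CTT", "TTG"), ("TTG", "TGG"), ("TGG", "GGA")]

print([edges[e] for e in correct_branch])  # [2, 2, 2, 2]
print([edges[e] for e in error_branch])    # [1, 4, 4, 1]
```

The erroneous branch passes through CTT, TTG and TGG, which also occur in the second genome segment, so its middle edges carry higher multiplicity than the correct branch.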
The SEQuel Algorithm
Permissively aligned read-pair: a read-pair for which at least one read aligned uniquely.
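The definition above can be written as a one-line predicate. This is a sketch under assumptions: the `ReadAlignment` record and its `num_hits` field are hypothetical stand-ins for whatever an aligner reports, and "aligned uniquely" is read as "aligned to exactly one location".

```python
from dataclasses import dataclass


@dataclass
class ReadAlignment:
    """Hypothetical summary of one read's alignment result."""
    read_id: str
    num_hits: int  # number of locations the read aligned to (0 = unaligned)


def is_permissively_aligned(first, second):
    """True if at least one read of the pair aligned to exactly one location."""
    return first.num_hits == 1 or second.num_hits == 1


# One mate unique, the other ambiguous: the pair is kept.
pair_a = (ReadAlignment("a/1", 1), ReadAlignment("a/2", 5))
# Neither mate unique: the pair is not permissively aligned.
pair_b = (ReadAlignment("b/1", 3), ReadAlignment("b/2", 0))

print(is_permissively_aligned(*pair_a))
print(is_permissively_aligned(*pair_b))
```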
Positional De Bruijn Graph
Positional k-mer: a pair (k-mer, position), e.g. (GCC, 111).
[Figure, built up over three slides: the positional 3-mers (GCC,111), (CCA,112), (CAT,113), (ATT,114), (TTA,115), (CCT,112), (CTT,113), (TTT,114), (GCC,975), (CCT,976), (CTA,977), (TAT,978) and (ATT,979) drawn as nodes. The last slide carries four labels "4" whose placement is not recoverable.]
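A minimal sketch of a positional de Bruijn graph, assuming each read comes with the contig position where its alignment starts. The three reads and their start positions are assumptions chosen to reproduce the thirteen positional 3-mers named on the slides. Nodes are merged only when both k-mer and position match exactly, which is a simplification: a real implementation might allow a small tolerance on position.

```python
def positional_kmers(read, start, k):
    """Return (k-mer, position) pairs for a read whose alignment starts at `start`."""
    return [(read[i:i + k], start + i) for i in range(len(read) - k + 1)]


def positional_de_bruijn(aligned_reads, k):
    """Map each positional k-mer to the set of positional k-mers that follow it."""
    graph = {}
    for read, start in aligned_reads:
        pks = positional_kmers(read, start, k)
        for node in pks:
            # Same k-mer at a different position stays a separate node.
            graph.setdefault(node, set())
        for a, b in zip(pks, pks[1:]):
            graph[a].add(b)
    return graph


# Assumed (read, alignment start) pairs, not taken from the slides.
aligned_reads = [("GCCATTA", 111), ("GCCTTTA", 111), ("GCCTATT", 975)]
pgraph = positional_de_bruijn(aligned_reads, 3)

for node in sorted(pgraph, key=lambda n: (n[1], n[0])):
    print(node, "->", sorted(pgraph[node]))
```

In an ordinary de Bruijn graph the two occurrences of CCT would collapse into a single node; here (CCT, 112) and (CCT, 976) remain distinct, so the two regions are not tangled together.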
The SEQuel Algorithm
partial contig #1: GCCATTA
partial contig #2: GCCTATT
Original contig:
GTATTCCGAGGACCACTGGATTATGA
Partial contigs aligned to the original contig:
GTATTCCGAGGACCAC---TGGATTATGA
GCGGGCCGAGGA
               CAAATGGATTACGA
After substituting the first partial contig:
GCGGGCCGAGGACCAC---TGGATTATGA
After substituting the second partial contig (refined contig):
GCGGGCCGAGGACCACAAATGGATTACGA
Repeat for all contigs.
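The substitution step in the example can be sketched as follows. This assumes the alignment is already given as a gapped copy of the original contig plus, for each partial contig, the alignment column where it starts, and that partial contigs carry no gaps of their own. The function name and the `(start_column, sequence)` representation are hypothetical; the sketch illustrates the slide example only and is not SEQuel's implementation.

```python
def splice_partial_contigs(aligned_contig, placements):
    """Overwrite alignment columns with partial contigs, then drop leftover gaps.

    aligned_contig: original contig with '-' where the alignment opened gaps.
    placements: list of (start_column, partial_contig_sequence).
    """
    columns = list(aligned_contig)
    for start, partial in placements:
        end = start + len(partial)
        if start < 0 or end > len(columns):
            raise ValueError("partial contig falls outside the alignment")
        columns[start:end] = list(partial)
    # Any gap column not covered by a partial contig is removed.
    return "".join(c for c in columns if c != "-")


aligned = "GTATTCCGAGGACCAC---TGGATTATGA"
placements = [(0, "GCGGGCCGAGGA"), (15, "CAAATGGATTACGA")]
refined = splice_partial_contigs(aligned, placements)
print(refined)  # GCGGGCCGAGGACCACAAATGGATTACGA
```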
Results
• Standard and Single-Cell E. coli.
• 100 bp paired-end, Illumina (GAII) reads.
• Mean coverage ≈ 600x.
• Assemblies compared to reference with & without SEQuel.
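For a sense of scale, the coverage figure can be turned into an approximate read count with the usual relation coverage = (number of reads × read length) / genome size. The genome size of roughly 4.6 Mbp for E. coli is an assumption not stated on the slides, and the function name is hypothetical.

```python
def reads_for_coverage(coverage, genome_size, read_length):
    """Number of reads needed to reach a mean coverage (integer arithmetic)."""
    return coverage * genome_size // read_length


GENOME_SIZE = 4_600_000  # assumed approximate E. coli genome size in bp
READ_LENGTH = 100        # from the slide: 100 bp reads
COVERAGE = 600           # from the slide: mean coverage of about 600x

n_reads = reads_for_coverage(COVERAGE, GENOME_SIZE, READ_LENGTH)
print(n_reads)  # 27600000
```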
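Standard E. coli
[Two slides of results; figure content not recoverable.]
Single Cell Sequencing
[Figure comparing Standard and Single Cell (Chitsaz et al., 2011); content not recoverable.]
Single Cell E. coli
[Two slides of results; figure content not recoverable.]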
Summary
• Removed 35% to 96% of small-scale assembly errors.
• Introduced positional de Bruijn graph for contig refinement.
• Demonstrated utility in hard (single-cell) assembly.
• SEQuel can be used in combination with any assembler.
• Freely available at: http://bix.ucsd.edu/SEQuel
Acknowledgments
3P41RR024851-02S1
CCF-1115206