Genome Assemblies, Chinese Postmen and Virtual Clusters
Download
Report
Transcript Genome Assemblies, Chinese Postmen and Virtual Clusters
Canadian Bioinformatics Workshops
www.bioinformatics.ca
Next Generation Sequencing:
Mapping & Assembly
Michael Brudno
Department of Computer Science
Banting & Best Dept of Medical Research
University of Toronto
14/05/2008
Questions to Ask
• What do you want to learn?
• Do you have a finished similar genome?
• Do you have matepair data?
•Do you have other read types?
•What compute power do you have available?
What are your options
• What do you want to learn?
-- Variation (in a specie)
-- Conservation (between species)
• You can try to build a genome from scratch
• You can do reference assisted assembly
-- However you will miss some differences
• You can map reads to a reference genome
-- Great if you want to discover variation
-- What if you don’t have a reference?
Mapping versus Assembly
Genome re-sequencing: somatic mutation detection, organismal SNP discovery,
mutational profiling, structural variation discovery
reference genome
SNP
Dealing with nonuniqueness in the genome:
resequenceability
De novo genome sequencing
DEL
Mapping NGS Data
• NGS: at least 3 distinct read types:
– Illumina/Solexa, 454
letter-space
– AB SOLiD
color-space (di-base sequencing)
– 2-pass SMS (Helicos)
2 reads, same location
higher error rates
• Goal: to ALIGN the reads to the reference(BLAST?)
• Need new algorithms
– More data
– SOLiD: Biologists want letters, not colors
– 2-pass: How to best handle two reads?
If I had a VERY fast computer
A
C
T
A
G
A
C
T
T
G
T
C
C
A
G
T
Cell being computed
Previously computed cells
M i, j
M i1, j1 S(Ai1,B j1 )
max
M i1, j gap
M i, j1 gap
Finding Exact Matches
Lookup Table
Level 1: size 41
A
C
G
T
210
233
267
417
Implementation
Lookup table
Level 2: size 42
A
C
G
T
A
C
G
T
A
C
G
T
A
C
G
T
A
C
G
T
50
55
60
45
5
6
4
8
13
7
4
10
61
19
28
42
Implementation
Lookup table
Level k: 4k
416 = OUT OF MEMORY
0 0 0 0 0 0
62
17
32
43 0 0 0 0
Spaced Seed Filtering
• Slide fixed window over genome
• Index reads by each seed in genome
Reads
Genome
SHRiMP Overview
Isolate similarity in stages:
1. Spaced Seed Filtering
2. Vectorized Smith-Waterman
}
Common
3. Full Alignment
–
4.
Specialized for SOLiD, 2-pass, Letter-space
Compute p-values (and other statistics)
Outline
1. AB SOLiD Reads
2. 2-pass (SMS) Reads
AB SOLiD: Color-space Sequencing
AB SOLiD reads look like this:
T012233102
TGAGCGTTC
T012033102
TG A ATAGGA
A
0
1
C
1
0
G
2
3
0
1
T
3
2
1
0
C
G
T
0
0
TGAGCGTTC
A
2 |||
3
TGAATAGGA
1
3 2
A
C
0
2
3
G
3
2
1
T
0
AB SOLiD: Color space is complex!
INDELS
SNPs
G: TTGAGTTATGGAT
012210331023
R: 012120331023
TTGACTTATGGAT
TGAGTT
12210
TGACTT
12120
TGAATT
12030
TGATTT
12300
TGAGTTA
122103
TGA-TTA
12-303
TGAGTTTA
1221003
TGAGTATA
1221333
AB SOLiD: Translations
• Look at: 012233102
TGAGCGTTC
|||||||||
|||
• Recall: 012033102
TGAGCGTTC
TGAATAGGA
• 4 translations for every color sequence
0
1
A A C
2
T
0
T
3
A
3
T
C C A G G C G
G G
T
T
G A A
T
C
1
0
2
T
T
C
A C
C
T
2
A
G G A
C G C A A G
T
0
0
1
3
C
0
G
3
2
1
T
0
AB SOLiD: Modified Smith-Waterman
• 4 S-W matrices, one per translation
• Errors transition into other matrix
• ‘Crossover’ penalty charged for errors
TTG GATAC C T C CA AG C GTTC
TTG AGCGTTC
Genome
…
Translation A
Translation C
AB SOLiD: Obligatory Comparison
• SHRiMP and AB Mapper (1.6)
– SHRiMP seed 1111001111
– AB 35_2, 35_3 schemas
• 10,000 35mers
– C. savignyi (173Mb), very high polymorphism
• Considering single top hits only
SHRiMP
AB 35_2
AB 35_3
% mapped
19.83
6.67
10.94
Runtime
13m04
1h24
2h25
AB SOLiD: Resultant Alignments
• SHRiMP emits letter-space alignments
G: 798
GAACCCCTTACAACTGAACCCCTTAC
||X||||||||||||||||||| |||
GAaCCCCTTACAACTGAACCCC-TAC
1 T1211000203110121201000-231
823
T:
R:
25
– Clear to biologists
– Color-space need not be scary!
Outline
1. AB SOLiD Reads
2. 2-pass (SMS) Reads
2-pass SMS Reads
• SMS reads have high error rates
– “Dark bases” (skipped letters)
– Multiple passes are possible
– Ameliorate errors over passes
• Good chance of missing base in one read
• Acceptable chance of getting it in at least one
SMS 2-pass: SHRiMP with 2 reads
CTG-ACT
CAGCA-T
C
S=9
T
G
A
C
T
C
A
G
C
A
T
Match = +4
Mismatch = -3
Gap = -2
SMS 2-pass: SHRiMP with 2 reads
CTG-ACT
CAGCA-T
CTGAC-T
CAG-CAT
C
S=9
C
A
G
C
A
T
T
G
A
C
T
9
1
1
9
8
8
9
1
1
9
9
1
9
3
9
1
9
9
1
1
9
Match = +4
Mismatch = -3
Gap = -2
SMS 2-pass: SHRiMP with 2 reads
C
CTG-ACT
CAGCA-T
CTGAC-T
CAG-CAT
C
S=9
A
C-TG-ACT
CA-GCA-T
CT-GAC-T
C-AG-CAT
C
C
A
—
—
T
—
T
8
8
9
1
1
9
9
1
9
3
9
1
9
9
1
1
9
A
—
—
A
T
T
G
G
—
C
A
A
C
—
T
9
T
C
C
C
1
A
A
—
A
1
C
A
T
G
9
G
S=8
T
Match = +4
Mismatch = -3
Gap = -2
SMS 2-pass: SHRiMP with 2-pass data
• Build a DAG representing the (near) optimal
alignments of the two reads
A
T
C
C
A
—
—
T
—
T
A
—
A
—
C
C
—
A
T
T
G
G
—
C
A
A
C
—
• Generate seeds (short paths) from the DAG
• Do k-mer scan; if seeds encountered align both
reads to the location using vectorized SW.
• Do full WSG alignment for top hits
SMS 2-pass: Results (in brief)
• 10,000 synthetic reads (~25-65 bp)
• 7% deletion,1% insertion, 1% sub rate
• Mapped to Human chromosome 1
• Spaced seed span 9: 111110111
Type
Separate
Profile
WSG
No hits %
0.13
4.91
4.31
Multiple %
26.45
9.34
9.13
Uniq cor %
63.00
74.90
75.84
Runtime
9m
11m
12m
SHRiMP Statistics
• p_chance -- The Probability that the hit matches
the genome by chance
• p_genome -- The probability that the hit has that
many or more mutations
• Normalized odds -- Normalized ratio
p_genome/p_chance
• One read is never sufficient to call a SNP!!!
Statistics Primer
You get tested for a horrible disease which only
strikes 1/thousand people. You test positive. The
test has only a 1% false positive rate (and no false
negatives). What are the odds that you have the
disease?
Test 1 thousand people.
-- 1 will have the disease
-- 10 will test false positive
SNPs are 1/1000, but harmful ones are rarer
SHRiMP Summary
• Fast mapping of short reads to a genome
-- Handles:
• color-space (SOLiD) reads
• 2-pass (SMS) reads
• insertions and deletions
-- Easy to parallelize
• Computation of p-values & other statistics for hits
Whole Genome Shotgun Sequencing
DNA
SEQUENCER
Sanger vs. NGS
reads
ASSEMBLER
C++
contigs
FINISHING
sequence
Insert Size ± Error
Assembler
• Input: set of strings over {A,C,G,T} called reads
• Output: A common superstring of the reads.
– {TACAT, CATAC, ACGTAC} TACATACGTAC
• Initially: Shortest Common Superstring (SCS)
– NP-hard [Gallant et al 1980]
– Over-collapsing of repeats
• Alternate approaches
– de Bruijn graphs [Pevzner, Tang, Waterman 01]
– string graphs [Myers 05]
• Both formulations are NP-hard.
De Bruijn Graphs
• Nodes are (k-1)-mers
• Edges are k-mers
• The set of k-mers is
called a k-spectrum
• Finding shortest string
with given k-spectrum
equivalent to Chinese
Postman
{AGC, ATC, ATT, CAG,
CAT, GCA, TCA, TTC}
TT
TC
AT
CA
GC
AG
Pevzner 1989
Chinese Postman Tours
• Solving Chinese
Postman: An Eulerian
tour is a solution
• Euleriazation: making a
graph Eulerian
• Can be done with
min cost flow:
– Unbalanced nodes are
sources/sinks
– Duplicate all edges used in
flow
{AGC, ATC, ATT, CAG,
CAT, GCA, TCA, TTC}
TT
TC
AT
CA
GC
AG
Pevzner 1989
Motivation: String Graphs
• Several downsides of the de Bruijn approach
– Division into k-mers arbitrary
– Very sensitive to sequencing errors
– Not memory efficient (one node per k-mer)
• Goal
– One node per read (or better)
– No division into k-mers
– Flexibility in the presence of sequencing errors
Myers 2005
How To Build A String Graph (1)
{ACGTAC, CATAC, TACAT}
• Nodes are reads
• Edges are overlaps
• Weights are lengths of nonoverlapping prefix
• Transitively inferable overlaps
TACATACGTAC
ACGTAC
3
5
3
CATAC
2
2
TACAT
How To Build A String Graph(2)
• Build overlap graph
• Remove transitively inferable overlaps
• Collapse Chains
Required
Goal: Find the shortest path using all of the required
edges and any optional ones
Optional
Myers 2005
Simpifying String Graphs
Repeatedly collapse vertices using 4 simple rules
1
1
1
1
2
1
2
1
1
1
2
1
2
1
1
1
1
1
2
1
1
1 1
1
2
1
1
1
1
After all collapsals are performed, the remaining nodes
(conflict nodes) all have in and out degree of at least 2.
1
1
Resolving Conflict Nodes (Repeats)
With Mate Pairs
A’
A B
A
B
B’
A’
B’
?
Does there exist a short path
between A’ and B’?
• Dijkstra’s shortest path algorithm -- bounded
• Greedily join edges if they have enough supporting reads.
Modeling Double-Strandedness
3’
+AAC
-GTT
+CTT
-AAG
+AAC
-GTT
+TCG
-CGA
+TGG
-CCA
AA C
TGG
AA C
Kececioglu 1992
CTT
TCG
+AAC
-GTT
AA C
GGCAAT
5’
ATTGCC
5’
• How can two DNA molecules overlap?
3’
Bidirected Graphs
• A walk has to “match” directions at each node.
GGCAAT
ATTGCC
+AT
-AT
+AA
-TT
+AC
-GT
+GC
-GC
+CA
-TG
+CC
-GG
Using de Bruijn Graphs
The shortest walk that visits every edge at least once (a Chinese postman tour) is the
shortest string with the given k-spectrum [Pevzner 1989]
3’
GTTGGCAAT
ATTGCCAAC
CA
5’
5’
{AAC, ATT, CAA, CCA,
{GTT, TAA, TTG, TGG,
GCC, TGC, TTG}
GGC, GCA, CAA}
3’
AA
CC
AC
GC
CA
AA
CC
AT
AC
GC
AT
GT
GG
TG
TT
GT
GG
TG
TT
Using Bidirected de Bruijn Graphs
The shortest walk that visits every edge at least once (a Chinese postman
tour) is the shortest DNA molecule with the given k-spectrum
3’
GTTGGCAAT
5’
ATTGCCAAC
5’
•
{AAC, ATT, CAA, CCA,
{GTT, TAA, TTG, TGG,
GCC, TGC, TTG}
GGC, GCA, CAA}
3’
+AT
-AT
+AA
-TT
+AC
-GT
+AT
-AT
+AA
-TT
+AC
-GT
+GC
-GC
+CA
-TG
+CC
-GG
+GC
-GC
+CA
-TG
+CC
-GG
Some Comparisons
Assemblers
• Previous (Sanger) Assemblers
• NGS Assemblers
–
–
–
–
–
–
–
–
SSAKE (Jeck et al., 2007)
VCAKE (Warren et al. 2007)
ALLPATHS (Butler et al. 2008)
Edena (Hernandez et al. 2008)
Euler-SR (Chaisson and Pevzner 2008)
SHARCGS (Dohm et al. 2007)
Shorty (Chen and Skiena 2007)
Velvet (Zerbino and Birney, 2008)
Acknowledgments
Paul Medvedev
Assembly
Seunghak Lee
Elango Cheran
Variation
Stephen Rumble
SHRiMP
Vladimir Yanovsky
http://compbio.cs.toronto.edu/shrimp
Rest of the Group
FUNDING: NSERC, CFI, NIH
Acknowledgments
Paul Medvedev
Seunghak Lee
Elango Cheran
Assembly
Variation
Stephen Rumble
SHRiMP
Vladimir Yanovsky
http://compbio.cs.toronto.edu/shrimp
Rest of the Group
FUNDING: NSERC, CFI, NIH