Genome Assembly- Background and Strategy


Genome Assembly
Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR,
Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick
Outline
• Stakeholders
• Biology
• NGS Review
• Introduction to Genome Assembly
• Challenges
• Analysis pipeline/strategy
• Tool selection
• Summary (final pipeline)
Stakeholders
• CDC (Centers for Disease Control and Prevention)
• GaTech
• Immunocompromised individuals
• Consumers of seafood
• Prediction group (and subsequent groups)
Stakeholders / Biology / NGS Review / Introduction to Genome Assembly / Challenges /Analysis Pipeline-Strategy / Tool Selection / Summary
Biology…
Image of V. vulnificus
Vibrio vulnificus
• Gram-negative
  o Lipopolysaccharide membrane
• Motile, facultative anaerobe
• Halophilic (salt-loving) organism abundant in estuarine ecosystems
• Major cause of seafood-related deaths
Vibrio vulnificus – genome architecture
• Bacterial genomes are coding-dense
  o Introns rare
• Contains plasmids (pYJ016)
• V. vulnificus: ~5.2 Mbp genome
  o GC content: 45-47% (similar to E. coli, ~50%)
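As a rough illustration of the GC content figure quoted for V. vulnificus, the following Python sketch computes the GC fraction of a sequence. The function name `gc_content` and the toy sequence are assumptions made for illustration; they are not part of the original slides.

```python
def gc_content(seq):
    """Return the fraction of G and C among the A/C/G/T bases of seq.

    Ambiguous symbols such as N are ignored (an assumption of this sketch).
    """
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return gc / acgt if acgt else 0.0


# Toy sequence for illustration only, not real Vibrio data.
toy = "AAGACTCCGACTGGGACTTT"
print(f"GC content: {gc_content(toy):.1%}")
```

Applied to a whole assembly, the same calculation gives a quick sanity check against the expected 45-47% range.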
Vibrio navarrensis
• Gram-negative
• Lipopolysaccharide membrane
• Motile, facultative anaerobe
• Moderately halophilic organism
• Some strains do not grow well in moderate to high salt concentrations
Vibrio navarrensis - genomic architecture
NGS - Review
Roche 454 sequencing workflow overview
• Sample input: genomic DNA, BACs, amplicons, cDNA
• Generation of small DNA fragments via shearing
• Ligation of A/B adaptors flanking single-stranded DNA fragments
• Emulsification of beads and fragments in water-in-oil microreactors
• Clonal amplification of fragments bound to beads in microreactors
• Sequencing and base calling
• One fragment, one bead, one read
• 400,000 reads per run
GS FLX Data analysis – flowgram generation
[Figure: flowgram for the sequence TTCTGCGAA. Flow order: T, A, C, G. Signal levels: 1-mer, 2-mer, 3-mer, 4-mer]
Example of homopolymer errors from 454 sequencing data
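The flowgram idea can be sketched in a few lines of Python. This is a minimal, noise-free model, assuming the cyclic flow order T, A, C, G shown on the slide: each flow reports how many consecutive copies of the flowed base are incorporated. The function name `ideal_flowgram` is hypothetical. Real signals are noisy, and rounding a noisy intensity to the wrong integer is what produces homopolymer-length errors.

```python
def ideal_flowgram(seq, flow_order="TACG"):
    """Return a list of (flowed_base, run_length) for a noise-free read."""
    unknown = set(seq) - set(flow_order)
    if unknown:
        # Without this check a base outside the flow order would never be consumed.
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    flows = []
    pos = 0
    flow_index = 0
    while pos < len(seq):
        base = flow_order[flow_index % len(flow_order)]
        run = 0
        # Incorporate every consecutive copy of the flowed base.
        while pos < len(seq) and seq[pos] == base:
            run += 1
            pos += 1
        flows.append((base, run))
        flow_index += 1
    return flows


for base, run in ideal_flowgram("TTCTGCGAA"):
    print(base, run)
```

A run length of 2 for the first T flow corresponds to the 2-mer signal level in the figure; zero-length flows are the "negative" flows where no base is incorporated.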
Example of 454 sff file (text format)
Illumina sequencing overview
[Figure: Illumina workflow. Labels: 0.1-1.0 μg; cBot; GAIIx; user or core facility]
Example of Illumina *.fastq file
@C3PO_0001:2:1:17:1499#0/1
TGAATTCATTGACCATAACAATCATATGCATGATGCAAATTATAATATCATTTTTAGTGACGTCGT
GAATCGTTT
+C3PO_0001:2:1:17:1499#0/1
abaaaaaaaaaaa`a`aa_aaaaaaaaaaaaaaaa_a aaa`aaaaa^aaaaa`a]^`a YZYZ^`NJDJ\_Z
@C3PO_0001:2:1:17:1291#0/1
TGTTTGAGCAAATGATTCATAATAATGTATTTCAATATTTTTAGGAATATCTCCCAATATTGCGCG
TGCTGAATT
+C3PO_0001:2:1:17:1291#0/1
a`_`_\a_aaaa_a^Z^^a[a^aa]a_^_a_``aa `aa`X^X^^`aa_\_]VR`\a_]W\_`_a]a]][\RZV
@C3PO_0001:2:2:1452:1316#0/1
GTCCATCCGCAGCAGCGAATTTTTGACGTCCCCCCCCGAANGGANGNGANNNNGNNGNNNT
NTNNAAANGNNNNN
+C3PO_0001:2:2:1452:1316#0/1
_U a\
`]_`ZP\\_Z^[]aa^a_]XNBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
…
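The four-line FASTQ record layout shown above (header, sequence, "+" line, quality string) can be read with a short Python sketch. It assumes each field sits on a single unwrapped line, and it assumes a Phred+64 quality offset, which is inferred from the quality characters in the example (letters such as `a` and `B`) rather than stated on the slide. The function names and the toy record are hypothetical.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from an unwrapped FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, optionally repeating the name
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def phred_scores(qual, offset=64):
    """Decode a quality string; offset=64 is an assumption for this data."""
    return [ord(ch) - offset for ch in qual]


# Toy record for illustration only.
toy = StringIO("@read1/1\nACGT\n+read1/1\nabBB\n")
for name, seq, qual in read_fastq(toy):
    print(name, seq, phred_scores(qual))
```

Under this encoding, a trailing run of `B` characters decodes to a score of 2, which is how the third record in the example marks its low-quality tail.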
Genome Assembly
Input reads
V. navarrensis
V. vulnificus
2423-01
2009V-1368
08-2462
06-2432
2541-90
08-2435
2756-81
08-2439
07-2444
Introduction to genome assembly
• An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target.
• In addition to contigs, a set of unassembled or partially assembled reads is also given as an output.
• Reads
• Contigs: multiple sequence alignment of reads plus the consensus sequence
• Scaffolds: define the contig order and orientation
• Output (FASTA)
How do we check the quality of our assembly?
METRICS!
• N50
• minimum/maximum contig length
• No. of contigs
• No. of errors
• FRC (feature response curve)
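Of the metrics listed, N50 is the one most often quoted, so a minimal Python sketch may help. It uses the common definition: the length of the contig at which the running total, taken from the longest contig down, first reaches half of the total assembly length. The function name and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical contig lengths for illustration.
contig_lengths = [80, 70, 50, 40, 30, 20]
print("N50:", n50(contig_lengths))
print("min/max contig length:", min(contig_lengths), max(contig_lengths))
print("number of contigs:", len(contig_lengths))
```

N50 rewards contiguity only; it says nothing about whether the long contigs are correct.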
Feature-by-Feature – evaluating de-novo assembly
• BREAKPOINT: Points in the assembly where leftover reads partially align;
• COMPRESSION: Area representing a possible repeat collapse;
• STRETCH: Area representing a possible repeat expansion;
• LOW_GOOD_CVG: Area composed of paired reads at the right distance and with the right orientation but at low coverage;
• HIGH_NORMAL_CVG: Area composed of normal oriented reads but at high coverage;
• HIGH_LINKING_CVG: Area composed of reads with associated mates in another scaffold;
• HIGH_SPANNING_CVG: Area composed of reads with associated mates in another contig;
• HIGH_OUTIE_CVG: Area composed of incorrectly oriented mates (--> -->, <-- -->);
• HIGH_SINGLEMATE_CVG: Area composed of single reads (mate not present anywhere);
• HIGH_READ_COVERAGE: Region in assembly with unexpectedly high local read coverage;
• HIGH_SNP: SNP with high coverage;
• KMER_COV: Problematic k-mer distribution.
Feature-by-Feature – evaluating de-novo assembly
• Most of the traditional metrics used to evaluate assemblies (N50, mean contig size, etc.) emphasize only size, while nothing (or almost nothing) is said about how correct the assemblies are.
• A typical correctness metric (especially in the NGS context) consists of aligning contigs back to an available reference. However, this naive technique simply counts the number of mis-assemblies without attempting to distinguish or categorize them any further.
• After running amosvalidate, each contig is assigned the number of features that correspond to doubtful sequences in the assembly.
• For a fixed feature threshold w, the contigs are sorted by size and, starting from the longest, only those contigs are tallied if their sum of features is ≤ w. For this set of contigs, the corresponding approximate genome coverage is computed, leading to a single point of the Feature-Response curve (FRC).
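The FRC construction described above can be sketched in Python. This is one reading of the slide text, not the authors' implementation: contigs are taken from longest to shortest, and tallying stops as soon as adding the next contig would push the cumulative feature count above the threshold w. The per-contig feature counts would come from a tool such as amosvalidate; the numbers below are made up.

```python
def frc_point(contigs, genome_size, w):
    """Approximate genome coverage reached with at most w features.

    contigs: list of (length, feature_count) pairs.
    """
    covered = 0
    features = 0
    for length, n_features in sorted(contigs, key=lambda c: c[0], reverse=True):
        if features + n_features > w:
            break  # assumption: stop at the first contig that exceeds w
        features += n_features
        covered += length
    return covered / genome_size


def frc_curve(contigs, genome_size, thresholds):
    """Return the list of (w, coverage) points of the curve."""
    return [(w, frc_point(contigs, genome_size, w)) for w in thresholds]


# Hypothetical contigs: (length in bp, number of flagged features).
toy_contigs = [(500, 2), (300, 0), (200, 5)]
for w, coverage in frc_curve(toy_contigs, genome_size=1000, thresholds=[0, 2, 7]):
    print(w, coverage)
```

An assembler whose curve rises steeply reaches high coverage with few suspicious features, which is the property N50 alone cannot show.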
Assembly Challenges
Challenges
Intrinsic
• Genome architecture
  o Repeats
  o Homopolymer runs
  o Sequence complexity
• Chimeras?
• Contaminants
Technical
• Short reads
• Poisson distribution of coverage
• Sequencing errors
• Variable quality
• Sequence tags
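The "Poisson distribution of coverage" point can be made concrete with a small Python sketch. If reads land uniformly at random, the depth at a given base is approximately Poisson with mean c = N·L/G (reads × read length / genome size), so a fraction of about e^(-c) of the genome is expected to receive no reads at all. The read count and genome size below are taken from the slides; the 400 bp read length is an assumption for illustration.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Mean depth c = N * L / G."""
    return n_reads * read_length / genome_size


def poisson_pmf(k, mean):
    """Probability that a base is covered exactly k times."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def fraction_uncovered(mean):
    """Expected fraction of bases with zero coverage, e^(-c)."""
    return poisson_pmf(0, mean)


# 400,000 reads per run and a ~5.2 Mbp genome come from the slides;
# read_length=400 is an assumed value.
c = mean_coverage(n_reads=400_000, read_length=400, genome_size=5_200_000)
print(f"mean coverage: {c:.1f}x")
print(f"expected uncovered fraction: {fraction_uncovered(c):.2e}")
```

Real libraries deviate from this model because of GC bias and repeats, so the Poisson estimate is best read as an optimistic lower bound on the gaps to expect.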
[Figure: analysis pipeline flowchart. Recoverable labels:]
• Inputs: 454 raw reads; Illumina raw reads; published genomes from public databases (V. vulnificus CMCP6, V. vulnificus MO6-24/O, V. vulnificus YJ016)
• Process steps: pre-processing; read stats; statistical analysis; parameter optimization; Illumina/454/hybrid de novo assembly; evaluation; contig merging (all possible combinations of the best 3 assemblers, contigs * 3); align Illumina reads against 454 contigs; reference selection (compare mapping statistics, chosen reference); reference evaluation; align Illumina against the reference; Illumina/(454?) reference-based assembly; nucleotide identity; gap filling; genome finishing
• Outputs: contigs; scaffolds; unmapped reads; draft/finished genome
• Illumina de novo assemblers: Allpaths LG, SOAP DeNovo, Velvet, Abyss, Taipan, Bambus2, SUTTA
• 454 de novo assemblers: Newbler, CABOG, SUTTA
• Hybrid de novo assemblers: Ray, MIRA
• Other tools named: Fastqc, Prinseq, NGS QC, samstats, GAGE, Hawk-eye, Minimus, MAIA, Mac vector, CLC wb, bwa, AMOScmp, DNA Diff, PAGIT, Mauve, GRASS, built-in
Tool Selection - Assembly Algorithm profile
Greedy
• Basic operation: given any read or contig, add one more read or contig until no more reads or contigs are available
  o The contigs grow by "greedy extension", always incorporating the read with the highest-scoring overlap
• Makes the locally optimal choice in the hope of finding a globally optimal one
  o No foresight -> misassembly
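A minimal Python sketch of the greedy strategy, assuming error-free reads on a single strand and exact suffix/prefix overlaps: at each step the pair of sequences with the longest overlap is merged, until no overlap of at least `min_len` remains. The function names and the toy reads are hypothetical; production assemblers use scored, error-tolerant overlaps.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair with the highest-scoring (longest) overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Toy reads for illustration only.
print(greedy_assemble(["ACGATT", "ATTACA", "ACAATAGG", "TAGGTT"]))
```

Because each merge is chosen locally, two reads that share a repeat can be joined even when they come from different loci, which is the "no foresight" failure noted above.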
Greedy (example)
[Figure: overlapping word-level reads such as "It was the best of", "was the best of times,", "the best of times, it", "best of times, it was", "of times, it was the", "times, it was the worst", "times, it was the age", "was the age of wisdom,", "was the age of foolishness,"]
• It was the best of times, it was the [worst/age]
Seed-and-extension
• Variation of the greedy assembler
  o Common in aligners, thus some assemblers/aligners may incorporate this approach
• Particularly designed for short reads based on a contig heuristic scheme
• Prefix-tree data structure
  o A contig is elongated at either end contingent upon the existence of reads with a prefix of minimal length perfectly matching the end of the contig
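The extension rule can be sketched in Python. This is a simplified illustration under stated assumptions: a dictionary keyed on the first `min_overlap` bases stands in for the prefix tree, matching is exact, only the right end is extended, and the first matching read is taken. All names and reads are hypothetical.

```python
from collections import defaultdict


def extend_right(seed, reads, min_overlap=4):
    """Extend seed to the right while some unused read's prefix matches its end."""
    # Stand-in for a prefix tree: index reads by their first min_overlap bases.
    index = defaultdict(list)
    for read in reads:
        index[read[:min_overlap]].append(read)

    contig = seed
    used = {seed}
    while True:
        step = None
        # Prefer the longest perfect overlap between contig end and read start.
        for n in range(len(contig), min_overlap - 1, -1):
            suffix = contig[-n:]
            hits = [r for r in index.get(suffix[:min_overlap], [])
                    if r not in used and len(r) > n and r.startswith(suffix)]
            if hits:
                step = (hits[0], n)
                break
        if step is None:
            return contig
        read, n = step
        used.add(read)
        contig += read[n:]


# Toy reads for illustration only.
reads = ["ACGATTAC", "ATTACAAT", "ACAATAGG", "ATAGGTT"]
print(extend_right("ACGATTAC", reads))
```

Extending the left end works the same way on reverse-ordered suffixes; a real implementation would also stop when several reads disagree on the next base.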
Graph based
Overlap-layout-consensus (OLC): pairwise consensus
• Overlap: find potentially overlapping reads
• Layout: lay out the reads based on matching alignment
• Consensus: derive the DNA sequence consensus by joining read sequences
..ACGATTACAATAGGTT..
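The consensus step of OLC can be illustrated with a short Python sketch. It assumes the overlap and layout steps have already placed each read at an offset in the contig, with no indels, and takes a simple per-column majority vote. The layout below is made up; one read carries a deliberate substitution error to show the vote at work.

```python
from collections import Counter


def consensus(layout):
    """Majority-vote consensus from a gap-free layout.

    layout: list of (offset, read) pairs, offset being the read's start column.
    Columns covered by no read are reported as 'N'.
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


# Hypothetical layout; the third read has a C->G error in column 7.
layout = [
    (0, "ACGATTAC"),
    (3, "ATTACAAT"),
    (3, "ATTAGAAT"),
    (8, "AATAGGTT"),
]
print(consensus(layout))
```

With enough depth, isolated sequencing errors are outvoted, which is why OLC assemblers tolerate the error rates of long reads.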
Graph based
Hamiltonian approach
• Finding an assembled sequence that explains the observed sequence = finding a path through a graph that visits every vertex once
[Figure: overlap graph with three regions labeled "Repeat"]
Graph based
de Bruijn graph
• Basic operation: k-mer approach
• Eulerian approach
Reads: AAGA, ACTT, ACTC, ACTG, AGAG, CCGA, CGAC, CTCC, CTGG, CTTT, …
de Bruijn graph nodes: AAG, AGA, GAC, ACT, CTC, TCC, CCG, CGA, CTG, TGG, GGG, GGA, CTT, TTT
Potential genomes:
AAGACTCCGACTGGGACTTT
AAGACTGGGACTCCGACTTT
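A small Python sketch shows why the slide lists two potential genomes. Each 4-mer becomes an edge from its 3-base prefix to its 3-base suffix; an assembly is then an Eulerian path that uses every edge. Both candidate genomes from the slide decompose into exactly the same 4-mers, so the graph alone cannot choose between them. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(kmer_list):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it leads to."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


genome_a = "AAGACTCCGACTGGGACTTT"
genome_b = "AAGACTGGGACTCCGACTTT"

graph = de_bruijn(kmers(genome_a, 4))
nodes = set(graph) | {dst for dsts in graph.values() for dst in dsts}
print("nodes:", sorted(nodes))
print("same 4-mer multiset:",
      Counter(kmers(genome_a, 4)) == Counter(kmers(genome_b, 4)))
```

The ambiguity comes from the repeated node GAC/ACT: the path can leave the repeat toward CTC or CTG in either order, which is how repeats show up immediately in this representation.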
Branch-and-Bound
• Basic operation: relies on "consistent layouts"; it generates all possible consistent layouts, organizing them as paths in a "double tree" structure rooted at a randomly selected seed read
• Progressive evaluation of optimality criteria encoded by a set of score functions based on the set of overlaps along the layout
Tid-bits of advice
Greedy
• Advantages: guaranteed to find a solution
• Disadvantages: misassembly; easily confused by complex repeats
Seed-and-extension
• Advantages: sensitivity
• Disadvantages: can be very slow; memory usage
OLC
• Advantages: suitable for low-coverage long reads
• Disadvantages: computation of overlaps is time intensive
de Bruijn
• Advantages: repeats are immediately recognized; suitable for high-coverage short reads
• Disadvantages: RAM intensive
Branch-and-Bound
• Advantages: algorithm allows for checks
• Disadvantages: ambiguities delay pruning
Tools of Choice
454 platform assembly
Name | Algorithm | Supporting evidence
Newbler 2.5 | OLC | Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data
CABOG | OLC | Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data
SUTTA | Branch-and-Bound | Feature-by-Feature – Evaluating De Novo Sequence Assembly
Evaluation of 454 assemblers

Genomes Used For Comparison
Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data Brief Bioinform (2012) 13(3): 269-280
Comparison of 454 assemblers using E. coli genome
Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data Brief Bioinform (2012) 13(3): 269-280
Comparison of 454 assemblers using E. coli genome
• The maximum value reached by the bars is the hypothetical reconstruction (HR), defined as the ratio between the assembled bases and the reference length.
• The white section represents the real reconstruction (RR), i.e. the portion of the genome correctly reconstructed by the assemblers.
• The difference between HR and RR, here called erroneous reconstruction (ER), is shown in black.
Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data Brief Bioinform (2012) 13(3): 269-280
Illumina platform assembly
Name | Algorithm | Supporting evidence
ALLPATHS-LG | de Bruijn | GAGE: A critical evaluation of genome assemblies and assembly algorithms
Velvet | de Bruijn | Comparative studies of de novo assembly tools for next-generation sequencing technologies
Taipan | Hybrid (greedy-based and graph) | A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies
SOAPdenovo | de Bruijn | Feature-by-Feature – Evaluating De Novo Sequence Assembly
SUTTA | Branch-and-Bound | Feature-by-Feature – Evaluating De Novo Sequence Assembly
Evaluation of Illumina assemblers

Genomes Used For Comparison
GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567
Comparison of Illumina assemblers
• The best value for each column is shown in bold. For all assemblies
• The Errors column contains the number of misjoins plus indel errors >5 bp for contigs, and the total number of misjoins for scaffolds.
• Corrected N50 values were computed after correcting contigs and scaffolds by breaking them at each error. See the evaluation section in the text for details on how errors were identified.
GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567
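The corrected N50 idea, breaking each contig at every identified error before recomputing N50, can be sketched in Python. This is an illustration of the principle only, not the GAGE scripts: error positions are assumed to be given as offsets within each contig, and the data below are made up.

```python
def n50(lengths):
    """Length at which the running total (longest first) reaches half the sum."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def break_at_errors(length, error_positions):
    """Split a contig of the given length at each error offset."""
    pieces = []
    previous = 0
    for position in sorted(error_positions):
        pieces.append(position - previous)
        previous = position
    pieces.append(length - previous)
    return [piece for piece in pieces if piece > 0]


def corrected_n50(contigs):
    """contigs: list of (length, [error offsets]) pairs."""
    pieces = []
    for length, errors in contigs:
        pieces.extend(break_at_errors(length, errors))
    return n50(pieces)


# Hypothetical assembly: the longest contig contains one misjoin at offset 40.
toy = [(100, [40]), (60, []), (30, [])]
print("raw N50:", n50([length for length, _ in toy]))
print("corrected N50:", corrected_n50(toy))
```

A large gap between raw and corrected N50 indicates that an assembler is buying contiguity with misjoins.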
Comparison of Illumina assemblers
• A "chaff" contig is defined as a single contig <200 bp in length. In many cases, these contigs can be as small as the k-mer size used to build the de Bruijn graph (e.g., 36 bp) and are too short to support any further genomic analysis.
• A duplicated repeat is one that appears in more copies than necessary in the assembly, and a compressed repeat is one that occurs in fewer copies.
GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567
Comparison of Illumina assemblers
• "Misjoin" errors are perhaps the most harmful type, in that they represent a significant structural error. A misjoin occurs when an assembler incorrectly joins two distant loci of the genome, which most often occurs within a repeat sequence.
• We have tallied three types of misjoins: (1) inversions, where part of a contig or scaffold is reversed with respect to the true genome; (2) relocations, or rearrangements that move a contig or scaffold within a chromosome; and (3) translocations, or rearrangements between chromosomes.
GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567
Comparison of Illumina assemblers
• Average contig (A) and scaffold (B) sizes, measured by N50 values, versus error rates, averaged over all three genomes for which the true assembly is known: S. aureus, R. sphaeroides, and human chromosome 14.
• Errors (vertical axis) are measured as the average distance between errors, in kilobases.
• In both plots, the best assemblers appear in the upper right.
GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567
Applicability of assemblers

Genomes used for comparison
A Practical Comparison of De novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. Wenyu Zhang, et al. Plos One. 2011 6: 1-12
Comparison of Illumina assemblers
A Practical Comparison of De novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. Wenyu Zhang, et al. Plos One. 2011 6: 1-12
Comparison of Illumina assemblers
A Practical Comparison of De novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. Wenyu Zhang, et al. Plos One. 2011 6: 1-12
Hybrid Platform Assembly
Name | Algorithm | Supporting evidence
RAY | SBH | Feature-by-Feature – Evaluating De Novo Sequence Assembly
Feature-by-Feature – evaluating de-novo assembly
• COMPRESSION: Area representing a possible repeat collapse;
• LOW_GOOD_CVG: Area composed of paired reads at the right distance and with the right orientation but at low coverage;
• HIGH_OUTIE_CVG: Area composed of incorrectly oriented mates (--> -->, <-- -->);
• HIGH_SINGLEMATE_CVG: Area composed of single reads (mate not present anywhere);
• HIGH_READ_COVERAGE: Region in assembly with unexpectedly high local read coverage;
• KMER_COV: Problematic k-mer distribution.
Feature-by-Feature: evaluating de-novo assembly

Real Data - Long Reads
Feature-by-Feature – evaluating de-novo assembly

Real Data - Short Reads
Final Approach
[Figure: final analysis pipeline flowchart. Recoverable labels:]
• Inputs: 454 raw reads; Illumina raw reads; published genomes from public databases (V. vulnificus CMCP6, V. vulnificus MO6-24/O, V. vulnificus YJ016)
• Process steps: pre-processing; read stats; statistical analysis; parameter optimization; Illumina/454/hybrid de novo assembly; evaluation; contig merging (all possible combinations of the best 3 assemblers, contigs * 3); align Illumina reads against 454 contigs; reference selection (compare mapping statistics, chosen reference); reference evaluation; align Illumina against the reference; Illumina/(454?) reference-based assembly; nucleotide identity; gap filling; genome finishing
• Outputs: contigs; scaffolds; unmapped reads; draft/finished genome
• Illumina de novo assemblers: Allpaths LG, SOAP DeNovo, Velvet, Taipan, SUTTA
• 454 de novo assemblers: Newbler, CABOG, SUTTA
• Hybrid de novo assemblers: Ray
• Other tools named: Fastqc, Prinseq, NGS QC, samstats, GAGE, Hawk-eye, Minimus, MAIA, Mac vector, CLC wb, bwa, AMOScmp, DNA Diff, MUMmer, PAGIT, Mauve, GRASS, built-in
References
1. Finotello, F., et al., Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data. Brief Bioinform, 2012. 13(3): p. 269-80.
2. Vezzi, F., G. Narzisi, and B. Mishra, Feature-by-feature--evaluating de novo sequence assembly. PLoS One, 2012. 7(2): p. e31002.
3. Zhang, W., et al., A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS One, 2011. 6(3): p. e17915.
4. Salzberg, S.L., et al., GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res, 2012. 22(3): p. 557-67.
5. Narzisi, G. and B. Mishra, Comparing de novo genome assembly: the long and short of it. PLoS One, 2011. 6(4): p. e19175.
6. Miller, J.R., S. Koren, and G. Sutton, Assembly algorithms for next-generation sequencing data. Genomics, 2010. 95(6): p. 315-27.
7. Li, Z., et al., Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Brief Funct Genomics, 2012. 11(1): p. 25-37.
8. Lin, Y., et al., Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics, 2011. 27(15): p. 2031-7.
9. Zhang, J., et al., The impact of next-generation sequencing on genomics. J Genet Genomics, 2011. 38(3): p. 95-109.