Transcript Lee Katz - Compgenomics2014
Computational assembly for prokaryotic sequencing projects
Lee Katz, Ph.D.
Bioinformatician, Enteric Diseases Laboratory Branch January 15, 2014
Disclaimers
The findings and conclusions in this presentation have not been formally disseminated by the Centers for Disease Control and Prevention and should not be construed to represent any agency determination or policy.
The findings and conclusions in this [report/presentation] are those of the author(s) and do not necessarily represent the official position of CDC
Partners in Public Health
Graduated Oct 2010
CDC 2010 - present
Lee Katz, Present
• Currently in the National Enteric Reference Laboratory
• Vibrio, Campylobacter, Escherichia, Shigella, Yersinia, Salmonella
• Focusing on Listeria and Vibrio
One of my projects is #2 on CDC’s list of accomplishments for 2013!
#2 http://www.cdc.gov/features/endofyear/
Outline
• Sequencing
– 1st gen
– 2nd gen
– 3rd gen
• Reads
– Quality control (Q/C)
– Read metrics
– Read-cleaning
• Assembly
– Algorithms
– Assembly metrics
Prokaryotic Sequencing Projects
Stages
• Sequencing
• Assembly
• Feature prediction
• Functional annotation
• …analysis…
• Display (Genome Browser)
Examples
• Haemophilus influenzae
• Neisseria meningitidis
• Bordetella bronchiseptica
• Vibrio cholerae
• Listeria monocytogenes
Fleischmann et al. (1995) “Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd” Science 269:5223
Kislyuk et al. (2010) “A computational genomics pipeline for prokaryotic sequencing projects” Bioinformatics 26:15
Out with the old; in with the new: Two new technologies to the compgenomics class!
• 454
• Illumina single end reads
• Illumina paired end reads
• PacBio
Sanger Sequencing (1st gen)
Sequencing: first generation
• Shear DNA
• Cloning into bacterial vectors
• Amplification
• Sanger sequencing
Margulies et al. (2005) Genome sequencing in open microfabricated high density picoliter reactors. Nature 437:7057
Sanger sequencing output
• Usually .ab1/.scf file format
454 Sequencing (2nd Gen)
454 Pyrosequencing
• Mix DNA library & capture beads (limited dilution) + PCR reagents + emulsion oil
• Create “Water-in-oil” emulsion
• Perform emulsion PCR
• “Break micro-reactors”
• Isolate DNA containing beads
• Load beads into PicoTiter™Plate
454 Pyrosequencing
Load enzyme beads
PicoTiter™Plate
• Diameter = 44 μm
• Depth = 55 μm
• Well size = 75 pl
• Well density = 480 wells mm⁻²
• 1.6 million wells per slide
Photons generated are captured by CCD camera
454 Pyrosequencing
Sequencing by synthesis Reagent flow
Margulies et al., 2005
454 sequencing output
Flowgram (.sff file format)
• Flow order: T A C G
• Key: TCAG
• Signal height distinguishes 1-mer, 2-mer, 3-mer, and 4-mer homopolymer runs
• Measures the presence or absence of each nucleotide at any given position
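To make the flowgram idea concrete, here is a minimal Python sketch of how per-flow signals could be turned into bases. It assumes the signals are already normalized so that 1.0 corresponds to one incorporated base, and the function name `call_bases` and the example values are hypothetical. Real 454 base callers are more sophisticated than simple rounding.

```python
FLOW_ORDER = "TACG"  # flow order shown on the slide


def call_bases(flow_values, flow_order=FLOW_ORDER):
    """Turn per-flow signal intensities into a base sequence (naive rounding)."""
    bases = []
    for i, signal in enumerate(flow_values):
        nucleotide = flow_order[i % len(flow_order)]
        # Assumption: 1.0 = one base, 2.0 = a 2-mer homopolymer, and so on.
        run_length = int(signal + 0.5)  # round to the nearest whole run length
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Hypothetical flowgram: three cycles of T, A, C, G flows.
example = [1.02, 0.05, 0.98, 0.03,   # T, C
           0.10, 1.10, 0.00, 0.95,   # A, G  (completes the key TCAG)
           2.04, 0.02, 0.00, 3.10]   # TT, GGG
print(call_bases(example))
```

The first eight flows spell out the key (TCAG); the last four illustrate a 2-mer and a 3-mer, which is where rounding errors in homopolymer length tend to arise.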
Illumina sequencing (2nd Gen)
The following animations are courtesy of Illumina, Inc.
Labels on the library fragment: region complementary to P5 grafting primer, Index 2, P5 primer, DNA insert, P7 primer, Index 1, P7 grafting primer, flow cell surface
Animation steps:
• SBS sequencing primer hybridization
• Sequence (Cycle 1)
• Index 1 seq primer hybridization
• Index 1 read – 8 cycles
• Unblock P5 grafting primer
• 7 dark cycles
• Index 2 read – 8 cycles
• Linearization (original strand, new strand)
Illumina sequencing video
• http://www.youtube.com/watch?v=womKfikWlxM
PacBio sequencing* (3rd Gen)
*Pacific Biosciences
Thanks to PacBio for donating some slide materials in this section
• SMRTbell
• Zero-mode waveguide (ZMW), a very fancy and very small well
http://www.youtube.com/watch?v=NHCJ8PtYCFc
Eid et al Science, January 2009/10.1126/science.1162986
PacBio video
http://www.youtube.com/watch?v=NHCJ8PtYCFc
READS
Q/C + cleaning + metrics
Q/C
• You need to know if your data are good!
• Example software
– FastQC
– Computational Genomics Pipeline (CG-Pipeline)
Quality Control
FastQC output
The CG-Pipeline way
run_assembly_readMetrics.pl
File       avgReadLength  totalBases  minReadLength  maxReadLength  avgQuality
tmp.fastq  80.00          177777760   80             80             35.39
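As an illustration of what these columns mean, the following Python sketch computes the same kinds of metrics from a FASTQ stream. It is not the CG-Pipeline implementation: the function names are hypothetical, it assumes four-line FASTQ records with Phred+33 qualities, and it assumes avgQuality is the mean Phred score over all bases.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # skip blank lines between or after records
            continue
        seq = handle.readline().rstrip("\n")
        handle.readline()       # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def read_metrics(handle, phred_offset=33):
    """Summarize read lengths and qualities (assumes Phred+33 encoding)."""
    lengths = []
    quality_sum = 0
    for _name, seq, qual in parse_fastq(handle):
        lengths.append(len(seq))
        quality_sum += sum(ord(ch) - phred_offset for ch in qual)
    if not lengths:
        raise ValueError("no reads found")
    total_bases = sum(lengths)
    return {
        "avgReadLength": total_bases / len(lengths),
        "totalBases": total_bases,
        "minReadLength": min(lengths),
        "maxReadLength": max(lengths),
        "avgQuality": quality_sum / total_bases,
    }


# Tiny made-up input: one Q40 read and one Q0 read.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\n!!!!!!\n")
metrics = read_metrics(demo)
print(metrics)
```

For a real file, the same function could be given an open file handle (or a `gzip.open(path, "rt")` handle for compressed reads).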
Read cleaning
Read cleaning with CG-Pipeline
(not validated; please use with caution)
http://sourceforge.net/projects/cg-pipeline/
Figure: %ACGT and Phred plots for the forward (F.) and reverse (R.) reads. Graphs made with FastqQC (AMOS)
1. Trimming low-qual ends
run_assembly_trimLowQualEnds.pl
http://sourceforge.net/projects/cg-pipeline/
Figure: 1A. %ACGT and 1B. Phred plots for the forward (F.) and reverse (R.) reads. Graphs made with FastqQC (AMOS)
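The idea of trimming low-quality ends can be sketched in a few lines of Python. This is only an illustration, not what run_assembly_trimLowQualEnds.pl actually does: the function name, the threshold of 20, and the Phred+33 encoding are all assumptions.

```python
def trim_low_quality_ends(seq, qual, min_quality=20, phred_offset=33):
    """Strip bases below min_quality from both ends of a read.

    Low-quality bases in the interior of the read are left alone.
    The threshold and the Phred+33 offset are illustrative assumptions.
    """
    scores = [ord(ch) - phred_offset for ch in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < min_quality:
        start += 1  # advance past low-quality bases at the 5' end
    while end > start and scores[end - 1] < min_quality:
        end -= 1    # back up past low-quality bases at the 3' end
    return seq[start:end], qual[start:end]


# '#' encodes Q2 and 'I' encodes Q40 under Phred+33.
print(trim_low_quality_ends("ACGTACGT", "##IIII##"))
```

A read that is low quality along its whole length comes back empty, which a later length filter can then discard.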
2a. Removing duplicate reads
2b. Sometimes: downsampling
run_assembly_removeDuplicateReads.pl
Trimmed reads
http://sourceforge.net/projects/cg-pipeline/
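A rough Python sketch of both steps follows. It assumes a duplicate means an exact sequence match and that downsampling keeps each read independently with a fixed probability; the criteria used by run_assembly_removeDuplicateReads.pl may differ, and the function names here are hypothetical.

```python
import random


def remove_duplicate_reads(reads):
    """Keep the first read seen for each distinct sequence.

    `reads` is an iterable of (name, sequence, quality) tuples.
    Assumption: duplicates are exact sequence matches.
    """
    seen = set()
    unique = []
    for name, seq, qual in reads:
        if seq not in seen:
            seen.add(seq)
            unique.append((name, seq, qual))
    return unique


def downsample(reads, fraction, seed=0):
    """Keep each read with probability `fraction` (seeded for repeatability)."""
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]


demo_reads = [
    ("r1", "ACGT", "IIII"),
    ("r2", "ACGT", "IIII"),  # exact duplicate of r1
    ("r3", "TTTT", "IIII"),
]
unique_reads = remove_duplicate_reads(demo_reads)
print([name for name, _seq, _qual in unique_reads])
```

For paired-end data, a pair would normally be kept or dropped together, which this single-read sketch does not handle.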
3. Trimming and filtering
run_assembly_trimClean.pl
3A. Trimming
3B. Filtering
Parameters: min length, min avg. quality
http://sourceforge.net/projects/cg-pipeline/
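The filtering half of this step can be illustrated with a short Python sketch that applies the two parameters named on the slide, minimum length and minimum average quality. The function name and the example thresholds are hypothetical, Phred+33 is assumed, and this is not the run_assembly_trimClean.pl code.

```python
def passes_filter(seq, qual, min_length, min_avg_quality, phred_offset=33):
    """Return True if a read is long enough and good enough on average."""
    if len(seq) < min_length:
        return False
    if not qual:
        return False  # an empty read has no meaningful average quality
    avg_quality = sum(ord(ch) - phred_offset for ch in qual) / len(qual)
    return avg_quality >= min_avg_quality


candidates = [
    ("long_good", "ACGTACGT", "IIIIIIII"),   # Q40 throughout
    ("short_good", "ACG", "III"),            # too short
    ("long_bad", "ACGTACGT", "########"),    # Q2 throughout
]
kept = [name for name, seq, qual in candidates
        if passes_filter(seq, qual, min_length=5, min_avg_quality=20)]
print(kept)
```

Sensible thresholds depend on the read length and sequencing chemistry, so they are left as required arguments rather than defaults.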
More
• Software
– Fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/
– EA-utils https://code.google.com/p/ea-utils/
– AMOS http://amos.sourceforge.net
– … and more are out there!
• Evaluation
– Del Fabbro et al 2013, “An extensive evaluation of read trimming effects on Illumina NGS data analysis”
ASSEMBLY
Algorithms + metrics
Whole genome sequencing: WGS
Large pieces and de novo assembly
“Business dog” http://www.buzzfeed.com/tiad/business-dog
Whole genome sequencing: WGS
Small pieces and reference assembly
“Business cat” http://www.quickmeme.com/Business-Cat/
Assembly
• Overlaps between reads
• Generate contigs (contiguous sequences)
• Generate scaffolds (contigs joined across gaps, shown as NNNN)
Derive consensus sequence
Aligned reads (“-” marks an alignment gap):
TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA
Consensus:
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
Derive each consensus base by weighted voting
Slide adapted from Andrey Kislyuk, http://www.compgenomics2009.biology.gatech.edu/images/1/12/2009-01-14-compgenomics kislyuk.pdf
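The voting step on the consensus slide can be sketched in Python as below. The five aligned reads are the ones from the slide, with “-” marking alignment gaps whose placement is inferred from the slide's spacing. The function name `consensus` and the equal default weights are assumptions; a real assembler would typically weight each vote by base quality.

```python
from collections import Counter

# The five aligned reads from the slide ('-' = alignment gap, placement inferred).
READS = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
]


def consensus(aligned_reads, weights=None):
    """Call each column by weighted vote; columns won by a gap are dropped."""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    if weights is None:
        weights = [1.0] * len(aligned_reads)  # assumption: equal weights
    called = []
    for column in range(length):
        votes = Counter()
        for read, weight in zip(aligned_reads, weights):
            votes[read[column]] += weight
        winner, _score = max(votes.items(), key=lambda item: item[1])
        if winner != "-":
            called.append(winner)
    return "".join(called)


print(consensus(READS))
```

Passing a `weights` list lets one read outvote the others, which is the simplest way to see the difference between plain majority voting and weighted voting.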
Recap of assembly
Figure: reads, paired end reads, contigs, and a scaffold with its gap shown as NNNNNNNNNN
CG-Pipeline way for Illumina
• run_assembly reads.fastq.gz -o assembly.fasta
• No module yet in CGP for PacBio, unfortunately…
Be on the lookout for several papers that compare Illumina assemblers.
PacBio Assembly
• The following slides are courtesy of PacBio
Finishing Genomes Using Only PacBio ® Reads
Hierarchical Genome Assembly Process (HGAP)
• Utilizes all PacBio data from single, long-insert library
– Longest reads for continuity
– All reads for high consensus accuracy
• Now available through SMRT® Portal in SMRT Analysis v2.0.1
Chin et al (2013), “Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data” Nature Methods. doi 10.1038/nmeth.2474
Hierarchical Genome Assembly Process (HGAP)
1. Start with long ‘seed’ reads
2. Align other reads
3. Build consensus
4. Construct accurate (>99%) pre-assembled reads
HGAP Example - Meiothermus ruber
• 10 kb SMRTbell™ library
• 3 SMRT® Cells (C2-C2 Chemistry, PacBio® RS)
• Long seed reads (>5 kb, 250 Mb)
• Pre-assembly → pre-assembled long reads
• Celera® Assembler → 5 contigs
• Minimus2 → 1 contig
• Polish, Quiver → 1 contig
Result
• Single-contig assembly
• 99.99965% concordance with reference
• 99.3% genes predicted
Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
Polish with Quiver for High Accuracy
Organism: Meiothermus ruber
Assembly size (bases): 3,098,781
Differences with Sanger reference: 11
Concordance with Sanger reference: 99.99965%
Nominal QV: 54.5
SNPs validated as correct PacBio calls: 8
Remaining differences: 1(3)
Figure labels: M. ruber Sanger reference, PacBio® reads, targeted Sanger validation, QV 60
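The concordance and nominal QV reported for M. ruber follow from the assembly size and the number of differences, using the standard Phred relation QV = -10·log10(error rate). The Python sketch below reproduces that arithmetic; the function name is hypothetical.

```python
import math


def concordance_and_qv(assembly_size, differences):
    """Return (percent concordance, Phred-scaled QV) for an assembly."""
    error_rate = differences / assembly_size
    concordance = 100.0 * (1.0 - error_rate)
    # QV is undefined for zero differences; report infinity in that case.
    qv = math.inf if differences == 0 else -10.0 * math.log10(error_rate)
    return concordance, qv


# Values from the M. ruber slide: 3,098,781 bases, 11 differences.
pct, qv = concordance_and_qv(3_098_781, 11)
print(f"{pct:.5f}% concordance, QV {qv:.1f}")
```

With the slide's numbers this gives the reported 99.99965% and QV 54.5.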
Estimated Coverage Targets for Finishing Smaller Genomes
Columns: assembly approach / software tool; recommended PacBio® coverage; additional data sets; genome size constraints
Hierarchical
• SMRT® Analysis implementation of HGAP (uses Celera® Assembler 7.0): 75-100X PacBio CLR; none; < 10 MB (SMRT Portal), < 130 MB (Command Line)
• Celera® Assembler via PacBiotoCA (recent compilation), see Koren et al (2013) http://arxiv.org/abs/1304.3752: 75-100X PacBio CLR; none; similar to above
Hybrid
• Celera® Assembler 7.0 with PacBiotoCA (SMRT® Analysis): 20-50X PacBio CLR; 50X short reads
• ALLPATHS-LG: 50X PacBio 3 kb CLR; 50X Illumina® PE and 50X Illumina® jumping libraries; 20 MB
• MIRA (with PacBiotoCA): 20-50X PacBio CLR; 50X short reads
Scaffolding
• AHA (SMRT Analysis): 10X PacBio CLR; high-confidence contigs; <200 MB, <20,000 contigs
Selected publications with PacBio
• Application
– Katz et al 2013 mBio “Evolutionary Dynamics of Vibrio cholerae O1 following a Single-Source Introduction to Haiti”
– Chin et al 2011 NEJM “The Origin of the Haitian Cholera Outbreak Strain”
– Rasko et al 2011 NEJM “Origins of the E. coli Strain Causing an Outbreak of Hemolytic–Uremic Syndrome in Germany”
• Assemblers
– Chin et al 2013 Nature Methods “Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data”
– Koren et al 2012 Nature Biotechnology “Hybrid error correction and de novo assembly of single-molecule sequencing reads”
– Bashir et al 2012 Nature Biotechnology “A hybrid approach for the automated finishing of bacterial genomes”
The epigenome
• PacBio can also detect epigenetic modifications, especially the methylome
• Roles for DNA methylation and methyltransferases are uncharacterized in most bacteria
• Not your primary task, but it could be a very interesting and novel result of the 2014 compgenomics class
Davis et al 2013, Current Opinion in Microbiology “Entering the era of bacterial epigenomics with single molecule real time DNA sequencing”
ASSEMBLY DIFFICULTIES
One problem: randomly low coverage (Lander-Waterman)
Assuming random distribution of reads and ignoring repeat resolution issues,
G = genome length
L = length of a single read
N = number of reads sequenced
T = minimum overlap to align the reads together
Then overall coverage is
C = LN/G
• Coverage for any given base obeys the Poisson distribution: $P(Y = y) = \frac{C^{y} e^{-C}}{y!}$
• The expected number of gaps (bases with 0 coverage) is: $G \cdot P(Y = 0) = G e^{-C}$
A good Lander-Waterman reference: http://www.math.ucsd.edu/~gptesler/186/shotgun_08-handout.pdf
Lander and Waterman (1988). Genomic Mapping by Fingerprinting random clones: a mathematical analysis. Genomics 2 pp. 231-239 http://www.cmb.usc.edu/papers/msw_papers/msw-081.pdf
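The Lander-Waterman quantities can be worked through numerically. The Python sketch below uses the slide's symbols (G, L, N, T) and computes the coverage C = LN/G, the Poisson probability that a base has zero coverage, and the expected number of uncovered bases. It also reports the classic Lander-Waterman expectation for the number of contigs, N·e^(-C(1 - T/L)), which is where the minimum overlap T enters. The function name and the example numbers are hypothetical.

```python
import math


def lander_waterman(G, L, N, T):
    """Coverage statistics under the Lander-Waterman model.

    G: genome length, L: read length, N: number of reads,
    T: minimum overlap needed to join two reads.
    """
    C = L * N / G                       # overall coverage
    p_zero = math.exp(-C)               # Poisson P(Y = 0)
    return {
        "coverage": C,
        "p_uncovered": p_zero,
        "expected_uncovered_bases": G * p_zero,
        # Expected number of contigs ("islands") in the original model.
        "expected_contigs": N * math.exp(-C * (1 - T / L)),
    }


# Made-up example: 1 Mb genome, 50,000 reads of 100 bp, 20 bp minimum overlap.
stats = lander_waterman(G=1_000_000, L=100, N=50_000, T=20)
for key, value in stats.items():
    print(f"{key}: {value:.4g}")
```

Raising the coverage from 5X to 10X in this example shrinks the expected number of uncovered bases by a factor of e^5, which is why even modest extra sequencing depth closes most random gaps.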
Major Problem: repeat elements
Unipath graph of the 1.8-Mb genome of C. jejuni
Unipaths (copy number): A (1), B (2), C (2), D (1), E (2), F (1), G (1)
Possible paths:
• ABC DBCEF CEG
• ABC EFCDB CEG
Butler J et al. (2008) ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18:810-820
Comparisons of assemblers
• Second-generation
– Zhang et al 2011 “A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies”
– Genome Assembly Gold-standard Evaluations (GAGE) http://gage.cbcb.umd.edu/
– Lin et al 2011, “Comparative studies of de novo assembly tools for next-generation sequencing technologies”
– Don’t ignore many newer assemblers including SPAdes
• Third-generation
– Too new!
Zhang et al 2011
A quick note on reference assembly
• It is possible to map reads to a reference genome using short-read mappers and then derive a consensus sequence
• Some favorites
– Smalt
– BWA
– Bowtie
• Must know how to use samtools for this route
Reference assembly notes
• To my knowledge no paper exists that compares reference assemblers (could be wrong)
• Assembly could be biased
– Miss genomic islands
– Ends of contigs might be tapered
• Good practice for reference assembly: perform de novo assembly on unused reads just in case you missed something
Assembly Metrics
How do you tell if your assembly is good?
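Two of the metrics on this slide, N50 and the assembly score Log(N50/numberOfContigs), are easy to compute from a list of contig lengths. The Python sketch below is an illustration rather than the run_assembly_metrics.pl code; the function names are hypothetical, and the natural logarithm is assumed because it reproduces the assemblyScore in the example output (14.9117787680924 for a single 2,992,976 bp contig).

```python
import math


def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downward, first reaches half of the assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


def assembly_score(contig_lengths):
    """Log(N50 / numberOfContigs); natural log is an assumption."""
    return math.log(n50(contig_lengths) / len(contig_lengths))


single_contig = [2_992_976]  # genomeLength from the slide's example output
print(n50(single_contig), assembly_score(single_contig))
```

Because the score divides by the number of contigs, an assembly in many small pieces is penalized twice: once through a lower N50 and once through the larger denominator.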
Metric: description
• Assembly Length: The size of the concatenated assembly
• Number of Contigs: The count of contigs
• N50: The contig size at which half of the assembly is in contigs of size ≥ N50 and half is in smaller contigs
• Longest Contig
• Average contig length
• Kmer21: Frequency of kmers with k=21
• GC-content: Percentage of the genome that is either G or C
• Assembly score: Log(N50/numberOfContigs)
http://www.homolog.us/blogs/blog/2012/06/26/what-is-wrong-with-n50-how-can-we-make-it-better-part-ii/
Assembly evaluation
QUAST
• http://bioinf.spbau.ru/quast
The CG-Pipeline way
CG-Pipeline/Lee Katz
$ run_assembly_metrics.pl assembly.fasta | column -t
File            genomeLength  N50      numContigs  assemblyScore
assembly.fasta  2992976       2992976  1           14.9117787680924
AMOS
• amosvalidate http://amos.sourceforge.net
– (build this from source to avoid bugs)
Post-assembly manual methods
• View the pileups and see if you agree with base calls
– Hawkeye (AMOS)
– IGV viewer
– Tablet viewer
– Samtools tview (command line interface)
• Compare to other genomes; sort contigs
– MAUVE and MAUVE contig mover
– Mummer (mummerplot)
– Abacas
Post-assembly manual methods
• Close gaps: Scaffold with Illumina
– GRASS
– SSPACE
– SOPRA
– Bambus2 (AMOS)
– IMAGE
– Many others
• Scaffold with PacBio
– AHA (assembler)
Acknowledgements
• Every single compgenomics class
• For help with slides: Eleni Paxinos, Andrew Huang, Amber Schmitke, Maryann Turnsek
• For letting me off work, my supervisor Cheryl Tarr
• Many, many others who I work with on a daily basis
Questions?
Lee Katz [email protected]