Lee Katz - Compgenomics2014


Computational assembly for prokaryotic sequencing projects

Lee Katz, Ph.D.

Bioinformatician, Enteric Diseases Laboratory Branch

January 15, 2014

Disclaimers

The findings and conclusions in this presentation have not been formally disseminated by the Centers for Disease Control and Prevention and should not be construed to represent any agency determination or policy.

The findings and conclusions in this [report/presentation] are those of the author(s) and do not necessarily represent the official position of CDC

Partners in Public Health

Graduated Oct 2010

CDC 2010 - present

Lee Katz, Present

• Currently in the National Enteric Reference Laboratory
• Vibrio, Campylobacter, Escherichia, Shigella, Yersinia, Salmonella
• Focusing on Listeria and Vibrio

One of my projects is #2 on CDC’s list of accomplishments for 2013!

#2 http://www.cdc.gov/features/endofyear/

Outline

• Sequencing
  – 1st gen
  – 2nd gen
  – 3rd gen
• Reads
  – Quality control (Q/C)
    • Read metrics
  – Read-cleaning
• Assembly
  – Algorithms
  – Assembly metrics

Prokaryotic Sequencing Projects

Stages

• Sequencing
• Assembly
• Feature prediction
• Functional annotation
• …analysis…
• Display (Genome Browser)

Examples

• Haemophilus influenzae
• Neisseria meningitidis
• Bordetella bronchiseptica
• Vibrio cholerae
• Listeria monocytogenes

Fleischmann et al. (1995) “Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd” Science 269:5223
Kislyuk et al. (2010) “A computational genomics pipeline for prokaryotic sequencing projects” Bioinformatics 26:15

Out with the old; in with the new: Two new technologies to the compgenomics class!

• 454
• Illumina single end reads
• Illumina paired end reads
• PacBio

Sanger Sequencing (1st gen)

Sequencing: first generation

• Shear DNA
• Cloning into bacterial vectors
• Amplification
• Sanger sequencing

Margulies et al. (2005) Genome sequencing in open microfabricated high density picoliter reactors. Nature 437:7057

Sanger sequencing output

• Usually .ab1/.scf file format

454 Sequencing (2nd Gen)

454 Pyrosequencing

• Mix DNA library & capture beads (limited dilution)
• + PCR reagents + emulsion oil
• Create “water-in-oil” emulsion
• Perform emulsion PCR
• “Break micro-reactors”
• Isolate DNA-containing beads

Load beads into PicoTiter™Plate

454 Pyrosequencing

Load enzyme beads

PicoTiter™Plate

• Diameter = 44 μm
• Depth = 55 μm
• Well size = 75 pl
• Well density = 480 wells mm^-2
• 1.6 million wells per slide

Photons generated are captured by CCD camera

454 Pyrosequencing

Sequencing by synthesis Reagent flow

Margulies et al., 2005

454 sequencing output

Flowgram (.sff file format)

[Figure: example flowgram. Flow order: T A C G. Key: TCAG. Signal heights are labeled 1-mer, 2-mer, 3-mer, and 4-mer.]

Measures the presence or absence of each nucleotide at any given position
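To make the flowgram idea concrete, here is a minimal sketch of how flow signals could be turned into bases. It assumes each flow's signal is simply rounded to the nearest homopolymer length, which is a simplification of what real base callers do; the function name and the example signal values are hypothetical and not taken from the slides.

```python
def flowgram_to_sequence(flow_values, flow_order="TACG"):
    """Convert per-flow signal intensities into a nucleotide string.

    Assumption: a signal near 0 means the nucleotide was absent, near 1
    means one base, near 2 means a 2-mer homopolymer, and so on.
    """
    bases = []
    for index, signal in enumerate(flow_values):
        base = flow_order[index % len(flow_order)]
        # Round to the nearest non-negative integer homopolymer length.
        homopolymer_length = max(0, int(signal + 0.5))
        bases.append(base * homopolymer_length)
    return "".join(bases)


# Made-up signals: the first eight flows spell the key TCAG,
# followed by a 2-mer of T and a 3-mer of G.
example_flows = [1.02, 0.05, 0.98, 0.03, 0.04, 1.01, 0.02, 0.97,
                 2.06, 0.10, 0.00, 3.10]
called = flowgram_to_sequence(example_flows)
```

Because the length of a homopolymer is inferred from signal height, long runs of the same base are where this kind of rounding is most likely to go wrong.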

Illumina sequencing (2nd Gen)

[Figure: library fragment on the flow cell surface. Labels: P5 grafting primer, region complementary to P5 grafting primer, Index 2, P5 primer, DNA insert, P7 primer, Index 1, P7 grafting primer]

The following animations are courtesy of Illumina, Inc.

Animation steps:

• SBS sequencing primer hybridization
• Sequence (cycle 1)
• Index 1 sequencing primer hybridization
• Index 1 read – 8 cycles
• Unblock
• 7 dark cycles
• Index 2 read – 8 cycles
• Linearization (original strand, new strand)

Illumina sequencing video

• http://www.youtube.com/watch?v=womKfikWlxM

PacBio sequencing* (3rd Gen)

*Pacific Biosciences

[Figures: SMRTbell template; zero-mode waveguide (ZMW), a very fancy and very small well]

Thanks to PacBio for donating some slide materials in this section

Eid et al., Science, January 2009, doi 10.1126/science.1162986

PacBio video

http://www.youtube.com/watch?v=NHCJ8PtYCFc

Q/C + cleaning + metrics

READS

Q/C

• You need to know if your data are good!
• Example software
  – FastQC
  – Computational Genomics Pipeline (CG-Pipeline)

Quality Control

FastQC output

Quality Control bioinformatics

FastQC output

The CG-Pipeline way

run_assembly_readMetrics.pl

File       avgReadLength  totalBases  minReadLength  maxReadLength  avgQuality
tmp.fastq  80.00          177777760   80             80             35.39
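For readers who want to see what such read metrics amount to, the following is a rough Python sketch that computes the same columns from a FASTQ stream. It is not the CG-Pipeline implementation: it assumes four-line FASTQ records, Phred+33 quality encoding, and that the average quality is taken per base, all of which are assumptions made for illustration.

```python
import io


def read_metrics(fastq_handle, phred_offset=33):
    """Summarize the reads in an open FASTQ text handle (4 lines per record)."""
    lengths = []
    quality_sum = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        sequence = fastq_handle.readline().strip()
        fastq_handle.readline()  # '+' separator line, ignored
        quality = fastq_handle.readline().strip()
        lengths.append(len(sequence))
        quality_sum += sum(ord(ch) - phred_offset for ch in quality)
    total_bases = sum(lengths)
    return {
        "avgReadLength": total_bases / len(lengths),
        "totalBases": total_bases,
        "minReadLength": min(lengths),
        "maxReadLength": max(lengths),
        "avgQuality": quality_sum / total_bases,  # assumed per-base average
    }


# Two tiny made-up reads: 'I' encodes Q40 and '?' encodes Q30 under Phred+33.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\n??????\n")
metrics = read_metrics(example)
```

The sketch raises an error on an empty file; a production script would need to handle that case and gzip-compressed input.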

Read cleaning


Read cleaning with CG-Pipeline

(not validated; please use with caution)

[Figure: panel 1A, %ACGT per position, and Phred quality per position, for the forward (F.) and reverse (R.) reads. Graphs made with FastqQC (AMOS)]

http://sourceforge.net/projects/cg-pipeline/

1. Trimming low-qual ends

run_assembly_trimLowQualEnds.pl

[Figure: panel 1B, Phred quality per position for the forward (F.) and reverse (R.) reads. Graphs made with FastqQC (AMOS)]

http://sourceforge.net/projects/cg-pipeline/
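The idea behind trimming low-quality ends can be sketched in a few lines of Python. This is an illustration only, not the logic of run_assembly_trimLowQualEnds.pl: the function name, the default threshold of 20, and the Phred+33 encoding are assumptions.

```python
def trim_low_quality_ends(sequence, quality, min_quality=20, phred_offset=33):
    """Strip bases from both ends of a read until a base meets min_quality."""
    scores = [ord(ch) - phred_offset for ch in quality]
    start = 0
    end = len(scores)
    # Walk inward from the 5' end past low-quality bases.
    while start < end and scores[start] < min_quality:
        start += 1
    # Walk inward from the 3' end past low-quality bases.
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    return sequence[start:end], quality[start:end]


# '#' is Q2, '5' is Q20, and 'I' is Q40 under Phred+33.
trimmed_seq, trimmed_qual = trim_low_quality_ends("ACGTACGT", "#5IIII5#")
```

Only the ends are touched here; a low-quality base in the middle of the read is left in place, which is why a separate filtering step on average quality is still useful.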

2a. Removing duplicate reads 2b. Sometimes: downsampling

run_assembly_removeDuplicateReads.pl

Trimmed reads http://sourceforge.net/projects/cg-pipeline/
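As a rough illustration of steps 2a and 2b, the sketch below removes exact duplicate reads and then optionally downsamples. It assumes reads are held in memory as (name, sequence, quality) tuples and that "duplicate" means an identical sequence string; CG-Pipeline's own criteria may differ, and both function names are hypothetical.

```python
import random


def remove_duplicate_reads(reads):
    """Keep the first read seen for each distinct sequence."""
    seen = set()
    unique = []
    for name, sequence, quality in reads:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((name, sequence, quality))
    return unique


def downsample(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    return [read for read in reads if rng.random() < fraction]


reads = [
    ("r1", "ACGT", "IIII"),
    ("r2", "ACGT", "IIII"),  # exact duplicate of r1
    ("r3", "TTGA", "IIII"),
]
unique_reads = remove_duplicate_reads(reads)
```

For paired-end data the forward and reverse reads would need to be deduplicated and sampled together so that pairs stay intact.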


3. Trimming and filtering

run_assembly_trimClean.pl

[Figure: panel 3A, trimming; panel 3B, filtering. Parameters shown: min length, min avg. quality]

http://sourceforge.net/projects/cg-pipeline/
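The filtering half of this step can be sketched as a simple pass over the reads using the two parameters named on the slide, minimum length and minimum average quality. The function names, the tuple layout, and the Phred+33 encoding are assumptions for illustration; this is not the code of run_assembly_trimClean.pl.

```python
def mean_quality(quality, phred_offset=33):
    """Average Phred score of a quality string."""
    return sum(ord(ch) - phred_offset for ch in quality) / len(quality)


def filter_reads(reads, min_length, min_avg_quality, phred_offset=33):
    """Drop reads that are too short or whose average quality is too low."""
    kept = []
    for name, sequence, quality in reads:
        if not sequence or len(sequence) < min_length:
            continue  # too short (or empty after trimming)
        if mean_quality(quality, phred_offset) < min_avg_quality:
            continue  # average quality below the threshold
        kept.append((name, sequence, quality))
    return kept


candidate_reads = [
    ("long_good", "ACGTACGT", "IIIIIIII"),  # Q40 throughout
    ("short", "ACG", "III"),                # fails the length filter
    ("long_bad", "ACGTACGT", "########"),   # Q2 throughout
]
kept_reads = filter_reads(candidate_reads, min_length=5, min_avg_quality=20)
```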

• Software
  – Fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/
  – EA-utils https://code.google.com/p/ea-utils/
  – AMOS http://amos.sourceforge.net
  – … and more is out there!
• Evaluation
  – Fabbro et al 2013, “An extensive evaluation of read trimming effects on Illumina NGS data analysis”

Algorithms + metrics

ASSEMBLY

Whole genome sequencing: WGS

Large pieces and de novo assembly

“Business dog” http://www.buzzfeed.com/tiad/business-dog

Whole genome sequencing: WGS

Small pieces and reference assembly

“Business cat” http://www.quickmeme.com/Business-Cat/

Assembly

• Overlaps between reads
• Generate contigs (contiguous sequences)
• Generate scaffolds

Derive consensus sequence

TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA

Derive each consensus base by weighted voting

Slide adapted from Andrey Kislyuk, http://www.compgenomics2009.biology.gatech.edu/images/1/12/2009-01-14-compgenomics kislyuk.pdf
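Column-by-column voting over the aligned reads can be sketched as follows. The six reads are transcribed from the slide, with "-" marking alignment gaps; the gap placement is a reconstruction and therefore an assumption. The optional weights (for example, derived from base qualities) are hypothetical, and with no weights given this reduces to a plain majority vote.

```python
from collections import Counter

# Aligned reads from the slide; '-' marks an assumed alignment gap.
ALIGNED_READS = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
]


def consensus(aligned_reads, weights=None):
    """Pick the highest-weighted symbol in each column; drop gap columns."""
    if weights is None:
        weights = [1.0] * len(aligned_reads)  # unweighted majority vote
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    result = []
    for column in range(length):
        votes = Counter()
        for read, weight in zip(aligned_reads, weights):
            votes[read[column]] += weight
        winner = max(votes.items(), key=lambda item: item[1])[0]
        if winner != "-":
            result.append(winner)
    return "".join(result)


consensus_sequence = consensus(ALIGNED_READS)
```

Each single-read disagreement (the T, the C within TTCATGG, the G, and the extra A) is outvoted by the other five reads, so none of them reaches the consensus.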

Recap of assembly

[Figure labels: reads, paired end reads, contigs, scaffold (gap drawn as NNNNNNNNNN)]

CG-Pipeline way for Illumina

• run_assembly reads.fastq.gz -o assembly.fasta
• No module yet in CGP for PacBio unfortunately…

Be on the lookout for several papers that compare Illumina assemblers.

PacBio Assembly

• The following slides are courtesy of PacBio

Finishing Genomes Using Only PacBio ® Reads

Hierarchical Genome Assembly Process (HGAP)

• Utilizes all PacBio data from single, long-insert library
  – Longest reads for continuity
  – All reads for high consensus accuracy
• Now available through SMRT® Portal in SMRT Analysis v2.0.1

Chin et al (2013), “Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data”, Nature Methods. doi 10.1038/nmeth.2474

Hierarchical Genome Assembly Process (HGAP)

1. Start with long ‘seed’ reads
2. Align other reads
3. Build consensus
4. Construct accurate (>99%) pre-assembled reads

HGAP Example - Meiothermus ruber

10 kb SMRTbell™ library, 3 SMRT® Cells (C2-C2 Chemistry, PacBio®)

1. Long seed reads (>5 kb)
2. Pre-assembly
3. Pre-assembled long reads
4. Celera® Assembler: 5 contigs
5. Minimus2: 1 contig
6. Polish with Quiver: 1 contig

[Figure labels: >5 kb, 250 Mb]

• Single-contig assembly
• 99.99965% concordance with reference
• 99.3% genes predicted

Collaboration with A. Clum, A. Copeland (Joint Genome Institute)

Polish with Quiver for High Accuracy

Organism: Meiothermus ruber
Assembly size (bases): 3,098,781
Differences with Sanger reference: 11
Concordance with Sanger reference: 99.99965%
Nominal QV: 54.5
SNPs validated as correct PacBio calls: 8
Remaining differences: 1(3)

[Figure labels: QV 60; M. ruber Sanger reference; PacBio® reads; targeted Sanger validation]
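The concordance and nominal QV on this slide are consistent with the usual Phred-style conversion, QV = -10 · log10(error rate), applied to 11 differences over 3,098,781 bases. The slides do not state how the numbers were derived, so treat the following as a plausible reconstruction with hypothetical function names.

```python
import math


def concordance_percent(differences, assembly_size):
    """Percentage of assembly bases that agree with the reference."""
    return 100.0 * (1.0 - differences / assembly_size)


def phred_qv(differences, assembly_size):
    """Phred-scaled quality value for the observed per-base error rate."""
    return -10.0 * math.log10(differences / assembly_size)


# Values from the slide: 11 differences in a 3,098,781-base assembly.
slide_concordance = concordance_percent(11, 3098781)
slide_qv = phred_qv(11, 3098781)
```

On this scale, the QV 60 label in the figure would correspond to one error per million bases.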

Estimated Coverage Targets for Finishing Smaller Genomes

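Columns: Assembly approach / Software tool / Recommended PacBio® coverage / Additional data sets / Genome size constraints

Hierarchical

• SMRT® Analysis implementation of HGAP (uses Celera® Assembler 7.0)
  – Recommended PacBio® coverage: 75-100X PacBio CLR
  – Additional data sets: None
  – Genome size constraints: < 10 MB (SMRT Portal); < 130 MB (Command Line)
• Celera® Assembler via PacBiotoCA (recent compilation); see Koren et al (2013) http://arxiv.org/abs/1304.3752
  – Recommended PacBio® coverage: 75-100X PacBio CLR
  – Additional data sets: None
  – Genome size constraints: Similar to above

Hybrid

• Celera® Assembler 7.0 with PacBiotoCA (SMRT® Analysis)
  – Recommended PacBio® coverage: 20-50X PacBio CLR
  – Additional data sets: 50X short reads
• ALLPATHS-LG
  – Recommended PacBio® coverage: 50X PacBio 3 kb CLR
  – Additional data sets: 50X Illumina®; 50X Illumina® PE jumping libraries
  – Genome size constraints: 20 MB
• MIRA (with PacBiotoCA)
  – Recommended PacBio® coverage: 20-50X PacBio CLR
  – Additional data sets: 50X short reads

Scaffolding

• AHA (SMRT Analysis)
  – Recommended PacBio® coverage: 10X PacBio CLR
  – Additional data sets: High-confidence contigs
  – Genome size constraints: <200 MB; <20,000 contigs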

Selected publications with PacBio

• Application
  – Katz et al 2013 mBio “Evolutionary Dynamics of Vibrio cholerae O1 following a Single-Source Introduction to Haiti”
  – Chin et al 2011 NEJM “The Origin of the Haitian Cholera Outbreak Strain”
  – Rasko et al 2011 NEJM “Origins of the E. coli Strain Causing an Outbreak of Hemolytic–Uremic Syndrome in Germany”
• Assemblers
  – Chin et al 2013 Nature Methods “Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data”
  – Koren et al 2012 Nature Biotechnology “Hybrid error correction and de novo assembly of single-molecule sequencing reads”
  – Bashir et al 2012 Nature Biotechnology “A hybrid approach for the automated finishing of bacterial genomes”

The epigenome

• PacBio can also detect epigenetic modifications, especially the methylome
• Roles for DNA methylation and methyltransferases are uncharacterized in most bacteria
• Not your primary task, but it could be a very interesting and novel result of the 2014 compgenomics class

Davis et al 2013, Current Opinion in Microbiology “Entering the era of bacterial epigenomics with single molecule real time DNA sequencing”

ASSEMBLY DIFFICULTIES


One problem: randomly low coverage (Lander-Waterman)

Assuming random distribution of reads and ignoring repeat resolution issues, let

G = genome length
L = length of a single read
N = number of reads sequenced
T = minimum overlap to align the reads together

Then overall coverage is

C = LN/G

• Coverage for any given base obeys the Poisson distribution: P(Y = y) = C^y e^(-C) / y!
• The expected number of gaps (bases with 0 coverage) is: G · P(Y = 0) = G e^(-C)

A good Lander-Waterman reference: http://www.math.ucsd.edu/~gptesler/186/shotgun_08-handout.pdf

Lander and Waterman (1988). Genomic Mapping by Fingerprinting random clones: a mathematical analysis. Genomics 2 pp. 231-239 http://www.cmb.usc.edu/papers/msw_papers/msw-081.pdf
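A small worked example may help show how quickly uncovered bases disappear as coverage grows. The sketch below uses the slide's symbols (G, L, N, T, C) and the standard Lander-Waterman results; the expected-contig formula N · e^(-C(1 - T/L)) is quoted from the general theory rather than from the slide, and the example genome and read numbers are made up.

```python
import math


def coverage(read_length, num_reads, genome_length):
    """Overall coverage C = LN/G."""
    return read_length * num_reads / genome_length


def prob_depth(depth, c):
    """Poisson probability that a given base is covered exactly `depth` times."""
    return c ** depth * math.exp(-c) / math.factorial(depth)


def expected_uncovered_bases(genome_length, c):
    """Expected number of bases with zero coverage: G * e^(-C)."""
    return genome_length * math.exp(-c)


def expected_contigs(num_reads, c, read_length, min_overlap):
    """Expected number of contigs (islands): N * e^(-C * (1 - T/L))."""
    return num_reads * math.exp(-c * (1.0 - min_overlap / read_length))


# Hypothetical project: 3 Mb genome, 300,000 reads of 100 bp, 20 bp overlap.
G, L, N, T = 3_000_000, 100, 300_000, 20
C = coverage(L, N, G)
uncovered = expected_uncovered_bases(G, C)
contigs = expected_contigs(N, C, L, T)
```

With these numbers the coverage is 10X and roughly 136 bases are expected to have no coverage at all, which is why even a randomly sampled, repeat-free genome rarely assembles into a single contig at modest depth.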

Major Problem: repeat elements

[Figure: unipath graph of the 1.8-Mb genome of C. jejuni. Node labels as shown on the slide: A (1), B (2), C (2), D (1), E (2), F (1), G (1)]

Possible paths:
ABC DBCEF CEG
ABC EFCDB CEG

Butler J et al. (2008) ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18:810-820
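The ambiguity caused by repeats can be reproduced with a short search over the graph. The edge list below is inferred from the two paths listed on the slide (it is not read off the original figure, so treat it as an assumption), and the function enumerates every walk from A to G that uses each edge exactly once.

```python
from collections import Counter

# Edges inferred from the two slide paths; repeated edges appear twice.
EDGES = [
    ("A", "B"), ("B", "C"), ("B", "C"), ("C", "D"), ("D", "B"),
    ("C", "E"), ("C", "E"), ("E", "F"), ("F", "C"), ("E", "G"),
]


def all_paths(edges, start, end):
    """Return every distinct node sequence that uses each edge exactly once."""
    remaining = Counter(edges)
    total_edges = len(edges)
    found = set()

    def walk(node, path):
        if len(path) == total_edges + 1:  # every edge has been used
            if node == end:
                found.add("".join(path))
            return
        for (source, target), count in list(remaining.items()):
            if source == node and count > 0:
                remaining[(source, target)] -= 1
                walk(target, path + [target])
                remaining[(source, target)] += 1  # backtrack

    walk(start, [start])
    return sorted(found)


paths = all_paths(EDGES, "A", "G")
```

Both orderings are consistent with the same set of overlaps, so the reads alone cannot say which one is the true genome; longer reads or paired-end information that spans the repeat are needed to choose.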

Comparisons of assemblers

• Second-generation
  – Zhang et al 2011 “A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies”
  – Genome Assembly Gold-standard Evaluations (GAGE) http://gage.cbcb.umd.edu/
  – Lin et al 2011, “Comparative studies of de novo assembly tools for next-generation sequencing technologies”
  – Don’t ignore many newer assemblers including SPAdes
• Third-generation
  – Too new!

Zhang et al 2011

A quick note on reference assembly

• It is possible to map reads to a reference genome using short-read mappers and then derive a consensus sequence
• Some favorites
  – Smalt
  – BWA
  – Bowtie
• Must know how to use samtools for this route

Reference assembly notes

• To my knowledge no paper exists that compares reference assemblers (could be wrong)
• Assembly could be biased
  – Miss genomic islands
  – Ends of contigs might be tapered
• Good practice for reference assembly: perform de novo assembly on unused reads just in case you missed something

Assembly Metrics

How do you tell if your assembly is good?

Metric: description

• Assembly Length: the size of the concatenated assembly
• Number of Contigs: the count of contigs
• N50: the size of the contig where half the genome is located in contigs of size >N50 and half is located in contigs of size <N50
• Longest Contig
• Average contig length
• Kmer21: frequency of kmers with k=21
• GC-content: percentage of the genome that is either G or C
• Assembly score: Log(N50/numberOfContigs)

References:
http://www.homolog.us/blogs/blog/2012/06/26/what-is-wrong-with-n50-how-can-we-make-it-better-part-ii/
CG-Pipeline/Lee Katz

Assembly evaluation

QUAST

• http://bioinf.spbau.ru/quast

The CG-Pipeline way

$ run_assembly_metrics.pl assembly.fasta | column -t

File            genomeLength  N50      numContigs  assemblyScore
assembly.fasta  2992976       2992976  1           14.9117787680924
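The metrics in this output can be approximated with a short Python function. This is a sketch rather than the CG-Pipeline code: contigs are assumed to be plain strings already parsed from the FASTA file, and the assembly score is assumed to use the natural logarithm, which is the choice that reproduces the value shown above for a single 2,992,976-base contig.

```python
import math


def assembly_metrics(contigs):
    """Compute basic assembly metrics from a list of contig sequences."""
    lengths = sorted((len(contig) for contig in contigs), reverse=True)
    genome_length = sum(lengths)
    # N50: length of the contig at which the running total, taken from the
    # longest contig downward, first reaches half of the assembly length.
    running_total = 0
    n50 = 0
    for length in lengths:
        running_total += length
        if running_total * 2 >= genome_length:
            n50 = length
            break
    gc_bases = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    return {
        "genomeLength": genome_length,
        "N50": n50,
        "numContigs": len(lengths),
        "longestContig": lengths[0],
        "avgContigLength": genome_length / len(lengths),
        "gcContent": 100.0 * gc_bases / genome_length,
        "assemblyScore": math.log(n50 / len(lengths)),  # natural log (assumed)
    }


# Toy assembly: contigs of 50, 30, and 20 bases.
toy = assembly_metrics(["GC" * 25, "AT" * 15, "ACGT" * 5])
# A single contig the size of the one in the output above.
single = assembly_metrics(["A" * 2992976])
```

The score rewards a large N50 and penalizes fragmentation, so splitting the same sequence into more contigs always lowers it.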

AMOS

• amosvalidate
• http://amos.sourceforge.net
  – (build this from source to avoid bugs)

Post-assembly manual methods

• View the pileups and see if you agree with base calls
  – Hawkeye (AMOS)
  – IGV viewer
  – Tablet viewer
  – Samtools tview (command line interface)
• Compare to other genomes; sort contigs
  – MAUVE and MAUVE contig mover
  – Mummer (mummerplot)
  – Abacas

Post-assembly manual methods

• Close gaps: Scaffold with Illumina
  – GRASS
  – SSPACE
  – SOPRA
  – Bambus2 (AMOS)
  – IMAGE
  – Many others
• Scaffold with PacBio
  – AHA (assembler)

Acknowledgements

• Every single compgenomics class
• For help with slides: Eleni Paxinos, Andrew Huang, Amber Schmitke, Maryann Turnsek
• For letting me off work, my supervisor Cheryl Tarr
• Many, many others who I work with on a daily basis

Questions?

Lee Katz [email protected]

