Assembling Genomes from Next-Generation Sequencers Steven Salzberg Center for Bioinformatics and Computational Biology University of Maryland Institute for Advanced Computer Studies http://cbcb.umd.edu.


Assembling Genomes from Next-Generation Sequencers

Steven Salzberg Center for Bioinformatics and Computational Biology University of Maryland Institute for Advanced Computer Studies http://cbcb.umd.edu

Solexa sequencing

What can we do with next-gen sequencers?

1. Assembling genomes from very short reads (part 1)

2. Mapping millions of reads to the human genome (part 2)

• Assemble a bacterial genome entirely from Solexa reads
• Target: a novel strain of Pseudomonas aeruginosa isolated from a frostbite patient
• Every read exactly 33 bp long
• 8,627,900 reads generated
• Approximately 41X coverage
• Just 1/4 of a single Solexa run

Assembly strategy

• Throw every trick in the book at it
• Use related genomes
  – 3 finished strains available
• Use de novo assemblers
• New gene-boosting assembly method

Pseudomonas aeruginosa

• A leading cause of hospital-acquired infections, especially of the lungs
• Leading cause of infections in cystic fibrosis patients
• Large (~6.5 Mbp) bacterial genome
• High GC content: 66%
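As a rough sanity check on the coverage quoted for this data set, coverage is the number of reads times the read length, divided by the genome size. The sketch below uses the read count and read length from the slides; the genome size of 6.5 Mbp is an assumption taken from the approximate figure above (the actual size of the new strain is not stated at this point), and the function name is illustrative.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


# Read count and read length are from the slides.
# The genome size is an assumed value (~6.5 Mbp).
depth = coverage(num_reads=8_627_900, read_length=33, genome_size=6_500_000)
print(f"approximate coverage: {depth:.1f}X")
```

With these inputs the estimate comes out near 44X, in the same range as the approximately 41X quoted; the exact figure depends on the genome size assumed.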

Comparative Assembly

• AMOScmp assembles a genome using a related species
• Fast, accurate assembly
• http://amos.sourceforge.net

Comparative assembly using multiple genomes

[Figure: the target genome, with divergent regions X, Y, and Z, aligned to Reference genome A and Reference genome B, giving Comparative assembly A and Comparative assembly B]

Comparative assembly using multiple genomes

AMOScmp assembly    Contigs    Contigs >200bp    Max contig
PA14 reference      2053       428               170,485
PAO1 reference      2797       865               75,626

Comparative assembly using multiple genomes

[Figure: Assembly A and Assembly B are merged to produce the merged assembly]

Comparative assembly using multiple genomes

AMOS-Cmp assembly   Contigs    Contigs >200bp    Max contig
PA14 reference      2053       428               170,485
PAO1 reference      2797       865               75,626
Merged              1850       306               236,472

De novo assembly

• Several new methods available
• Short reads require long overlaps
  – e.g., 33 bp reads must overlap by 20 bp
  – End-trimming helps
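The overlap requirement can be illustrated with a small sketch that finds the longest suffix of one read matching a prefix of another, subject to a minimum overlap length. This is an exact-match toy, not the method used by any of the assemblers discussed here; the function name, the example sequence, and the default threshold of 20 bp (taken from the example above) are illustrative assumptions.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 20) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink toward min_len.
    for length in range(max_len, min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


# Hypothetical 50 bp sequence; two 33 bp reads taken 10 bp apart
# share 23 bp, which passes a 20 bp minimum.
genome = "ACGTTGCATGCCGATAGCTAGGATCCATTGACGGTACCTGAAGCTTAGCA"
read_a = genome[0:33]
read_b = genome[10:43]
print(suffix_prefix_overlap(read_a, read_b))
```

Raising the minimum above the true overlap (for example `min_len=25` here) makes the two reads appear unconnected, which is why very short reads need high coverage to assemble.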

De novo assembly strategies

• SSAKE (Warren et al., 2007)
  – Uses DNA prefix tree to find k-mer matches
• Edena (Hernandez et al., 2008)
  – Overlap-layout algorithm adapted for short reads
• Velvet (Zerbino and Birney, 2008)
  – Uses de Bruijn graph algorithm plus error correction
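To make the de Bruijn graph idea concrete, here is a minimal sketch that builds such a graph from a set of reads: each k-mer contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix. It is only an illustration of the data structure, not Velvet's implementation; it does no error correction or graph simplification, and the function name and toy reads are assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix node to its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two toy overlapping reads; with k=3 they form a single path
# AC -> CG -> GT -> TC -> CA, spelling ACGTCA.
graph = de_bruijn_graph(["ACGTC", "CGTCA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because overlapping reads share k-mers, they collapse onto the same edges, so the graph size depends on the genome rather than on the number of reads.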

De novo assembler performance

● All three programs run with default parameters on the same data set
● Input: 8.6 million reads
● Platform: 64-bit Opteron, 4 CPUs, 32 GB memory

Program       SSAKE      Edena      Velvet
Version       3.0        2.11       0.5
CPU time      2:24:59    0:28:31    0:08:48
Wall clock    5:08:59    0:28:58    0:10:36

De novo assemblies

All contigs:

Program    # Contigs    N50 (bp)    Sum (bp)      Max contig
SSAKE      185,030      87          14,287,079    5,490
Edena      11,180       837         6,175,460     11,300
Velvet     10,684       1,184       6,841,458     16,239

Contigs >200 bp:

Program    # Contigs >200 bp    N50 (bp)    Sum (bp)     Singletons
SSAKE      12,532               549         6,090,567    3,164,495
Edena      8,316                902         5,759,209    3,955,865
Velvet     7,382                1,252       6,474,426    1,273,164
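For readers unfamiliar with the N50 statistic reported for these assemblies, the sketch below computes it under the usual definition: the contig length L such that contigs of length L or longer contain at least half of all assembled bases. The function name and the toy contig lengths are illustrative, not data from the assemblies.

```python
def n50(lengths):
    """Contig length at which the running total, longest first, reaches half the sum."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Toy contig lengths (sum = 54): 10 + 9 + 8 = 27 reaches half the total.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

N50 rewards a few long contigs over many short ones, which is why SSAKE's very large contig count goes together with a small N50.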

Gene-boosted assembly

[Figure: Contig 1 and Contig 2 separated by a gap. Labels: gap-spanning gene; gap-spanning gene sequence; translated amino acid sequence; translated, mapped reads]

Comparative assembly using multiple genomes

Assembly strategy    Contigs    Contigs >200bp    Max contig
Merged, AMOS-Cmp     1850       306               236,472
Gene-boosted         120        120               512,638

Note: input to Gene-boosted assembly included 306 contigs from Merged assembly

Final assembly

• 76 contigs in one large scaffold, 6.3 Mb
  – Largest contig: 512,638 bp
• Additional 436 small contigs spanning 417 kb
• 9% of the reads unused
• 5602 protein-coding genes
  – 5568 in PAO1
  – 5892 in PA14

Challenges of next-gen sequencing

1. Assembling genomes from very short reads (part 1)

2. Mapping millions of reads to the human genome (part 2)

Short read alignment

• Reads from new sequencing machines are short: 25-50 bp
• And you get MANY of them
• Need to map them back to the human reference

[Figure: human source, sequencing machine, short reads]

Bowtie

• Ultrafast short read alignment software
  – Designed for 25-63 bp reads
• Same sensitivity as Maq, but 35 times faster
• Shares formats with Maq
  – Compatible with Maq's SNP caller
• Open source:
  – http://cbcb.umd.edu/software
  – http://bowtie-bio.sourceforge.net

Bowtie overview

• For each read, finds a 'good' hit to the reference, allowing for mismatches
  – Prefers mismatches at lower-quality bases
  – Can behave like Maq or SOAP
  – Calls SNPs using Maq interface
• Uses Burrows-Wheeler index of the reference genome
  – Pre-built genomes available
  – Can download or build your own

Why Burrows-Wheeler?

• BWT very compact:
  – Approximately ½ byte per base
  – As large as the original text, plus a few "extras"
  – Can fit onto a standard computer with 2GB of memory
• Linear-time search algorithm
  – Proportional to length of query for exact matches

Burrows-Wheeler Transform (BWT)

Text: acaacg$

Burrows-Wheeler Matrix (BWM), with the last column set apart:

$acaac g
aacg$a c
acaacg $
acg$ac a
caacg$ a
cg$aca a
g$acaa c

BWT: gc$aaac
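The construction on this slide can be reproduced with a few lines of code: form every rotation of the text, sort the rotations to get the Burrows-Wheeler Matrix, and read off the last column. This is a naive sketch for small strings only (it materializes every rotation); a genome-scale index such as Bowtie's would not be built this way. The function names are illustrative.

```python
def bwt_matrix(text: str) -> list:
    """All rotations of `text`, sorted lexicographically ('$' sorts before letters)."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations)


def bwt(text: str) -> str:
    """Burrows-Wheeler Transform: the last column of the sorted rotation matrix."""
    return "".join(row[-1] for row in bwt_matrix(text))


# The example text from the slide, terminated by '$'.
for row in bwt_matrix("acaacg$"):
    print(row[:-1], row[-1])
print("BWT:", bwt("acaacg$"))
```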

Burrows-Wheeler Matrix

$acaac g
aacg$a c
acaacg $
acg$ac a
caacg$ a
cg$aca a
g$acaa c

Burrows-Wheeler Matrix

$ acaacg
aacg$ ac
acaacg$
acg$ aca
caacg$ a
cg$ acaa
g$ acaac

See the suffix array?
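Because each row of the matrix begins with a suffix of the text, the sorted rows give the suffix array, and exact matching can proceed one query character at a time from right to left (backward search). The sketch below is a minimal illustration under simplifying assumptions: the occurrence count scans the BWT string directly, so each step is not constant time as it would be in a real index with precomputed rank tables. All names are illustrative, and this is not Bowtie's code.

```python
def build_index(text: str):
    """Suffix array, BWT, and first-column offsets for a '$'-terminated text."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character of a row is the one preceding its suffix (wrapping at 0).
    bwt = "".join(text[i - 1] for i in sa)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total  # row where suffixes starting with ch begin
        total += counts[ch]
    return sa, bwt, first


def find(text: str, pattern: str) -> list:
    """Positions of exact occurrences of `pattern`, via backward search."""
    sa, bwt, first = build_index(text)
    lo, hi = 0, len(text)  # current row range [lo, hi)
    for ch in reversed(pattern):  # one step per query character
        if ch not in first:
            return []
        # Naive rank: count of ch in the BWT prefix (O(n) here, for clarity).
        lo = first[ch] + bwt[:lo].count(ch)
        hi = first[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


print(find("acaacg$", "ac"))
```

The loop runs once per pattern character regardless of text length, which is the sense in which exact-match search is proportional to the length of the query.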

Handling mismatches

[Figure sequence in four steps, for the read acctagattcagaggtcaccataggcacatgcag]

1. Matching begins in one region of the read: don't backtrack to positions in this region of the read
2. Allow mismatches in the other part of the read
3. Flip the read and index around (reversed read: gacgtacacggataccactggagacttagatcca), keeping the same two kinds of region
4. Matching is repeated on the flipped read

Handling mismatches

• Bowtie uses a more complex scheme to allow for more than 1 mismatch
  – Divides the read into a 28 bp "seed" region, which is assumed to be of high quality
  – Divides that into two parts, similar to the 1-mismatch scheme
  – Allows backtracking in each part in separate phases to avoid excessive backtracking
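The two-part idea rests on a simple observation: if a read aligns with at most one mismatch, then at least one of its two halves must match exactly. The sketch below illustrates that observation with plain string search in place of a Burrows-Wheeler index, so it is a simplified stand-in and not Bowtie's algorithm; the reference string and function names are hypothetical, and the read is the one shown on the slides.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_with_one_mismatch(reference: str, read: str) -> list:
    """Start positions where `read` aligns to `reference` with at most 1 mismatch."""
    half = len(read) // 2
    left, right = read[:half], read[half:]
    hits = set()

    # Case 1: left half matches exactly; any mismatch lies in the right half.
    pos = reference.find(left)
    while pos != -1:
        window = reference[pos:pos + len(read)]
        if len(window) == len(read) and hamming(window, read) <= 1:
            hits.add(pos)
        pos = reference.find(left, pos + 1)

    # Case 2: right half matches exactly; any mismatch lies in the left half.
    pos = reference.find(right)
    while pos != -1:
        start = pos - half
        if start >= 0 and hamming(reference[start:start + len(read)], read) <= 1:
            hits.add(start)
        pos = reference.find(right, pos + 1)

    return sorted(hits)


read = "acctagattcagaggtcaccataggcacatgcag"   # read from the slides
reference = "tt" + read + "cc"                # hypothetical reference
one_mismatch = read[:5] + "t" + read[6:]      # mismatch in the left half
print(find_with_one_mismatch(reference, one_mismatch))
```

A read with one mismatch in each half is missed by this scheme, which is why allowing more than one mismatch calls for the more elaborate phased approach described above.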

Bowtie speed

[Bar chart: millions of reads per CPU hour, for alignment of 8.84 million Solexa reads from the 1000 Genomes pilot. Speedup labels on the chart: 268x and 54x]

Bowtie memory requirements

[Bar chart: peak memory usage (megabytes), less is better, for alignment of 8.84 million Solexa reads from the 1000 Genomes pilot]

Bowtie Sensitivity

[Bar chart: percent reads aligned, for alignment of 8.84 million Solexa reads from the 1000 Genomes pilot. Bars labeled Maq, SOAP, Bowtie; values shown: 74.7, 71.6, 75.1]

>90% of reads are aligned by all 3 programs

Bowtie index construction

[Chart: axis labeled maximum allowed memory (GB). Building index for NCBI human reference, build 36, on a 2.4 GHz Opteron with 32GB RAM]

Bowtie index construction

• Can build index for a mammalian genome on a desktop workstation in < 1 day
• Pre-built indices at CBCB:

H. sapiens         2.1 GB
M. musculus        1.8 GB
D. melanogaster    118 MB
S. cerevisiae      12 MB
others…

Acknowledgements

• Assembly with short reads: Dan Sommer, Daniela Puiu, Vincent Lee
• Short-read alignment (Bowtie): Ben Langmead, Cole Trapnell, Mihai Pop
• Funding: NIH R01-LM06845, R01-GM083873
