The Past, Present, and Future of DNA Sequencing


Craig A. Praul, Co-Director, Genomics Core Facility, Huck Institutes of the Life Sciences, Penn State University

A very short history of DNA sequencing

I started from the conviction that, if different DNA species exhibited different biological activities, there should also exist chemically demonstrable differences between deoxyribonucleic acids.

Erwin Chargaff

Milestones

• First isolation of DNA : 1869 (Friedrich Miescher)
• Composition of nucleic acids; tetranucleotide theory : 1909 – 1940 (Phoebus Levene)
• G=C and A=T; however, the G/C and A/T content of different organisms varies : 1950 (Erwin Chargaff)
• G/C content measured by annealing : 1968 (Mandel and Marmur)
• Maxam-Gilbert and Sanger sequencing : 1977
• Next-generation sequencing : 2005

Genomes Sequenced

• Virus – 3222 (Bacteriophage phi X 174, 5386 nt – 1977)
• Bacteria – 2289 (Haemophilus influenzae, 1.8 x 10^6 nt – 1995)
• Eukarya – 168 (S. cerevisiae, 1.2 x 10^7 nt – 1996; H. sapiens, 3 x 10^9 nt – 2001)
• Archaea – 152 (Methanococcus jannaschii, 1.7 x 10^6 nt – 1996)

Next-Generation Sequencing

Liu et al. Journal of Biomedicine and Biotechnology Volume 2012 (2012), Article ID 251364, 11 pages doi:10.1155/2012/251364

Changes in instrument capacity*

ER Mardis. Nature 470 , 198-203 (2011) doi:10.1038/nature09796

Sequencing Cost

Date      Cost per Mb    Cost per Genome
Sep-01    $5,292.39      $95,263,072
Sep-02    $3,413.80      $61,448,422
Oct-03    $2,230.98      $40,157,554
Oct-04    $1,028.85      $18,519,312
Oct-05    $766.73        $13,801,124
Oct-06    $581.92        $10,474,556
Oct-07    $397.09        $7,147,571
Oct-08    $3.81          $342,502
Oct-09    $0.78          $70,333
Oct-10    $0.32          $29,092
Oct-11    $0.09          $7,743
Oct-12    $0.07          $6,618
Jan-13    $0.06          $5,671

Source - NHGRI : http://www.genome.gov/sequencingcosts/
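The size of the drop is easier to judge as fold changes between consecutive time points. The Python sketch below is only an illustration built on the NHGRI cost-per-genome figures listed above; the names `fold_drops`, `largest`, and `total_fold` are made up for this example and are not part of the NHGRI data set. Note that the time points are not evenly spaced (Oct-12 to Jan-13 is only three months).

```python
# Cost per genome in USD, copied from the NHGRI figures above (oldest first).
dates = ["Sep-01", "Sep-02", "Oct-03", "Oct-04", "Oct-05", "Oct-06", "Oct-07",
         "Oct-08", "Oct-09", "Oct-10", "Oct-11", "Oct-12", "Jan-13"]
cost_per_genome = [95_263_072, 61_448_422, 40_157_554, 18_519_312, 13_801_124,
                   10_474_556, 7_147_571, 342_502, 70_333, 29_092, 7_743,
                   6_618, 5_671]


def fold_drops(costs):
    """Fold reduction between each pair of consecutive time points."""
    return [earlier / later for earlier, later in zip(costs, costs[1:])]


drops = fold_drops(cost_per_genome)

# Index of the interval with the steepest drop (hypothetical helper variable).
largest = max(range(len(drops)), key=lambda i: drops[i])

# Overall reduction from the first to the last time point.
total_fold = cost_per_genome[0] / cost_per_genome[-1]

for i, drop in enumerate(drops):
    print(f"{dates[i]} -> {dates[i + 1]}: {drop:6.1f}-fold cheaper")
print(f"Steepest interval: {dates[largest]} -> {dates[largest + 1]}")
print(f"Overall: {total_fold:,.0f}-fold cheaper")
```

On these figures the steepest interval is Oct-07 to Oct-08, at roughly 21-fold, and the overall reduction from Sep-01 to Jan-13 is close to 17,000-fold.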

Central Dogma of Molecular Biology (James Watson version, 1965)

DNA → RNA → Protein

So once we have the genomic DNA sequence of a species, we have all of the information there is?

Really?

No, not really.

Illumina HiSeq and MiSeq

• Massively parallel
  – HiSeq : 150 or 180 million reads per lane
  – MiSeq : 15 million reads per run
• Intermediate read length
  – HiSeq : 100 nt or 150 nt
  – MiSeq : 250 nt
• High total output per run
  – HiSeq : 90 GB or 288 GB
  – MiSeq : 8 GB

Sequencing Types

• Single read
• Paired-end read
• Mate-pair read

Library Types

• Many different library preps : DNA, mate-pair, mRNA, miRNA, ChIP
• Fragmentation
  – DNA : 300 – 500 nt
  – RNA : 150 – 200 nt
• Attachment of appropriate adapters
  – Complex : flow cell binding, F & R sequencing, BC
  – Custom : Avoid if possible
• Removal of dimers/small inserts
• Amplification (or not)

Applications

• de Novo sequencing (genomes, transcriptomes)
• Resequencing (genomes, exomes, custom sequence capture)
• RNA-seq (mRNA, miRNA, degradome)
• ChIP-seq
• Methyl-seq
• RIP-seq
• Amplicon

de Novo Experimental Design

• Estimate of genome size
• Coverage (30 x – 100 x)
• Sequencing type (paired-end or mate-pair)
• Example : 100 MB genome, 30 x coverage, 100 nt x 100 nt paired-end reads
  – (100 MB) x (30 x coverage) = 3 GB
  – 3 GB / (200 nt for each pair of paired-end reads) = 15 million read pairs
• Replicates
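The arithmetic in the example above can be wrapped in a small helper so that other genome sizes and coverage targets are easy to try. This is a minimal Python sketch, not a planning tool: the function name `read_pairs_needed` and its parameters are hypothetical, and it assumes every read is full length and counts toward coverage.

```python
import math


def read_pairs_needed(genome_size_nt, coverage, read_length_nt):
    """Read pairs required to reach a target coverage with paired-end reads.

    Assumes (for illustration only) that both reads of a pair are
    read_length_nt long and that every base sequenced is usable.
    """
    total_bases = genome_size_nt * coverage      # e.g. 100 MB x 30 = 3 GB
    bases_per_pair = 2 * read_length_nt          # e.g. 2 x 100 nt = 200 nt
    return math.ceil(total_bases / bases_per_pair)


# The slide's example: 100 MB genome, 30 x coverage, 100 nt paired-end reads.
pairs_30x = read_pairs_needed(100_000_000, 30, 100)

# Upper end of the 30 x to 100 x coverage range for the same genome.
pairs_100x = read_pairs_needed(100_000_000, 100, 100)

print(f"30 x coverage : {pairs_30x:,} read pairs")
print(f"100 x coverage: {pairs_100x:,} read pairs")
```

For the 100 MB genome this gives 15 million read pairs at 30 x and 50 million at 100 x, so the choice of coverage within the suggested range changes the sequencing requirement more than threefold.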

Resequencing : Sequence Capture

RNA-seq Experimental Design

• Estimate of transcriptome size (1-5% of genome ?)
• Coverage (30 x ?)
  – mRNA or rRNA depleted RNA
  – Relative abundance of transcripts you are interested in
• Sequencing type (single read or paired-end)
  – Simple transcriptome vs. complex transcriptome
  – Splice variants
• Example : 3 GB genome, 100 nt single reads
  – (3 GB genome) x (5% transcriptome) = 150 MB transcriptome
  – (150 MB transcriptome) x (30 x coverage) = 4.5 GB total sequence
  – 4.5 GB / (100 nt for each read) = 45 million reads
• Replicates : Yes!!!!
  – Biological not technical
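Because the transcriptome size is itself an estimate (1-5% of the genome), it can help to run the same calculation at both ends of that range. A 3 GB genome at 5% works out to a 150 MB transcriptome, 4.5 GB of sequence at 30 x, and 45 million 100 nt single reads. The Python sketch below reproduces that and the 1% case; the function name `rnaseq_reads_needed` is hypothetical, and the per-lane figure is the lower of the two HiSeq values quoted earlier in this talk, used here only as an assumption for a rough multiplexing estimate.

```python
def rnaseq_reads_needed(genome_size_nt, transcriptome_percent, coverage,
                        read_length_nt):
    """Single reads required to cover an estimated transcriptome.

    transcriptome_percent is the assumed share of the genome that is
    transcribed, given as a whole-number percentage (illustrative only).
    """
    transcriptome_nt = genome_size_nt * transcriptome_percent // 100
    total_bases = transcriptome_nt * coverage
    # Ceiling division so a partial read still counts as one read.
    return -(-total_bases // read_length_nt)


GENOME_NT = 3_000_000_000          # 3 GB genome from the example
HISEQ_READS_PER_LANE = 150_000_000  # assumption: lower HiSeq per-lane figure

reads_at_5_percent = rnaseq_reads_needed(GENOME_NT, 5, 30, 100)
reads_at_1_percent = rnaseq_reads_needed(GENOME_NT, 1, 30, 100)

# Rough number of samples that fit in one lane at the 5% estimate.
samples_per_lane = HISEQ_READS_PER_LANE // reads_at_5_percent

print(f"5% transcriptome: {reads_at_5_percent:,} reads")
print(f"1% transcriptome: {reads_at_1_percent:,} reads")
print(f"Samples per lane at 5%: {samples_per_lane}")
```

Under these assumptions the requirement ranges from 9 million to 45 million reads per sample, and about three samples at the higher figure would fit in one 150 million read lane, which matters when budgeting for biological replicates.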

ChIP-Seq

Source : http://www.nature.com/nmeth/journal/v4/n8/images/nmeth0807-613-F1.gif

RIP-seq

Source : http://openi.nlm.nih.gov/imgs/rescaled512/3269675_ijms-13-00097f6.png

Methyl-seq

20 different types of base modifications in DNA are known, and there are perhaps 200 modifications of RNA.

Experimental Space: Next-Gen Platform

• PacBio : 0.075 x 10^6 reads/sample, 1000 – 3000 nt
  – Whole transcript
• Roche 454 FLX+ : 0.5 – 1 x 10^6 reads/sample, 800 – 1000 nt
  – Small to medium genome de novo sequencing
  – Long amplicon
  – Transcriptome
• PGM : 1 – 2 x 10^6 reads per sample, 400 nt
  – Small genome de novo
  – Medium amplicon
• MiSeq : 1 – 2 x 10^6 reads per sample, 50 – 250 nt
  – Small genome de novo
  – Small amplicon
• HiSeq : 10 – 100 x 10^6 reads per sample, 50 – 150 nt
  – Counting applications : RNA-seq, ChIP-seq, RIP-seq, Methyl-seq
  – Large genome de novo and resequencing

Experimental Space: The Relevancy of “Classic” Techniques

Differential Gene Expression

• Northern blotting (1977) : 1 probe – 20 samples
• Dot blots (1987) : 100s of probes – 1 sample
• RT-PCR (1992) : 100s of probes – 10 to 100 samples
• Microarrays (1995) : 100,000s of probes – 1 sample
• Next-gen sequencing (2005) : 10 – 100 x 10^6 reads – 1 sample

The Future

• More reads
• Longer reads
• Faster sequencing
• Cheaper sequencing
• New applications