Trends in DNA Sequencing Technology


Previous Lecture: Gene Expression
This Lecture
Next-Generation
DNA Sequencing Technology
Stuart M. Brown, Ph.D
Learning Objectives
• Illumina Sequencing Technology:
– shotgun, amplicon, selected
• Quality Scores
• Data formats
• Barcodes and multiplexing
• Read alignment
• Data storage
• Applications: mutation finding, gene expression, protein-DNA interactions, new genomes, metagenomics, etc.
DNA Sequencing is a rapidly rising
Biomedical Research Technology
Sanger Sequencing: 1975
• DNA sequencing technology is improving at a
phenomenal rate
• The output of the machines has doubled and
the cost has fallen by half in each of the past 5
years (a 2x change in $/base roughly every 6 months)
• This is much faster than the improvement in
computing technology
(Moore’s law = doubling every 2 years)
• The impact on biomedical research is
fundamental rather than incremental
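To make the comparison with Moore's law concrete, here is a minimal sketch of the compound growth implied by the two doubling periods quoted above (about 6 months for sequencing, 2 years for computing). The function name and the 5-year horizon are illustrative choices, not part of the original slides.

```python
def fold_improvement(years, doubling_period_years):
    """Fold change after `years` of steady doubling at the given period."""
    return 2 ** (years / doubling_period_years)

years = 5
sequencing = fold_improvement(years, 0.5)  # doubling roughly every 6 months
computing = fold_improvement(years, 2.0)   # Moore's law: doubling every 2 years
print(f"Over {years} years: sequencing ~{sequencing:.0f}x, computing ~{computing:.1f}x")
```

Under these assumptions the gap between the two curves grows to roughly two orders of magnitude within five years, which is why analysis and storage struggle to keep pace.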
454 Technology
Illumina Sequencing
200-500 million reads per sample
Sequencing by synthesis
http://technology.illumina.com/technology/next-generation-sequencing/paired-end-sequencing_assay.html
Build Sequencing Libraries
• Shotgun method: fragment genome (or RNA), attach linkers
• PCR method: amplify targets; include linkers in PCR primers
• Selection: fragment genome, use known probes to fish out target
regions; “Exome capture”
Barcode Multiplexing
[Dozens, hundreds, even thousands of samples in one lane]
PCR amplicon
RNA-seq workflow
Paired-End
Reads
Key Points for Illumina Tech.
• Massively parallel
• Amplification from single molecules to clusters is just to increase
fluorescent signal strength
• Reads one base at a time from all fragments
• A ‘cycle’ is the separate addition (and wash away) for each of: ‘G’
+ ‘A’ + ‘T’ + ‘C’ reagents
• Paired-ends come from the same cloned molecule
• Barcode indexes have their own primer (and their own data file)
• Limitations are:
– Optics (resolution and sensitivity)
– Image processing (resolve overlapping clusters)
– Size of flowcell (number of clusters)
– Fidelity of reagents (always add base, always block, always remove block)
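Because barcode indexes are read separately from the insert, assigning reads to samples can be sketched as a lookup on the index read. The following is a simplified illustration, not Illumina's demultiplexing software: the barcode sequences, sample names, and the one-mismatch tolerance are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample sheet: index (barcode) sequence -> sample name.
BARCODES = {"ACGTAC": "sample_A", "TGCATG": "sample_B"}

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(index_read, barcodes=BARCODES, max_mismatches=1):
    """Return the single sample whose barcode is within max_mismatches,
    or None if no barcode (or more than one) qualifies."""
    hits = [name for bc, name in barcodes.items()
            if len(bc) == len(index_read)
            and hamming(bc, index_read) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None

def demultiplex(reads):
    """Group insert sequences by sample; `reads` yields (index_read, insert) pairs."""
    bins = defaultdict(list)
    for index_read, insert in reads:
        bins[assign_sample(index_read) or "undetermined"].append(insert)
    return dict(bins)

print(demultiplex([("ACGTAC", "AAA"), ("TGCATC", "CCC"), ("GGGGGG", "TTT")]))
```

Allowing a mismatch only makes sense when the barcodes in a lane differ from each other at several positions; otherwise a single sequencing error could move a read into the wrong sample.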
Current NGS Technology
• Illumina HiSeq 2500:
– 1.5 billion reads per sample
2 X 100 bases = 450 Gbp
– or 400 million 2 x 50 bp reads
in 16 hours (= 30 Gbp)
– Reagent cost ~$8K per run
• Other machine vendors have fallen
behind
• New technologies constantly in
development by many companies
Illumina Runtime QC metrics
NGS File formats
• Raw data from various vendors
• Different quality metrics (with different levels
of empirical validity)
• GenBank formats (SRA)
• Alignments = SAM/BAM
• Genome Browser formats (wig, bed, gff, etc)
• Variants (SNPs, indels, etc)
FASTQ format:
sequence + quality
@SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152
NTCTTTTTCTTTCCTCTTTTGCCAACTTCAGCTAAATAGGAGCTACACTGATTAGGCAGAAACTTGATTAACAGGGCTTAA
GGTAACCTTGTTGTAGGCCGTTTTGTAGCACTCAAAGCAATTGGTACCTCAACTGCAAAAGTCCTTGGCCC
+SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152
+50000222C@@@@@22::::8888898989::::::<<<:<<<<<<:<<<<::<<:::::<<<<<:<:<<<IIIIIGFEE
GGGGGGGII@IGDGBGGGGGGDDIIGIIEGIGG>GGGGGGDGGGGGIIHIIBIIIGIIIHIIIIGII
@SRR350953.6 MENDEL_0047_FC62MN8AAXX:1:1:1686:935 length=152
NATTTTTACTAGTTTATTCTAGAACAGAGCATAAACTACTATTCAATAAACGTATGAAGCACTACTCACCTCCATTAACAT
GACGTTTTTCCCTAATCTGATGGGTCATTATGACCAGAGTATTGCCGCGGTGGAAATGGAGGTGAGTAGTG
+SRR350953.6 MENDEL_0047_FC62MN8AAXX:1:1:1686:935 length=152
#--+83355@@@CC@C22@@C@@CC@@C@@@CC@@@@@@@@@@@@C?C22@@C@:::::@@@@@@C@@@@@@@@CIGIHIIDGI
GIIIIHHIIHGHHIIHHIFIIIIIHIIIIIIBIIIEIFGIIIFGFIBGDGGGGGGFIGDIFGADGAE
@SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
NTGTGATAGGCTTTGTCCATTCTGGAAACTCAATATTACTTGCGAGTCCTCAAAGGTAATTTTTGCTATTGCCAATATTCC
TCAGAGGAAAAAAGATACAATACTATGTTTTATCTAAATTAGCATTAGAAAAAAAATCTTTCATTAGGTGT
+SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
#.,')-2-/@@@@@@@@@@<:<<:778789979888889:::::99999<<::<:::::<<<<<@@@@@::::::IHIGIGGGGGGDGG
DGGDDDIHIHIIIII8GGGGGIIHHIIIGIIGIBIGIIIIEIHGGFIHHIIIIIIIGIIFIG
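The records above follow a four-part layout: an `@` header, the bases, a `+` separator line, and one quality character per base. A minimal reader might look like the sketch below. It assumes each record occupies exactly four unwrapped lines (the long lines shown above are wrapped for display), and the short demo record is made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from unwrapped 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:          # end of file
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual

# Made-up demo record: 10 bases and 10 quality characters.
example = io.StringIO("@read1 demo\nNTCTTTTTCT\n+\n#50000222C\n")
for read_id, seq, qual in read_fastq(example):
    print(read_id, len(seq), qual)
```

Note that `@` and `+` can also appear as quality characters, which is why this sketch counts lines rather than searching for record markers.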
Illumina Quality Scores
• Based on the Phred scoring system
– Probability of error
– Log probability Q = –10 log10(P)
• Q30 = 0.001 (a 1/1000 chance of error)
• Q20 = 0.01 (a 1/100 chance of error)
• Based on empirical properties of the data (intensity of cluster, signal-to-noise ratio), combined with observations of actual error rates for known
standard samples.
• The calculation method is essentially arbitrary (by Illumina techs), and
changes with every iteration of software, chemistry, and hardware on the
sequencing machine.
– Some software (GATK) recalibrates Illumina Q scores to be more realistic.
• Q score is converted to an ASCII character (so each score takes a single character) in the
FASTQ file
• Q scores currently use more data storage space (8 bits) than the bases (2
bits). This will change, eventually.
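The relationship Q = -10 log10(P) and the single-character encoding can be sketched as follows. The offset of 33 is an assumption (the Sanger / Illumina 1.8+ convention); older Illumina pipelines used 64. The FASTQ example earlier contains characters such as `#` and `+`, which only make sense with an offset of 33.

```python
import math

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding

def error_prob_to_q(p):
    """Phred quality score for an error probability p."""
    return -10 * math.log10(p)

def q_to_error_prob(q):
    """Error probability for a Phred quality score q."""
    return 10 ** (-q / 10)

def char_to_q(ch, offset=PHRED_OFFSET):
    """Decode one FASTQ quality character to its integer Q score."""
    return ord(ch) - offset

def q_to_char(q, offset=PHRED_OFFSET):
    """Encode an integer Q score as one FASTQ quality character."""
    return chr(q + offset)

for ch in "#5@I":
    q = char_to_q(ch)
    print(ch, q, f"{q_to_error_prob(q):.5f}")
```

With this offset, `I` decodes to Q40 and `#` to Q2, so the leading `#` characters in the reads above mark the least trustworthy bases.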
Preliminary QC (pre-alignment)
from FASTQC
NGS Applications
Personalized Medicine
Designing medicine that is custom-tailored to a patient’s genetic profile
Cancer Genomics
What turns a normal cell into a cancer cell?
Epidemiology/Forensics
Pathogen outbreaks, criminal cases or NSA
Metagenomics
The human body is home to an enormous number and diversity of microbes.
Crop Improvement
Sequencing of important crop genomes to obtain desirable traits that can increase the
world food supply and ultimately impact human health
Lower Cost = More Innovation
• As sequencing becomes cheaper, more
investigators can use it for routine assays
• Leads to variations and absolutely novel
applications
Lower Cost = More samples
• More patients in GWAS studies
• More replicates (or the use of some replicates
and statistical approaches) in all other assays
More Investigators = Less Informatics skill
• Sequencing is a readout for many different
types of laboratory experiments
• Clinical and basic science investigators from all
areas of biology can make use of this
technology
• Many are completely naïve about
bioinformatics
• Informatics tools for NGS are very challenging
Bioinformatics is the Bottleneck
• Sequencing is a commodity – can easily be
outsourced
• Bioinformatics is the essential point of the
science
• Data analysis and discovery of meaning in raw
results
• As the data throughput increases, the cost and
time spent on analysis increase more than
linearly
Challenging Bioinformatics Environment
• Very rapid change in technology platform
– New file formats, new data types
– Different “standards” from different vendors
• Very rapid evolution of new methods
• Very rapid ‘release’ of methods as
‘software’ via unsupported open source
distribution
• Large data sizes (both experimental and
reference)
Data Analysis Pipeline
Images
Intensities
Reads
Alignments
“The Illumina Genome Analyzer Pipeline software is a highly
customizable analysis engine capable of taking the raw image data
generated by the Genome Analyzer and producing intensity scores,
base calls and quality metrics, and quality scored alignments. It was
developed in collaboration with many of the world’s leading
sequencing centers and is scalable to meet the needs of even the
most prodigious facilities.”
Zuojian Tang
Stein Genome Biology 2010 11:207
NYULMC High Performance Computing Facility
[Diagram of sequencing and computing infrastructure. Labels:]
• Sequencers: HiSeq 2000, GA IIx, Roche/454 FLX
• Primary Data Storage Cluster (200 TB); Back up Data Storage Cluster (200 TB)
• Asclepius HPC Cluster: 2 FENs, 60 CNs, 5 GPU Nodes, 4 High Mem Nodes, 1 FPGA; Cluster NAS
• Internal Cloud; LIMS; Gbrowse
• Analyses: genome alignment, gene expression, variant detection, etc.
• Groups: OCS / Genome Technology Center; CHIBI / Sequencing Informatics
Alignment
• Most NGS experiments involve sequence
reads collected from known genomes: gene
expression, variants, ChIP-seq, etc.
• First step in data analysis is to align reads to
the reference genome.
• This is computationally demanding due to the
huge volume of NGS data and the large size of
the reference genome (human = 3.2 Gb)
Alignment
– Rapidly map many millions of short reads onto the
genome (must be much faster than BLAST)
– Finds perfect, 1, and 2 mismatch alignments; small
indels
– Aligns 80-90% of PF reads to human/mouse genome
– Mapping problems – many reads map to multiple
locations on the genome.
• What to do if there are many equally good matches –
randomly pick one?
• If there is one “best” match is it important if there are 2 (or
50) other matches that are almost as good?
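The mapping problem can be illustrated with a toy seed-and-verify mapper: index every k-mer of the reference, look up non-overlapping seeds from the read, then count mismatches at each candidate position. This is only a sketch of the general idea. It is not the algorithm used by any production aligner (those typically rely on compressed indexes and also handle indels), and the function names and the tiny reference string are hypothetical.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=2):
    """Return sorted (position, mismatches) pairs for ungapped hits.

    Non-overlapping seeds are looked up exactly; each candidate position
    is then verified by counting mismatches over the whole read.
    """
    hits = {}
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())

reference = "GATTACAGATTACACCGT"   # toy reference containing a repeat
index = build_index(reference, 3)
print(map_read("GATTACA", reference, index, 3))  # repeat: maps to two places
print(map_read("ACACCGT", reference, index, 3))  # unique placement
```

A read that returns more than one equally good hit is exactly the multi-mapping case raised above; what to do with it (pick one at random, report all, or discard) is a policy decision rather than something the mapper can settle.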
NGS Alignment
• Pre-NGS alignment algorithms such as Smith-Waterman and BLAST are both too slow and too
sensitive (able to align sequences with many
mismatches)
• NGS reads should be near-perfect matches to the
reference genome.
• Need to process hundreds of millions of reads in
a few hours.
– time is critical because alignment is used for QC of
NGS experiments, lab may be waiting before running
next sample
SAM format:
aligned to reference genome
MCL-SRR350952.1 99 chr13 28330526 29 76M = 28330636 183
NAAAGACACAGTTACATGAAGAACATACTCCTCTCTCAGACTGCCCAGGTTCAGTGATTCATTCAACAAACTTTAT
#,())*3--.@@@@@@@@@@<<<::87999<<<<<@@@@@@@:@@<:<<<<<8<<::7::::::::@@@22::::@
XT:A:U NM:i:1 SM:i:29 AM:i:29 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:0T75 XA:Z:chr13,+28330526,76M,1;
MCL-SRR350952.1 147 chr13 28330636 29 73M3S = 28330526 -183
TATAGGATTCAACTGTGAGAAAGACATATTAATCTCTTCCATTGTGCAGACTACATTCTTTTTTTTTTTTTTTGAG
#####################B7:?2,?8;;@+=+A@3DEBB?2?5B7=A?=?4<8;;AIGIDIHFAIIGHIIBII
XT:A:M NM:i:2 SM:i:29 AM:i:29 XM:i:2 XO:i:0 XG:i:0 MD:Z:18A18G35
MCL-SRR350952.2 99 chr9 16437227 60 76M = 16437304 153
NGAAATGCAAGGCTGTTTGGGATGTTTTCGAAGTGATGAATGCTGGAAGGATTGCTGTTCTCTAAGTGAGCAAGGA
############################################################################ XT:A:U
NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:0A75 XA:Z:chr9,+16437227,76M,1;
MCL-SRR350952.2 147 chr9 16437304 60 76M = 16437227 -153
GCTGGGACTCCTGGTGCGATTATTGCTCTCAATGAAAGTCCTTATATCTGAGTCTGTCTTTGAAGATGGTACAGCC
DBAEA<>G>GGDGG<DD3E<CDCC>E?D?E@DGDDDG8BGGGEGEG@E@@BCCFEE,IHIIGEGGEFCCF<FFFFD
XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 XA:Z:chr9,-16437304,76M,0;
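Each SAM record is a single tab-separated line with eleven mandatory columns followed by optional tags, and the second column (FLAG) is a bit field. The sketch below splits a record and decodes flags such as the 99 and 147 seen above. The flag names are shorthand labels chosen here, and the demo line uses a shortened, made-up SEQ and QUAL for readability.

```python
# Bit meanings follow the SAM specification; the label strings are shorthand.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus raw tags."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = fields[11:]
    return record

# Demo line: real coordinates from the first record, made-up short SEQ/QUAL.
demo = "\t".join(["MCL-SRR350952.1", "99", "chr13", "28330526", "29", "76M",
                  "=", "28330636", "183", "ACGT", "IIII", "NM:i:1"])
rec = parse_sam_line(demo)
print(rec["RNAME"], rec["POS"], decode_flag(rec["FLAG"]))
print(decode_flag(147))
```

Decoded this way, 99 and 147 describe the two mates of a properly paired fragment: the first read on the forward strand and the second read on the reverse strand.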
NGS Data Output
Whole Human Genome Sequencing Requires ~30x Coverage
3.2 Gb x 30 = ~100 Gb
Why so much?
• Uneven Coverage - Poisson distribution of small DNA reads
• Sequencing Errors - machine/chemistry (~1% → 30 Mbp)
• Systematic Biases - some regions are harder to sequence
• Alignment Problems - gaps, repeats, etc.
• Quality Factor - additional data/metadata
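The coverage arithmetic and the Poisson argument for uneven coverage can be sketched as follows. The model assumes reads land uniformly at random, which the biases listed above violate, so real coverage is more uneven than this suggests. The 100-base read length and the depth cutoff of 10 are illustrative assumptions.

```python
import math

def reads_needed(genome_size, coverage, read_length):
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(genome_size * coverage / read_length)

def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson variable with mean lam."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i)
               for i in range(k + 1))

genome = 3_200_000_000          # human genome, 3.2 Gb
print("reads for 30x at 100 bases/read:", reads_needed(genome, 30, 100))

# Expected fraction of positions with depth below 10 at several mean depths.
for mean_depth in (10, 20, 30):
    frac = poisson_cdf(9, mean_depth)
    print(f"mean {mean_depth}x: fraction below 10x = {frac:.2e}")
```

Even under this idealized model, a modest mean depth leaves a noticeable fraction of positions too thinly covered for confident variant calls, which is part of the motivation for sequencing to ~30x.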
ChIP-seq
[Figure: ChIP-seq workflow. Panel labels: Immunoprecipitate; Release DNA; High-throughput sequencing; Map sequence tags to genome]
ChIP-seq Challenges
• We want to find the peaks (enriched regions =
protein binding sites on genome)
• Goals include: accuracy (location of peak on
genome), sensitivity, & reproducibility
• Challenges: non-random background, PCR
artifacts, difficult to estimate false negatives
• Not truly quantitative, yet we want to find
differences due to experimental factors
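As a rough illustration of what "finding peaks" means, the sketch below counts mapped tags in fixed bins and flags bins whose count is improbably high under a uniform Poisson background. This is deliberately naive: as noted above, real background is non-random, so practical peak callers compare against a control sample or a local background estimate. The bin size, threshold, and function names are hypothetical.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for a Poisson variable with mean lam."""
    return 1.0 - sum(math.exp(-lam) * lam ** i / math.factorial(i)
                     for i in range(k))

def call_peaks(tag_positions, genome_length, bin_size=100, p_threshold=1e-5):
    """Return (start, end, count) for bins enriched over a uniform background."""
    n_bins = math.ceil(genome_length / bin_size)
    counts = [0] * n_bins
    for pos in tag_positions:
        counts[pos // bin_size] += 1
    lam = len(tag_positions) / n_bins   # uniform background: an oversimplification
    return [(i * bin_size, (i + 1) * bin_size, c)
            for i, c in enumerate(counts)
            if c > 0 and poisson_sf(c, lam) < p_threshold]

# Toy data: 20 tags piled up at 500-519 plus 10 scattered background tags.
tags = list(range(500, 520)) + [i * 1000 + 50 for i in range(10)]
print(call_peaks(tags, genome_length=10_000))
```

A pile-up created by PCR duplicates would look identical to a binding site in this sketch, which is one reason duplicate handling and controls matter in practice.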
RNA-seq Gene Expression
RNA-seq informatics
• Filter out rRNA
• Align to genome – with intron splicing
• Compute differential expression
• Alternatively spliced exons
• Sequence variants (SNPs, indels)
• Gene fusions/translocations
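The "compute differential expression" step starts from per-gene read counts. A minimal sketch of the first-pass arithmetic is to scale counts to counts per million (CPM) so libraries of different depth are comparable, then take a log2 ratio. The gene names, counts, and pseudocount here are made up, and this sketch performs no statistical test; real analyses model variability across replicates.

```python
import math

def counts_per_million(counts):
    """Scale raw per-gene read counts by library size."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}

def log2_fold_change(control, treated, pseudocount=1.0):
    """Per-gene log2(treated / control) on CPM values.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    cpm_c = counts_per_million(control)
    cpm_t = counts_per_million(treated)
    return {g: math.log2((cpm_t[g] + pseudocount) / (cpm_c[g] + pseudocount))
            for g in cpm_c}

# Hypothetical counts: the treated library is twice as deep overall.
control = {"geneA": 100, "geneB": 900, "geneC": 1000}
treated = {"geneA": 800, "geneB": 1200, "geneC": 2000}
for gene, lfc in log2_fold_change(control, treated).items():
    print(f"{gene}: log2FC = {lfc:+.2f}")
```

In this toy example geneC has twice the raw count in the treated sample but no change after normalization, which shows why raw counts cannot be compared directly between libraries.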
de novo Assembly of Bacterial Genomes
Bacterial genome = 1 to 5 million bases.
Output of one HiSeq lane = 10 billion bases.
One lane ($1000) gives 100x coverage of 100 bacterial genomes.
Assembly software: Velvet and ABySS for fragments, Mauve for contigs.
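The lane arithmetic above can be checked with a short calculation. It assumes the lane's output is split evenly across the multiplexed genomes, which barcoded pools only approximate; the function name is illustrative.

```python
def coverage_per_genome(lane_bases, n_genomes, genome_size):
    """Mean coverage when a lane's output is split evenly across genomes."""
    return lane_bases / n_genomes / genome_size

lane_bases = 10_000_000_000     # 10 billion bases per HiSeq lane (from the slide)
for genome_size in (1_000_000, 5_000_000):
    cov = coverage_per_genome(lane_bases, 100, genome_size)
    print(f"{genome_size / 1e6:.0f} Mb genome: {cov:.0f}x")
```

By this estimate the 100x figure applies to genomes near 1 million bases; at the 5 million base end of the range, 100 genomes per lane would each receive closer to 20x.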
Summary
• Illumina Sequencing Technology:
– shotgun, amplicon, selected
• Quality Scores
• Data formats
• Barcodes and multiplexing
• Read alignment
• Data storage
• Applications: mutation finding, gene expression, protein-DNA interactions, new genomes, metagenomics, etc.
Next Lecture: NGS Alignment & ChIP-seq