Next-generation sequencing
– the informatics angle
Gabor T. Marth
Boston College Biology Department
AGBT 2008
Marco Island, FL. February 6, 2008
T1. Roche / 454 FLX system
• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads
• tested in many published applications
• supports paired-end read protocols with up to 10kb
separation size
T2. Illumina / Solexa Genome Analyzer
• fixed-length short-read sequencer
• read properties are very close to those of traditional capillary
sequences
• very low INDEL error rate
• tested in many published applications
• paired-end read protocols support short (<600bp)
separation
T3. AB / SOLiD system
• fixed-length short-read sequencer
• employs a 2-base encoding system
that can be used for error reduction
and improving SNP calling accuracy
• requires color-space informatics
• published applications underway /
in review
• paired-end read protocols support
up to 10kb separation size
2-base encoding (rows: 1st Base, columns: 2nd Base)
          A   C   G   T
    A     0   1   2   3
    C     1   0   3   2
    G     2   3   0   1
    T     3   2   1   0
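The 2-base encoding can be sketched in a few lines. This is an illustration only, not vendor software: the function names `encode` and `decode` are made up for this example, and the color table is copied from the slide. It shows why the encoding helps SNP calling: a true single-base change alters two adjacent colors, so an isolated single-color difference is more likely a sequencing error.

```python
# Color assigned to each (1st base, 2nd base) pair, as in the slide's table.
COLOR_TABLE = {
    "A": {"A": 0, "C": 1, "G": 2, "T": 3},
    "C": {"A": 1, "C": 0, "G": 3, "T": 2},
    "G": {"A": 2, "C": 3, "G": 0, "T": 1},
    "T": {"A": 3, "C": 2, "G": 1, "T": 0},
}


def encode(seq):
    """Translate a base sequence into colors, one per adjacent base pair."""
    return [COLOR_TABLE[a][b] for a, b in zip(seq, seq[1:])]


def decode(first_base, colors):
    """Rebuild the base sequence from a known first base and the colors."""
    seq = [first_base]
    for color in colors:
        previous = seq[-1]
        # Exactly one 2nd base matches each color for a given 1st base.
        seq.append(next(b for b, c in COLOR_TABLE[previous].items() if c == color))
    return "".join(seq)


reference = encode("ACGTAC")  # [1, 3, 1, 3, 1]
variant = encode("ACTTAC")    # same sequence with one G->T substitution
changed = [i for i, (r, v) in enumerate(zip(reference, variant)) if r != v]
print(changed)  # [1, 2]: the single-base change touches two adjacent colors
```

Decoding depends on every preceding color, so one color error corrupts all downstream bases if reads are translated naively; this is one reason alignment is usually done in color space.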
T4. Helicos / Heliscope system
• experimental short-read
sequencer system
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing
A1. Variation discovery: SNPs and short-INDELs
1. sequence alignment
2. dealing with non-unique mapping
3. looking for allelic differences
A2. Structural variation detection
• structural variations (deletions, insertions, inversions and translocations) from
paired-end read map locations
• copy number (for amplifications, deletions) from depth of read coverage
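A minimal sketch of the depth-of-coverage idea, under simplifying assumptions: reads are represented only by their start positions, the genome is cut into fixed windows, and the median window is taken to be copy-neutral. The function names and the toy data are hypothetical.

```python
from statistics import median


def reads_per_window(read_starts, genome_length, window):
    """Count read start positions falling into each fixed-size window."""
    counts = [0] * (genome_length // window)
    for start in read_starts:
        index = start // window
        if index < len(counts):
            counts[index] += 1
    return counts


def copy_ratio(counts):
    """Depth of each window relative to the median window (assumed copy-neutral)."""
    baseline = median(counts)
    return [c / baseline for c in counts]


# Toy data: uniform coverage, one amplified window, one half-depleted window.
starts = list(range(0, 500, 10))
starts += list(range(200, 300, 10))                              # extra copy in window 2
starts = [s for s in starts if not (s >= 400 and s % 20 == 10)]  # lost copy in window 4

print(copy_ratio(reads_per_window(starts, genome_length=500, window=100)))
```

Ratios near 2.0 or 0.5 suggest an amplification or a deletion; real data needs correction for the coverage biases discussed under C6 before such ratios can be trusted.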
A3. Identification of protein-bound DNA
[Figure: aligned reads piled up along the genome sequence]
• chromatin structure (CHIP-SEQ) (Mikkelsen et al. Nature 2007)
• transcription factor binding sites (Robertson et al. Nature Methods, 2007)
A4. Novel transcript discovery (genes)
• novel genes / exons
• novel transcripts in known genes
[Figure: reads aligned to inferred exons 1 and 2, and to known exons 1 and 2]
A5. Novel transcript discovery (miRNAs)
Ruby et al. Cell, 2006
A6. Expression profiling by tag counting
[Figure: aligned reads counted over each gene]
Jones-Rhoades et al. PLoS Genetics, 2007
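Tag counting reduces to counting aligned reads per annotated gene. The sketch below assumes reads are given as single alignment positions and genes as half-open intervals; the gene names and coordinates are invented for illustration.

```python
def count_tags(read_positions, genes):
    """Count aligned reads falling inside each gene interval [start, end)."""
    counts = {name: 0 for name in genes}
    for position in read_positions:
        for name, (start, end) in genes.items():
            if start <= position < end:
                counts[name] += 1
    return counts


# Hypothetical annotation and read positions.
genes = {"geneA": (100, 200), "geneB": (300, 400)}
reads = [50, 110, 150, 199, 200, 305, 350]

print(count_tags(reads, genes))
```

Raw counts are only comparable between genes or samples after accounting for gene length and total read number, which this sketch leaves out.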
A7. De novo organismal genome sequencing
Lander et al. Nature 2001
[Figure: short reads, read pairs and longer reads combined into assembled sequence contigs]
C1. Read length
[Figure: bar chart of read length [bp], axis 0 to 300; bars labeled 20-35 (var), 25-35 (fixed), 25-40 (fixed), ~250 (var)]
When does read length matter?
• short reads are often sufficient where the entire read length can be used for mapping:
    SNPs, short-INDELs, SVs
    CHIP-SEQ
    short RNA discovery
    counting (mRNA, miRNA)
• longer reads are needed where one must use parts of reads for mapping:
    de novo sequencing
    novel transcript discovery
[Figure: overlapping reads aacttagacttaca and gacttacatacgta; read accgattactatacta spanning Known exon 1 and Known exon 2]
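Why partial-read matching needs longer reads can be seen from the two overlapping reads on this slide. The sketch below finds the longest suffix-prefix overlap by brute force; `longest_overlap` is a hypothetical helper, not an assembler.

```python
def longest_overlap(a, b, min_len=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


read1 = "aacttagacttaca"
read2 = "gacttacatacgta"

k = longest_overlap(read1, read2)
merged = read1 + read2[k:]
print(k, merged)
```

Only part of each read takes part in the overlap, so the shorter the reads, the shorter (and less unique) the overlaps available for joining them.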
C2. Read error rate
• error rate typically 0.4 - 1%
• error rate dictates how many errors the aligner should tolerate
• the more errors the aligner must tolerate, the lower the fraction of the
reads that can be uniquely aligned
• in applications where, in addition, specific alleles are essential, the error
rate is even more important
[Figure: fraction of genome (0.00 to 0.40) vs. number of mismatches allowed (0 to 2)]
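How many mismatches the aligner should tolerate can be estimated with a binomial model. This is a rough sketch that assumes independent errors at a uniform per-base rate (C3 shows the rate actually grows along the read); the read length of 35 and the 1% rate are example values taken from the ranges on these slides.

```python
from math import comb


def prob_at_most_k_errors(read_length, error_rate, k):
    """Binomial probability that a read carries k or fewer sequencing errors."""
    return sum(
        comb(read_length, i) * error_rate**i * (1 - error_rate) ** (read_length - i)
        for i in range(k + 1)
    )


for allowed in range(3):
    p = prob_at_most_k_errors(read_length=35, error_rate=0.01, k=allowed)
    print(f"mismatches allowed = {allowed}: {p:.3f} of reads can still align")
```

Under these assumptions roughly 30% of reads contain at least one error, so an aligner that allows no mismatches would discard a large share of the data, while allowing two mismatches keeps nearly all reads at the cost of fewer unique placements.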
C3. Error rate grows with each cycle
• this phenomenon limits useful read length
[Figure: error rate (0% to 10%) and measured QV (0 to 40) vs. position on read (0 to 40)]
C4. Substitutions vs. INDEL errors
• SNP discovery may require higher
coverage for allele confirmation
• INDELs can be discovered with
very high confidence!
• gapped alignment necessary
• good SNP discovery accuracy
• short-INDEL discovery difficult
C5. Quality values are important for allele calling
• PHRED base quality values represent the estimated likelihood of sequencing
error and help us pick out true alternate alleles
• inaccurate or poorly calibrated base quality values hinder allele calling
Q-values should be accurate … and high!
Quality values should be well-calibrated
• the assigned base quality value should match the actual (measured) base
quality in every sequencing cycle
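A PHRED quality value Q corresponds to an error probability P through Q = -10 log10(P). The sketch below applies that definition and checks calibration by comparing assigned values with the quality observed from known errors. The `calls` data is invented, and `observed_quality` is a hypothetical helper name.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Error probability implied by a PHRED quality value."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """PHRED quality value for an error probability."""
    return -10 * math.log10(p)


def observed_quality(calls):
    """Measured PHRED quality per assigned quality value.

    `calls` is an iterable of (assigned_q, is_error) pairs, e.g. obtained by
    comparing reads against a known reference. Bins without errors are skipped.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for assigned_q, is_error in calls:
        totals[assigned_q] += 1
        errors[assigned_q] += int(is_error)
    return {
        q: error_prob_to_phred(errors[q] / totals[q])
        for q in totals
        if errors[q] > 0
    }


# Invented example: both bins show 10 errors per 1000 bases.
calls = [(20, i < 10) for i in range(1000)] + [(30, i < 10) for i in range(1000)]
for assigned, measured in sorted(observed_quality(calls).items()):
    print(f"assigned Q{assigned}: measured Q{measured:.1f}")
```

In this toy data the Q20 bin is well calibrated, while the Q30 bin is overconfident by 10 units; doing the same comparison per sequencing cycle gives the cycle-by-cycle calibration the slide asks for.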
C6. Representational biases / library complexity
[Figure: sources of bias (fragmentation biases, PCR amplification biases, sequencing biases) leading to regions of low/no representation and regions of high representation]
Dispersal of read coverage
• this affects variation discovery (deeper starting read coverage is needed)
• it has a major impact on counting applications
Amplification errors
• an early amplification error gets propagated onto every clonal copy
• many reads come from clonal copies of a single fragment
• early PCR errors in “clonal” read copies lead to false positive allele calls
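One common way to limit this effect is to collapse likely clonal copies before counting allele support. The sketch below is a heuristic under an assumption not stated on the slide: reads sharing the same start position and strand are treated as copies of one fragment. The read records and field names are invented.

```python
def collapse_clonal_reads(reads):
    """Keep one read per (start, strand), treating the rest as clonal copies."""
    seen = set()
    kept = []
    for read in reads:
        key = (read["start"], read["strand"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


def allele_support(reads, allele):
    """Number of reads showing the given allele at the site of interest."""
    return sum(1 for read in reads if read["allele"] == allele)


# Five clonal copies carry an early PCR error ("T"); three independent reads show "C".
reads = [{"start": 100, "strand": "+", "allele": "T"} for _ in range(5)]
reads += [{"start": s, "strand": "+", "allele": "C"} for s in (90, 95, 104)]

unique_reads = collapse_clonal_reads(reads)
print(allele_support(reads, "T"), "of", len(reads), "reads support T before collapsing")
print(allele_support(unique_reads, "T"), "of", len(unique_reads), "after collapsing")
```

Before collapsing, the erroneous allele looks like the majority; afterwards it is supported by a single fragment. The heuristic also discards independent reads that happen to start at the same position, which matters at high coverage.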
C7. Paired-end reads
• fragment amplification:
fragment length 100 - 600 bp
• fragment length limited by
amplification efficiency
• circularization: 500bp - 10kb (sweet spot ~3kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
• paired-end reads can improve read mapping accuracy (if unique map positions
are required for both ends) or efficiency (if the fragment length constraint is
used to rescue non-uniquely mapping ends)
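The rescue case can be sketched as follows: one end maps uniquely, the other has several candidate positions, and the fragment length constraint picks the candidate that fits. This is a simplification that ignores read orientation and read length; `rescue_mate` and the coordinates are hypothetical.

```python
def rescue_mate(unique_pos, candidate_positions, min_frag, max_frag):
    """Pick the mate position consistent with the expected fragment length.

    Returns the single candidate whose distance from the uniquely mapped end
    lies within [min_frag, max_frag], or None if zero or several candidates fit.
    """
    consistent = [
        pos
        for pos in candidate_positions
        if min_frag <= abs(pos - unique_pos) <= max_frag
    ]
    return consistent[0] if len(consistent) == 1 else None


# One end maps uniquely at 10,000; the other end hits three repeat copies.
print(rescue_mate(10_000, [10_250, 500_000, 1_250_000], min_frag=100, max_frag=600))
```

A wider allowed fragment length range rescues more pairs but also leaves more cases where several candidates fit and the end stays ambiguous.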
Paired-end reads for SV discovery
• longer fragments increase the
chance of spanning SV breakpoints
and/or entire events
• longer fragments tend to have wider
fragment length distributions
• SV breakpoint detection sensitivity &
resolution depend on the width of the
fragment length distribution (most 2kb
deletions would be detected at 10% std
but missed at 30% std)
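The 10% vs. 30% comparison can be reproduced with a back-of-the-envelope calculation. The sketch assumes a 3kb mean fragment length (the "sweet spot" quoted for circularization), a roughly normal fragment length distribution, and a detection cutoff of 3 standard deviations for a single read pair; the cutoff is an assumption, not a value from the slide.

```python
def deletion_z_score(deletion_size, mean_fragment, relative_std):
    """How many standard deviations a deletion stretches the mapped distance.

    A read pair spanning a deletion maps `deletion_size` further apart on the
    reference than the expected fragment length.
    """
    return deletion_size / (mean_fragment * relative_std)


CUTOFF = 3.0  # assumed detection threshold, in standard deviations

for relative_std in (0.10, 0.30):
    z = deletion_z_score(deletion_size=2000, mean_fragment=3000, relative_std=relative_std)
    print(f"std = {relative_std:.0%}: z = {z:.1f}, detected = {z > CUTOFF}")
```

With a 10% standard deviation a 2kb deletion stands far outside the normal range, while at 30% it falls inside it, in line with the slide's statement. Requiring several supporting pairs would change the exact cutoff but not the trend.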
C8. Technologies / properties / applications
Technology                      Roche/454               Illumina/Solexa    AB/SOLiD

Read properties
  Read length                   250bp                   20-40bp            25-35bp
  Error rate                    <0.5%                   <1.0%              <0.5%
  Dominant error type           INDEL                   SUB                SUB
  Paired-end reads available    yes                     yes                yes
  Paired-end separation         < 10kb (3kb optimal)    100 - 600bp        500bp - 10kb (3kb optimal)

Applications
  SNP discovery,
  short-INDEL discovery         ○ ● ● ● ○ (one symbol missing; column assignment for these two rows not recoverable)
  SV discovery                  ○                       ○                  ●
  CHIP-SEQ                      ○                       ●                  ●
  small RNA/gene discovery      ○                       ●                  ●
  mRNA Xcript discovery         ●                       ○                  ○
  Expression profiling          ○                       ●                  ●
  De novo sequencing            ●                       ?                  ?
Thanks
Michael Egholm
Clive Brown
David Bentley
Elaine Mardis
Francisco de la Vega
Kristen Stoops
Ed Thayer
MOSAIK talk Thursday, 7:40PM
Michael Stromberg
Michele Busby
Aaron Quinlan
Eric Tsung
Derek Barnett
Chip Stewart
Damien Croteau-Chonka
Weichun Huang
http://bioinformatics.bc.edu/marthlab