RNA seq slides


Introduction to
Next-Generation Sequencing
KIHOON YOON, PH.D.
DEPT OF EPIDEMIOLOGY & BIOSTATISTICS
SCHOOL OF MEDICINE
UNIVERSITY OF TEXAS HEALTH SCIENCE
CENTER AT SAN ANTONIO
Outline
 Sequencing technologies
 Applications
 Bioinformatics tools for short-read sequencing
 Examples of Applications: ChIP-Seq /RNA-Seq
Sequencing technologies
 Next-next-…-generation: how many ‘next’s are there?
 First Generation: automated version of Sanger sequencing (DNA-sequencing
method invented by Fred Sanger in the 1970s)
   Takes about 500 days to read one gigabase (Gb, one billion bases; about 1/3 of the human genome)
   1000 bases per read / cost is high: $0.50 per 1000 bases
 Second Generation
   Roche/454 sequencing machine from 454 Life Sciences (2005)
       450 bases per read / $0.02 per 1000 bases / 2 days per Gb
   Solexa from Illumina (2006)
       75 bases per read / $0.001 per 1000 bases / 0.5 days per Gb
   SOLiD from Applied Biosystems (2006)
       50 bases per read / $0.001 per 1000 bases / 0.5 days per Gb
 Next-Next-Gen – Third Generation?
   HiSeq2000 from Illumina – 0.04 days per Gb
   Helicos HeliScope™ (www.helicosbio.com)
   Pacific Biosciences SMRT (www.pacificbiosciences.com)
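To put the per-platform figures above on a common scale, the short sketch below converts the quoted price per 1000 bases into a price per gigabase and estimates the instrument time for one pass over a human genome. The 3 Gb genome size is an assumption (it follows from "1 Gb is about 1/3 of the human genome"), and the function names are illustrative only.

```python
# Figures quoted on the slide: (bases per read, USD per 1000 bases, days per Gb).
PLATFORMS = {
    "Sanger (first generation)": (1000, 0.50, 500.0),
    "Roche/454": (450, 0.02, 2.0),
    "Solexa (Illumina)": (75, 0.001, 0.5),
    "SOLiD": (50, 0.001, 0.5),
}

GIGABASE = 1_000_000_000
HUMAN_GENOME_GB = 3.0  # assumption: roughly 3 Gb


def cost_per_gb(usd_per_kb: float) -> float:
    """Convert a price per 1000 bases into a price per gigabase."""
    return usd_per_kb * GIGABASE / 1000


def reads_per_gb(bases_per_read: int) -> float:
    """Number of reads of the given length needed to total one gigabase."""
    return GIGABASE / bases_per_read


def days_per_genome(days_per_gb: float, coverage: float = 1.0) -> float:
    """Instrument time to sequence the assumed human genome at the given coverage."""
    return days_per_gb * HUMAN_GENOME_GB * coverage


if __name__ == "__main__":
    for name, (read_len, usd_per_kb, days) in PLATFORMS.items():
        print(
            f"{name}: ${cost_per_gb(usd_per_kb):,.0f} per Gb, "
            f"{reads_per_gb(read_len):,.0f} reads per Gb, "
            f"{days_per_genome(days):,.1f} days per 1x human genome"
        )
```

Under these numbers, first-generation sequencing works out to $500,000 per Gb, which is why the drop to about $1,000 per Gb on the short-read platforms changed what experiments were practical.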
First vs Second Generation
Figure 1 from Shendure & Ji, 2008
Second Generation Sequencing
Figure 2 from Shendure & Ji, 2008 (panels: 454, SOLiD; Solexa)
NGS
 A typical procedure:
  1. Sequencing
       How deep?
  2. Alignment
       To a reference, by assembly, or both
  3. Experiment-specific analysis
       A ‘one-size-fits-all’ program does not exist
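The "How deep?" question in step 1 is usually answered with the standard mean-coverage relation C = N × L / G (number of reads times read length, divided by genome size). The sketch below applies it in both directions; the genome size and read length in the usage lines are illustrative assumptions, not values from the slides.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced: C = N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    genome = 3_000_000_000  # assumed genome size (about 3 Gb)
    print(mean_coverage(40_000_000, 75, genome))  # 40 M reads of 75 bases
    print(reads_needed(30, 75, genome))           # reads for 30x mean coverage
```

Mean coverage is only an average: repeats, GC bias and unmappable regions mean that actual depth varies along the genome, so the required depth depends on the application.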
Applications
 De novo sequence assembly
 Whole Genome Assembly
 Transcriptome Assembly
 Short Sequence Alignment
 Single read
 Paired read
 Genomic Variation Detection
 Detection of Single Nucleotide Polymorphism (SNP)
 Detection of Alternative Splicing Event
 Detection of major/minor transcript isoforms
Applications
RNA-Seq
Table 2 from Shendure & Ji, 2008
Bioinformatics Tools
Table 3 from Shendure & Ji, 2008
File Format
 Sequence Reads
   fastq
   fasta
 Alignment
   Sequence Alignment/Map (SAM): http://samtools.sourceforge.net/SAM1.pdf
   BAM: http://iesdp.gibberlings3.net/file_formats/ie_formats/bam_v1.htm
 Samtools: http://samtools.sourceforge.net/
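As a concrete view of the alignment format, here is a minimal sketch that splits one SAM alignment line into its eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) and tests two FLAG bits. The example line is made up for illustration, and the class and function names are hypothetical; in practice a library or Samtools would do this work.

```python
from dataclasses import dataclass


@dataclass
class SamRecord:
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory fields of one alignment line (optional tags are ignored)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    q, f, r, p, m, c, rn, pn, t, s, ql = fields[:11]
    return SamRecord(q, int(f), r, int(p), int(m), c, rn, int(pn), int(t), s, ql)


# Hypothetical alignment line, invented for this example.
EXAMPLE = "read_001\t16\tchr1\t100\t37\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"
record = parse_sam_line(EXAMPLE)
print(record.rname, record.pos, record.is_reverse)
```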
Data: Sequence Reads
 Size of raw data
   A challenge that calls for new compression algorithms
Data: Sequence Reads
 Example from an Illumina sequencing read file (fastq)
Line 1: @EAS042_0001:1:1:1061:20798#0/1
Line 2: TNTCTGTGTCCTGGGGCATCAATGATAGTCACATAGTACTTGCTGGTCTCAAATTTCCACAAGGAGATATCAATGG
Line 3: +EAS042_0001:1:1:1061:20798#0/1
Line 4: aB\^^Y]a^]cde`daaYaaa_bc\\`b^Y\a\aaUQY\]a\`aa\W__]HVZ]VQF^[`UH]\J^F^T^\\I]__
Line 1: read identifier
  EAS042_0001   the unique instrument name
  1             flowcell lane
  1             tile number within the flowcell lane
  1061          'x'-coordinate of the cluster within the tile
  20798         'y'-coordinate of the cluster within the tile
  #0            index number for a multiplexed sample (0 for no indexing)
  /1            the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
Line 2: raw sequence
Line 3: '+', optionally followed by the same identifier as Line 1
Line 4: sequence quality scores, from -5 to 62, encoded as ASCII 59 to 126
Will Lossy Compression work?
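To make the four-line record concrete, the sketch below splits the Line 1 identifier of the example read into its fields and decodes the Line 4 quality string. The offset of 64 is inferred from the stated mapping of scores -5 to 62 onto ASCII 59 to 126; other FASTQ variants use a different offset, so treat it as an assumption. The helper names are illustrative.

```python
from typing import List, NamedTuple


class ReadId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    pair_member: int


def parse_read_id(line1: str) -> ReadId:
    """Split '@instrument:lane:tile:x:y#index/member' into its fields."""
    body = line1.lstrip("@")
    location, _, tail = body.partition("#")
    instrument, lane, tile, x, y = location.split(":")
    index, _, member = tail.partition("/")
    return ReadId(instrument, int(lane), int(tile), int(x), int(y), index, int(member))


def decode_qualities(line4: str, offset: int = 64) -> List[int]:
    """Convert each quality character to an integer score (ASCII code minus offset)."""
    return [ord(ch) - offset for ch in line4]


LINE1 = "@EAS042_0001:1:1:1061:20798#0/1"
# Raw string so the backslashes in the quality line are kept literally.
LINE4 = r"aB\^^Y]a^]cde`daaYaaa_bc\\`b^Y\a\aaUQY\]a\`aa\W__]HVZ]VQF^[`UH]\J^F^T^\\I]__"

read_id = parse_read_id(LINE1)
scores = decode_qualities(LINE4)
print(read_id)
print(scores[:10])
```

On the lossy-compression question: the quality line is as long as the sequence line and draws on a much larger alphabet, so it tends to dominate the compressed file size. That makes coarser binning of the decoded scores a natural candidate for lossy schemes.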
Example of Applications
 ChIP-Seq
 allows you to assay the amount of binding and location of a
protein to DNA, such as a transcription factor bound to the
start site of a gene, or histones of a certain type.
 RNA-Seq
 Transcriptome sequencing
 Substantial challenges exist for annotation
 Should be able to reconstruct transcripts & accurately measure
their relative abundance w/o reference to an annotated
genome
ChIP-Seq
Chromatin immunoprecipitation (ChIP)
followed by high-throughput sequencing
Figure 1 from Mardis, 2007
ChIP-Seq
 ChIP-chip: ChIP is coupled to DNA hybridization array (chip) technology
   This is the closest methodology to ChIP-seq, but its mapping precision is lower, and the dynamic range of the readout is significantly less.
Comparison of ChIP-seq and ChIP-chip.
Representative signals from ChIP-seq (solid line) and
ChIP-chip (dashed line) show both greater dynamic
range and higher resolution with ChIP-seq. Whereas
three binding peaks are identified using ChIP-seq, only
one broad peak is detected using ChIP-chip.
Liu et al. BMC Biology 2010 8:56 doi:10.1186/1741-7007-8-56
ChIP-Seq
 Three key steps
  1. antibody selection – most crucial
  2. actual sequencing, which is subject to several possible biases
  3. algorithmic analysis, including mapping and peak-calling
       short tags (around 25 to 35 bp) can be ambiguous in regions of high homology or in repeat regions
 Alignment and peak-calling to detect active binding sites
   Alignment tools: BWA, MAQ, SOAP, …
   A large number of free and commercial peak-calling software packages exist: MACS, SICER, PeakSeq, SISSR, F-seq
 Pepke S, Wold B, Mortazavi A: Computation for ChIP-seq and RNA-seq studies. Nat Methods 2009, 6:S22-S32.
 Barski A, Zhao K: Genomic location analysis by ChIP-Seq. J Cell Biochem 2009, 107:11-18.
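To illustrate what "peak-calling" means after alignment, here is a deliberately naive sketch: aligned tag start positions are counted in fixed-width bins, bins reaching a threshold are kept, and adjacent kept bins are merged into one peak. This is not the algorithm of MACS, SICER, PeakSeq, SISSR or F-seq, which model background, strand shift and significance; the bin size, threshold and tag positions are made-up values.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def call_peaks(
    tag_starts: Iterable[int], bin_size: int = 100, min_tags: int = 5
) -> List[Tuple[int, int, int]]:
    """Return (start, end, tag_count) for runs of adjacent bins with >= min_tags tags each."""
    counts = Counter(pos // bin_size for pos in tag_starts)
    enriched = sorted(b for b, n in counts.items() if n >= min_tags)

    peaks: List[Tuple[int, int, int]] = []
    for b in enriched:
        start, end = b * bin_size, (b + 1) * bin_size
        if peaks and peaks[-1][1] == start:
            # This bin touches the previous peak, so extend that peak.
            prev_start, _, prev_count = peaks[-1]
            peaks[-1] = (prev_start, end, prev_count + counts[b])
        else:
            peaks.append((start, end, counts[b]))
    return peaks


# Hypothetical tag start positions on one chromosome.
tags = [105, 110, 120, 130, 150, 160, 210, 220, 230, 240, 250, 900, 1500, 1510]
print(call_peaks(tags))
```

A fixed threshold like this makes the sensitivity/specificity trade-off explicit: lowering `min_tags` recovers weaker sites but admits more background.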

ChIP-Seq
Shirley Pepke, Barbara Wold & Ali Mortazavi
Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009
doi:10.1038/nmeth.1371
ChIP-Seq: Wilbanks et al.
 Wilbanks EG, Facciotti MT (2010) Evaluation of Algorithm Performance in
ChIP-Seq Peak Detection. PLoS ONE 5(7): e11471.
doi:10.1371/journal.pone.0011471
Figure 1
ChIP-Seq: Wilbanks et al.
Figure 7. Positional accuracy and
precision. The distance between the
predicted binding site and high
confidence motif occurrences within
250 bp was calculated for different
peak calling programs in the (A)
NRSF….
ChIP-Seq: Wilbanks et al.
 Conclusion: It is a hard problem!
   A balance between sensitivity and specificity is desired when compiling the final candidate peak list
   High false positives!
   “We suggest that rather than focus solely on algorithmic development, equal or better gains could be made through careful consideration of experimental design and further development of sample preparations to reduce noise in the datasets.”
 New methods do not always give us clear ideas about the outcome…
   Biologists often do not think about the analysis in advance, and quantitative scientists often have little idea what to recommend for the experiments. As a result, the results of experiments are likely to be inconclusive!
RNA-Seq
Transcriptome Analysis
Figure 5 | Overview of RNA-Seq. A RNA fraction of
interest is selected, fragmented and reverse
transcribed. The resulting cDNA can then be
sequenced using any of the current ultra-high-throughput technologies to obtain ten to a hundred
million reads, which are then mapped back onto the
genome. The reads are then analyzed to calculate
expression levels.
Shirley Pepke, Barbara Wold & Ali Mortazavi
Nature Methods 6, S22 - S32 (2009) Published
online: 15 October 2009
doi:10.1038/nmeth.1371
RNA-Seq: Strategies
Figure 1 from Haas & Zody, 2010
RNA-Seq: Strategies
 Alignment Strategy
   Align to transcriptome
       no new transcript discovery
   Align to genome and exon-exon junction sequences
       extremely large search space due to all possible exon combinations
   De novo assembly
       Cufflinks
       Scripture
Shirley Pepke, Barbara Wold & Ali Mortazavi
Nature Methods 6, S22 - S32 (2009)
Published online: 15 October 2009
doi:10.1038/nmeth.1371
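The remark about the search space for exon-exon junctions can be made concrete. The sketch below builds a junction sequence for every ordered pair of exons within one gene (upstream exon before downstream exon), which gives n(n-1)/2 candidates for n exons. The exon sequences and the `flank` length are hypothetical; real pipelines choose the flank from the read length and treat strand and annotation more carefully.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def junction_sequences(exons: List[str], flank: int) -> Dict[Tuple[int, int], str]:
    """Join the last `flank` bases of exon i to the first `flank` bases of exon j, for all i < j."""
    junctions: Dict[Tuple[int, int], str] = {}
    for (i, upstream), (j, downstream) in combinations(enumerate(exons), 2):
        junctions[(i, j)] = upstream[-flank:] + downstream[:flank]
    return junctions


def junction_count(n_exons: int) -> int:
    """Number of ordered exon pairs within a gene: n * (n - 1) / 2."""
    return n_exons * (n_exons - 1) // 2


# Hypothetical exon sequences for one gene, in genomic order.
exons = ["AAAACCCC", "GGGGTTTT", "ACGTACGT"]
print(junction_sequences(exons, flank=4))
print(junction_count(10), junction_count(50))
```

The quadratic growth per gene, multiplied across all genes, is what makes an exhaustive junction library large.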
RNA-Seq
 two major objectives of RNA-Seq experiments:
 Identification of novel transcripts from the locations of regions
covered in the mapping.
 Estimation of the abundance of the transcripts from their
depth of coverage in the mapping.
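Both objectives rest on the same quantity, the per-base depth of mapped reads. The minimal sketch below computes depth over a region from alignment intervals, then reports covered stretches (candidate transcribed regions) with their mean depth (a crude abundance proxy). Coordinates are assumed to be 0-based and half-open, and the intervals are made up; splicing and multi-mapped reads are ignored.

```python
from typing import List, Tuple


def depth_profile(alignments: List[Tuple[int, int]], region_length: int) -> List[int]:
    """Per-base read depth over [0, region_length), built with a difference array."""
    diff = [0] * (region_length + 1)
    for start, end in alignments:
        start, end = max(0, start), min(region_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth, running = [], 0
    for delta in diff[:region_length]:
        running += delta
        depth.append(running)
    return depth


def covered_regions(depth: List[int], min_depth: int = 1) -> List[Tuple[int, int, float]]:
    """Return (start, end, mean_depth) for each maximal run with depth >= min_depth (min_depth >= 1)."""
    regions = []
    start = None
    for pos, d in enumerate(depth + [0]):  # trailing 0 closes a run at the region end
        if d >= min_depth and start is None:
            start = pos
        elif d < min_depth and start is not None:
            segment = depth[start:pos]
            regions.append((start, pos, sum(segment) / len(segment)))
            start = None
    return regions


# Hypothetical alignments on a 20-base region.
alignments = [(2, 6), (4, 8), (12, 15)]
depth = depth_profile(alignments, 20)
print(depth)
print(covered_regions(depth))
```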
TopHat/Cufflinks
 Cole Trapnell, Lior Pachter, and Steven L. Salzberg, TopHat:
discovering splice junctions with RNA-Seq. Bioinformatics (2009)
25(9): 1105-1111 doi:10.1093/bioinformatics/btp120
 Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon
Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold & Lior
Pachter, Transcript assembly and quantification by RNA-Seq reveals
unannotated transcripts and isoform switching during cell
differentiation. Nature Biotechnology, Vol: 28, 511–515 (2010)
TopHat/Cufflinks
Trapnell et al., 2009; Trapnell et al., 2010
Scripture
• Mitchell Guttman, Manuel Garber, Joshua Z Levin, Julie Donaghey,
James Robinson, Xian Adiconis, Lin Fan, Magdalena J Koziol, Andreas
Gnirke, Chad Nusbaum, John L Rinn, Eric S Lander & Aviv
Regev, Ab initio reconstruction of cell type–specific
transcriptomes in mouse reveals the conserved multi-exonic structure
of lincRNAs. Nature Biotechnology. Vol: 28, 503–510 (2010)
Scripture
Figures 1 and 2 from Guttman et al., 2010
RNA-Seq Software
Shirley Pepke, Barbara Wold & Ali Mortazavi
Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009
doi:10.1038/nmeth.1371
Quantitation
 Metric for RNA-Seq Expression
   RPKM: reads per kilobase per million reads
  1. Count the number of reads which map to constitutive exon bodies. The set of constitutive exons was derived from Ensembl genes (hg18, UCSC genome browser), where an exon was defined to be constitutive if present in all transcripts for a given gene.
  2. Determine the number of uniquely mappable positions in the same set of constitutive exons. "Uniquely mappable" was defined as being a unique 32-mer in the genome and our junction database.
  3. Count the total number of uniquely mapping reads in each tissue or sample.
  4. Compute RPKM as the number of reads which map per kilobase of exon model per million mapped reads for each gene, for each tissue or sample.
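The final step reduces to one line of arithmetic: RPKM = reads × 10^9 / (exon model length in bases × total mapped reads). The sketch below assumes the three counts from steps 1 to 3 are already available; the argument names and the example numbers are illustrative, not taken from the slides.

```python
def rpkm(reads_in_gene: int, exon_model_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads.

    reads_in_gene        -- reads mapping to the gene's constitutive exons (step 1)
    exon_model_length_bp -- uniquely mappable positions in those exons (step 2)
    total_mapped_reads   -- uniquely mapping reads in the sample (step 3)
    """
    if exon_model_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return reads_in_gene * 1e9 / (exon_model_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # 500 reads on a 2 kb exon model in a sample of 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```

Dividing by length makes long and short genes comparable within a sample, and dividing by the total read count makes samples of different sequencing depth comparable.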
RNA-Seq
 De novo assembly algorithms
 Post-transcriptional regulation
References
 Metzker, M.L. (2010) Sequencing technologies - the next generation. Nat Rev Genet, 11, 31-46.
 Mardis, E.R. (2008) Next-generation DNA sequencing methods. Annu Rev Genom Hum G, 9, 387-402.
 Shendure, J. and Ji, H.L. (2008) Next-generation DNA sequencing. Nat Biotechnol, 26, 1135-1145.
 Mardis, E.R. (2007) ChIP-seq: welcome to the new frontier. Nat Methods, 4, 613-614.
 Wang, Z., Gerstein, M. and Snyder, M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10, 57-63.
 Haas, B.J. and Zody, M.C. (2010) Advancing RNA-Seq analysis. Nat Biotechnol, 28, 421-423.
Questions?