RNA-Seq
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
RNA-seq Protocol
[Figure: RNA-seq protocol. Martin and Wang, Nat. Rev. Genet. (2011)]
RNA-seq Applications
• Expression levels, differential expression
• Alternative splicing, novel isoforms
• Novel genes or transcripts, lncRNA
• Detect gene fusions
• Many different protocols
• Can use on any sequenced genome
• Better dynamic range, cleaner data
Experimental Design
• Assessing biological variation requires biological replicates (no need for technical replicates)
• 3 preferred, 2 OK, 1 only for exploratory assays (not good for publications)
• For differential expression, don’t pool RNA from multiple biological replicates
• Batch effects still exist, try to be consistent or process all samples at the same time
Experimental Design
• Ribo-minus (deplete overly abundant rRNA)
• PolyA (mRNA, enrich for exons)
• Strand specific (anti-sense lncRNA)
• Sequencing:
  – PE (resolve redundancy) or SE: expression
  – PE for splicing, novel transcripts
  – Depth: 30-50M reads for differential expression, deeper for transcript assembly
  – Read length: longer for transcript assembly
RNA-seq Analysis
Alignment
• Prefer splice-aware aligners
• TopHat, STAR (not DNASTAR); BWA is not itself splice-aware
• Sometimes need to trim the first few bases of each read
[Figure: genome alignment versus splice-aware alignment of reads to a gene]
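Trimming the beginning bases of a read can be sketched in a few lines. This is only an illustration: the function name `trim_fastq_record`, the tuple layout, and the number of bases trimmed are assumptions, not part of any aligner mentioned on the slide.

```python
def trim_fastq_record(record, n_trim):
    """Drop the first n_trim bases (and their qualities) from one FASTQ record.

    `record` is a 4-tuple: (header, sequence, plus_line, qualities).
    """
    header, seq, plus, qual = record
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Sequence and quality must be trimmed together to stay in sync.
    return (header, seq[n_trim:], plus, qual[n_trim:])


# Hypothetical read; trimming 3 bases is for illustration only.
read = ("@read1", "NNNACGTACGT", "+", "###IIIIIIII")
trimmed = trim_fastq_record(read, 3)
```

In practice a dedicated trimming tool would do this across the whole file; the sketch only shows what happens to a single record.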
Transcript Assembly
• Reference-based assembly: Cufflinks
• De novo assembly: Trinity
Quality Control: RSeQC
Expression Index
• RPKM (Reads per kilobase of transcript per million reads of library)
  – Corrects for coverage, gene length
  – 1 RPKM ~ 0.3-1 transcript / cell
  – Comparable between different genes within the same dataset
  – TopHat / Cufflinks
• FPKM (Fragments), PE libraries, RPKM/2
• TPM (transcripts per million)
  – Normalizes to transcript copies instead of reads
  – Longer transcripts have more reads
  – RSEM, HTSeq
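The two expression indices can be compared on a toy example. The sketch below uses the standard definitions of RPKM and TPM; the gene counts and lengths are made-up numbers, and the function names are illustrative rather than taken from Cufflinks or RSEM.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts)
    return [c * 1e9 / (total_reads * length)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]


# Hypothetical library: three genes with read counts and lengths in bp.
counts = [100, 800, 100]
lengths = [1000, 8000, 2000]

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

The first two genes have the same reads per base, so they get equal RPKM and equal TPM even though the longer gene has eight times as many reads. TPM values always sum to one million within a sample, which is not true of RPKM.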
Differential Expression
Sequencing Read Distribution
• Poisson distribution:
  – # events within an interval
• Sequencing data is overdispersed relative to Poisson (variance exceeds the mean)
• Negative binomial
  – Def: # of successes before r failures occur, if the probability of each success is p
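To see why the negative binomial accommodates overdispersion, its mean and variance can be computed directly from the definition above (successes before r failures, success probability p). The values of r and p below are arbitrary illustrative choices, and the helper names are assumptions.

```python
from math import comb


def nb_pmf(k, r, p):
    """P(k successes before the r-th failure), success probability p."""
    return comb(k + r - 1, k) * p**k * (1 - p)**r


def moments(pmf, k_max):
    """Mean and variance of a count distribution, summed over 0..k_max-1."""
    mean = sum(k * pmf(k) for k in range(k_max))
    var = sum((k - mean) ** 2 * pmf(k) for k in range(k_max))
    return mean, var


r, p = 5, 0.8  # illustrative parameters only
nb_mean, nb_var = moments(lambda k: nb_pmf(k, r, p), 500)

# A Poisson with the same mean would have variance equal to that mean.
poisson_var = nb_mean
```

Under this parameterization the mean is r·p/(1-p) and the variance is r·p/(1-p)², so the variance is always larger than the mean by a factor of 1/(1-p), which a Poisson cannot represent.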
Differential Expression
• Negative binomial for RNA-seq
• Variance estimated by borrowing information from all the genes
  – hierarchical models
• Test whether μ_i is the same for gene i between samples j
• FDR?
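Because one test is run per gene, the p-values need multiple-testing correction. One common way to control the FDR is the Benjamini-Hochberg procedure, sketched below; the slide does not say which FDR method is intended, so this choice and the example p-values are assumptions.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a differential expression test.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
qvals = benjamini_hochberg(pvals)
```

Genes whose q-value falls below the chosen threshold (for example 0.05) are called differentially expressed, with the expectation that about that fraction of the calls are false positives.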
Differential Expression
• Should we do differential expression on RPKM/FPKM or TPM?
[Figure: Gene A (1kb) versus Gene B (8kb)]
• Cufflinks: RPKM/FPKM
• LIMMA-VOOM and DESeq: TPM (both take raw read counts as input)
• Power to detect DE increases with transcript length (longer transcripts yield more reads)
• Continued development and updates
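The effect of length on power can be illustrated with a 1 kb gene and an 8 kb gene that have the same per-kilobase coverage and the same fold change. The sketch assumes Poisson counts and equal library sizes, so that under the null each read is equally likely to come from either condition; the read numbers are invented, and this exact binomial test is a stand-in, not what Cufflinks or DESeq do.

```python
from math import comb


def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k of n reads coming from condition 1,
    under the null that both conditions are equally likely (p = 0.5)."""
    lower = min(k, n - k)
    tail = sum(comb(n, i) for i in range(lower + 1)) / 2**n
    return min(1.0, 2 * tail)


# Assumed coverage: 20 reads/kb in condition 1, 30 reads/kb in condition 2.
reads_cond1_per_kb, reads_cond2_per_kb = 20, 30

p_by_length = {}
for length_kb in (1, 8):
    k = reads_cond1_per_kb * length_kb
    n = (reads_cond1_per_kb + reads_cond2_per_kb) * length_kb
    p_by_length[length_kb] = two_sided_binomial_p(k, n)
```

The same 1.5-fold change is not significant for the 1 kb gene but is clearly significant for the 8 kb gene, purely because the longer gene collects more reads.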
Alternative Splicing
• Assign reads to splice isoforms
[Figure: two splice forms, one containing Exon 1, Exon 2, Exon 3 and one containing Exon 1, Exon 3; reads are labeled "definitely splice form 1", "definitely splice form 2", or "ambiguous"]
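Assigning reads to splice isoforms can be sketched as a compatibility check. The example assumes that splice form 1 contains exons 1, 2, 3 and splice form 2 skips exon 2, and it represents each read simply by the ordered exons it covers; both are simplifications made for illustration.

```python
FORM_1 = (1, 2, 3)  # assumed: includes exon 2
FORM_2 = (1, 3)     # assumed: skips exon 2


def is_compatible(read_exons, isoform):
    """True if the read's exons appear as a contiguous run in the isoform."""
    n = len(read_exons)
    return any(tuple(isoform[i:i + n]) == tuple(read_exons)
               for i in range(len(isoform) - n + 1))


def classify(read_exons):
    """Label a read by which of the two splice forms could have produced it."""
    in_1 = is_compatible(read_exons, FORM_1)
    in_2 = is_compatible(read_exons, FORM_2)
    if in_1 and in_2:
        return "ambiguous"
    if in_1:
        return "splice form 1"
    if in_2:
        return "splice form 2"
    return "incompatible"


# Reads within one exon, and reads spanning an exon-exon junction.
reads = [(1,), (2,), (1, 2), (2, 3), (1, 3), (3,)]
labels = {r: classify(r) for r in reads}
```

Reads that touch exon 2 can only come from the form that includes it, a read spanning the exon 1 to exon 3 junction can only come from the skipping form, and reads lying entirely within exon 1 or exon 3 are ambiguous.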
Isoform Inference
• If given known set of isoforms
• Estimate x to maximize the likelihood of observing n
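One standard way to maximize such a likelihood is expectation-maximization. The sketch below is a simplified illustration, not the slide's exact model: it reads x as the fraction of reads originating from each isoform and n as the read counts in each compatibility class, and it ignores isoform lengths. The function name and the counts are hypothetical.

```python
def em_isoform_fractions(compat, counts, n_isoforms, n_iter=200):
    """Estimate the fraction of reads from each isoform by EM.

    compat[c] lists the isoforms compatible with read class c;
    counts[c] is the number of reads observed in that class.
    """
    x = [1.0 / n_isoforms] * n_isoforms
    total = sum(counts)
    for _ in range(n_iter):
        expected = [0.0] * n_isoforms
        # E-step: split each class among its compatible isoforms using current x.
        for isoforms, n in zip(compat, counts):
            weight = sum(x[k] for k in isoforms)
            for k in isoforms:
                expected[k] += n * x[k] / weight
        # M-step: new fractions are the expected read shares.
        x = [e / total for e in expected]
    return x


# Two isoforms: 30 reads unique to isoform 0, 10 unique to isoform 1,
# and 60 reads compatible with both.
compat = [(0,), (1,), (0, 1)]
counts = [30, 10, 60]
x_hat = em_isoform_fractions(compat, counts, n_isoforms=2)
```

Here the ambiguous reads end up split in the same 3:1 ratio as the uniquely assignable reads. A realistic model would also weight each isoform by its effective length.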
Known Isoform Abundance Inference
Isoform Inference
• With a known isoform set, gene-level expression inference can be accurate even when isoform abundances have large uncertainty (e.g. known set incomplete)
• De novo isoform inference is a non-identifiable problem if RNA-seq reads are short and the gene is long with too many exons
• Algorithm: MATS
Gene Fusion
• More seen in cancer samples
• Still a bit hard to call
• TopHatFusion in TopHat2
Maher et al., Nature 2009
Other Applications
• RNA editing
  – Change in RNA sequence after transcription
  – Most frequent: A to I (behaves like G), C to U
  – Editing enzymes evolved from mononucleotide deaminases, might be involved in RNA degradation
• Circular RNA
  – Mostly arise from splicing
  – Varying length, abundance, and stability
  – Possible function: sponge for RBP or miRNA
Summary
• RNA-seq design considerations
• Read mapping
  – TopHat, BWA, STAR
• De novo transcriptome assembly: Trinity
• Expression index: FPKM and TPM
• Differential expression
  – Cufflinks: versatile
  – LIMMA-VOOM and DESeq: better variance estimates
• Alternative splicing: MATS
• Gene fusion, RNA editing, circular RNA
Acknowledgement
• Alisha Holloway
• Simon Andrews
• Radhika Khetani