Transcript Slide 1
Canadian Bioinformatics Workshops www.bioinformatics.ca Module #: Title of Module 2 Module 3 RNA-seq Ryan Morin On some specific slides I would prefer you not reproduce the material. In those cases I will use this symbol in the right of the slide. Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero. Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at; http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites RNA-Seq bioinformatics.ca What to expect from this module • A (hopefully-not-too) biased perspective of a cancer researcher on the applications/proper analysis of RNAseq data, examples using Illumina data • What RNA-seq data is (can be) used for – Depends very heavily on what you are interested in • Software options for processing RNA-seq data – Many tools available, choice will depend heavily on interests/needs – Benefits and pitfalls of individual analysis strategies • Examples of analyzing data for various applications – Expression, Alternative Splicing, SNVs, Fusion transcripts RNA-Seq bioinformatics.ca What this module will not cover • microRNA sequencing (miRNA-seq) data – For a good example of cancer miRNA-seq data, I recommend a recent paper by Daniela Witten et al, BMC Biol, 2010 PMID: 20459774. • Strand-specific RNA-seq data • Colourspace alignment or any other issues unique to colourspace RNA-seq data • The plethora of short-read aligners that are available exception: The few that are specifically tailored for RNAseq • Things that I have forgotten to include… RNA-Seq bioinformatics.ca What this module does cover • Options and methods for successful alignment of RNAseq data • Identification of Single Nucleotide Variants (SNVs) and caveats specific to RNA-seq • Analyzing aligned RNA-seq data for differential gene expression using the R language • Analyzing paired RNA-seq libraries for alternative expression/splicing • Discovery of chimeric transcripts from RNA-seq RNA-Seq bioinformatics.ca Why sequence (cancer) transcriptomes? • Annotation & Discovery – New features of the transcriptome including genes, exons, splicing events, ncRNAs, lincRNAs – New transcript structures derived from chromosomal aberrations or structural polymorphisms – In cancer, novel events may reveal cancer drivers and/or may be “biomarkers” • Measurement – Expression of individual genes/loci and the potential to quantitatively discriminate isoforms using junction reads and coverage of individual exons, introns etc. – Presence, genotype, and expression of alleles (may be relevant to disease, may reflect underlying epigenetic changes) RNA-Seq bioinformatics.ca How RNA-seq data is generated Isolate Transcript RNA AAAAAA AAAAAA AAAAAA AAAAAA Reverse Transcription Fragment cDNA Size Selection Illumina Sequencing of each end CAGG CAAA GGAG AAAA *based on Illumina approach **strand-specific RNA-seq protocols exist for both Illumina and SOLiD Slide complements of Andrew McPherson RNA-Seq CTGG GAAA bioinformatics.ca Numerous possible analysis strategies • There is no one ‘correct’ way to analyze RNA-seq data (though there are some incorrect ways) • Two major branches – Direct alignment of reads (spliced or unspliced) to genome or transcriptome – Assembly of reads followed by alignment* or transcriptome Image from Haas & Zody, 2010 *Assembly is the only option when working with a creature with no genome sequence, alignment of contigs may be to ESTs, cDNAs etc RNA-Seq bioinformatics.ca Background: Aligning genomic DNA sequence reads • The first task after obtaining your reads (and any QC) is to determine the corresponding site in the reference genome from which they each derive – Termed genome alignment (also known as ‘mapping’) • Many next-gen aligners available, two main approaches – Store ‘indexed’ genome in memory and map each read successively (memory footprint proportional to genome size) – Store indexed reads in memory and scan across reference genome (memory footprint proportional to read number) • Pairing information is used to select optimal placement of reads as a pair – may allow more mismatches, gapped alignment etc. RNA-Seq bioinformatics.ca Common alignment conventions for next-gen data • Multi-map reads (i.e. reads that can be placed in multiple locations with equal score) have only one location reported – Some aligners allow all alignments to be reported (or up to some maximal number) • Format/details of alignment output vary – Many groups/applications are beginning to standardize towards SAM format • Efficient representation of alignment • Fast retrieval, compatible with next-gen data viewers (e.g. IGV) – BAM file is the starting point for most analyses RNA-Seq bioinformatics.ca Drawbacks for each strategy • Alignment to genome – Computationally expensive – It is never a good idea to simply align RNA-seq data to the genome Need a spliced aligner or a surrogate (such as including exon-exon junction sequences in ‘genome’) • Alignment to transcriptome – Reads deriving from non-genic structures may be ‘forcibly’ (and erroneously) aligned to genes • Incorrect gene expression values • False positive SNVs • Many other potential problems • Assembly – Low expression = difficult/impossible to assemble – Misassemblies/fragmented contigs due to repeats – Requires vast amounts of memory RNA-Seq bioinformatics.ca Benefits of each approach • Alignment to genome – Allow reads from unannotated loci, introns et cetera to align to their correct locations… potential for new biological insights • Alignment to transcriptome – Computationally inexpensive – Spliced (exon junction) reads map correctly – Pairing distance and junction reads may help distinguish individual isoforms (informative/unique regions of transcripts) • Assembly – Can provide a more long-range view of transcripts – Allows detection of chimeric transcripts and resolution of ‘breakpoints’ – May not necessarily need a genome RNA-Seq bioinformatics.ca Why not treat RNA-seq reads like genomic DNA sequence? • Reads crossing one or more exon-exon junction will not align properly to the genome (longer reads, bigger issue) • Reads deriving from a fragment that spans multiple exons will have a large pairing distance – skews any estimation of fragment length – Incompatible with any aligner requiring a sane pairing distance distribution or hard limit on pairing distance – More bias when correcting/compensating for gene length How it should look if aligned properly large pairing distance (expected) How it will look if aligned improperly (only to genome) RNA-Seq bioinformatics.ca Misalignments can cause many problems and introduce bias • Example of junction reads being forcibly aligned to exons • Better alignments provide more signal and less noise – Reduce false positive SNV calls – Better resolution/discrimination of splicing events, expression Correct alignments Same data, aligned with MAQ to the same reference sequence Systematic alignment artefacts introduce false positive SNV calls (boo!) RNA-Seq Truncating mutation in a tumour suppressor gene (yay!) bioinformatics.ca Solution 1: Include exon junction sequences in genome • A common strategy is to append all exon junction sequences to reference genome (read length-1 from either side of junction) – Exact implementation varies (e.g. what to do for shorter exons?) – For MAQ, concatenating them in a single block at the end of the corresponding chromosome worked to a degree (considered the same chromosome) – Some aligners (e.g. BWA, MAQ) that use pairing distance Genome to find optimal alignment Transcript model may force incorrect alignments Exon junction RNA-Seq bioinformatics.ca BWA can force incorrect alignments to meet pairing criteria • Even if a better junction alignment is available for a mate, BWA may choose the suboptimal alignment that pairs two reads correctly (was not designed for RNA-seq data) – Can force a junction read to align to a nearby exon rather than the junction sequence (which is considered very distant in the internal representation of the genome) – May force pairing between two junctions that are considered nearby even if completely unrelated >chr1 AGAGCACAT… >junction1,geneA FASTA ACTGACCTANNN >junction2,geneB CCGGATTAANNN >junction3,geneC GATTACAAANNN RNA-Seq BWA internal representation (concatenated genome) AGAGCACAT…ACTGACCTANNNCCGGATTAANNNGATTACAAANNN Internally, two junctions (here, from different genes) may be considered a ‘fragment length’ apart Note: A string of N’s should be included to prevent alignments between neighboring sequences bioinformatics.ca Remedy for the pairing conundrum • Modified BWA to understand what a junction sequence is and where it belongs in the genome (Implementation by Rodrgio Goya, BCGSC) • When assessing pairing, BWA considers the parent genomic position of either half of the junction rather than artificial position in concatenated genome • This is a work in progress, but be on the lookout for when it is added to an upcoming BWA release RNA-Seq bioinformatics.ca Solution 2: Align to genome and transcriptome separately • Reads are aligned to the genome and transcriptome as separate processes • Alignments to transcriptome must be re-positioned onto their corresponding genomic location (representing introns as large gaps) • Compare the alignments for each read and choose best alignment (either genomic or cDNA) based on mismatches, pairing etc. BEFORE: cDNA alignment (SAM format, columns 1-6) SOLEXA12_1:7:94:1231:901 163 ENST00000320356 461 30 AFTER: ‘repositioned’ genomic alignment SOLEXA12_1:7:94:1231:901 163 chr7 148157868 30 RNA-Seq 75M Intron 6M2785N69M bioinformatics.ca Solution 3: One of a handful of spliced aligners – G-Mo.R-Se (Denoeud et al, 2008. PMID: 19087247) – QPALMA (de Bona et al, 2008. PMID: 18689821) – TopHat (Trapnell et al, 2009. PMID: 19289445)* • • • • • Maps read first to genome (using bowtie) Builds database of possible exon junctions near putative exons Uses pairing information to expand list of possible juncions Aligns unaligned reads to putative exon junctions Portions of reads mapping to discrete locations allow discovery of non-canonical splice junctions – SpliceMap (Fai Au et al, 2010. PMID: 20371516)* • Maps reads in halves (>=50nt reads only ) and extends partial alignments • Similar strategy to more recent versions of TopHat – RNA-mate (Cloonan et al, 2010. PMID: 19648138) • Iterative read trimming/re-mapping, Colourspace friendly Other: BLAT, BFAST, GSNAP *Output in SAM format available RNA-Seq bioinformatics.ca TopHat with longer “short reads” • New features in TopHat (not in paper) add additional functionality with longer reads (and paired reads) – Split portions of reads are mapped independently as 25mers – Read coverage within potential introns no-longer restricts the consideration of possible splice junctions • This is important as many libraries have slight to extreme coverage of introns – Mates are used to guide where splice site could be Un-sequenced portion (length approximately known = L) Read 1 Read 2 Unmapped portion of read 1 Mapped portions of read 1 RNA-Seq Mapped portions of read2 bioinformatics.ca Solution 4: Assembly prior to/instead of alignment • A few de-novo assemblers have been shown to successfully assemble RNA-seq data – ABySS (Simpson et al, 2009. PMID: 19251739; Birol et al, 2009. PMID: 19528083), Velvet (Zerbino & Birney, 2008. PMID: 18349386) • Pairing information can be leveraged to position contigs that are adjacent (i.e. in the same transcript) • After assembly, contigs are usually aligned to the reference genome for further analysis (e.g. using BLAT) – Novel splice junctions – Chimeric transcripts from genomic rearrangements – Un-annotated regions of genes (e.g. exons) or entirely novel genes RNA-Seq bioinformatics.ca Identifying SNVs from alignments (contrast with DNA sequencing) • The same tools can be used to call variants from RNA-seq alignments, pick your favourite – e.g. SNVMix is a BAM-friendly SNV caller designed specifically for identifying somatic mutations in cancer* • Multiple caveats specific to RNA sequencing – Allelic balance will not necessarily be the case (skew between two alleles is commonly observed) – SNVs are more likely to be false positives (better/more carefully produced alignments should minimize this) – Real (true positive) SNVs may be RNA edits (i.e. not in DNA) * Rodrigo Goya, Sohrab Shah, http://compbio.bccrc.ca. (Goya et al. 2010. PMID:20130035) RNA-Seq bioinformatics.ca Examples of pesky SNVs Recurrent RNA edit in COG3 Junction reads misaligned Skewed expression Image from: Shah et al, 2010. PMID: 19812674 RNA-Seq bioinformatics.ca Getting expression data from alignments • Expression values can be tabulated for individual gene loci, transcripts, exons and splice junctions • Gene expression values typically reported in RPKMFPKM – Number of reads (actually fragments, for paired reads) per kb of exonic bases per million reads in the library – Compensates for variable library size and over-representation of reads from longer transcripts • Various software available – ERANGE (Mortazavi et al, 2008. PMID: 18516045) – Cufflinks (Trapnell et al, 2010. PMID: 20436464), probably supplants ERANGE – DEGseq package (Wang et al, 2010. PMID: 19855105) – ALEXA-Seq (Griffith et al, in revision) RNA-Seq bioinformatics.ca Analyzing RNA-seq data with R • Three main packages exist that are specifically intended for deep digital gene expression (DGE) – DEGseq (Wang et al, 2010. PMID: 19855105) • Specifically designed for RNA-seq data • Contains implementations of other methods described for analyzing RNA-seq data • Works on data with replicates or without (i.e. library pairs) – edgeR (Robinson & Smythe, 2007. PMID: 19910308) • Applicable for RNA-seq and other types of digital gene expression data • Attempts to compensate for “overdispersion” unique to digital expression data • Works on data with replicates or without (i.e. library pairs) – goseq (Young et al, 2010. PMID: 20132535) • Identifying over- or under-represented functional classes • Takes list of differential genes as input, corrects for gene length bias RNA-Seq bioinformatics.ca “Gene” expression is the tip of the iceberg • Abundance information from junction reads and coverage of individual exons and introns is ignored by many currently available approaches – Information of expression of individual isoforms – Retained introns and extended exons will be completely missed – Ideally, read pairing distance could also be used to determine the most likely isoform(s) present RNA-Seq bioinformatics.ca ALEXA-Seq: ALternative EXpression Analysis with RNA-Seq • A software/database suite that enables detection and quantitation of genes, exons and alternative splicing events (known and some novel) • Based on Ensembl gene annotations, can be applied to any species in Ensembl • Can detect may types of alternative splicing events • Exon skips, retained introns, alternative exon boundaries • Major application is paired analysis of two libraries (tissue A vs B, treated vs untreated etc) RNA-Seq bioinformatics.ca ALEXA-seq preparation RNA-Seq bioinformatics.ca ALEXA-seq Analysis Pipeline RNA-Seq bioinformatics.ca www.alexaplatform.org RNA-Seq bioinformatics.ca Multiple cancer datasets can be browsed RNA-Seq bioinformatics.ca Differential alternative expression in an ‘evolved’ cancer cell line RNA-Seq bioinformatics.ca Significant AE of UMPS in a drugresistant cancer cell line Significant events are presented in the context of the gene model Differentially abundant exon junctions and exons are highlighted RNA-Seq Griffith et al, 2008. PMID: 18235430 bioinformatics.ca Identifying Genomic Rearrangements in Cancer gene X gene Y rearrangement IGH BCL2 IGH BCL2 IGH BCL2 IGH BCL2 IGH fusion X-Y IGH • Translocations, inversions and deletions can bring two gene loci in close proximity • Transcriptional apparatus may co-transcribe exons from each gene resulting in a fusion transcript • Shown is a common translocation found in lymphoma, detectable by RNA-seq Image courtesy of Celie Goya Veyret RNA-Seq bioinformatics.ca Fusion transcripts can be identified by alignment or assembly • deFuse* (Andrew McPherson, unpublished): – Alignment to transcriptome – Identification of putative fusions – Targeted assembly of breakpoint/fusion sequence • Trans-ABySS** (Gordon Robertson, unpublished) – Full de novo assembly of entire RNA-seq data set (more depth, better result) – Assembly is run at many values of K (de Bruijn graph parameter) – Contigs are pooled and aligned to genome to identify ‘split’ contigs *deFuse available at http://compbio.bccrc.ca/?page_id=275 ** Trans-ABySS available at http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss RNA-Seq bioinformatics.ca deFuse identifies chimeric transcripts using transriptome alignment chimeric transcripts RNA-Seq non-chimeric reads spanning reads split reads CTAG CAAA CTAG CTGG GGAG CTAG GGAG CTAG Reads CAAA GAAA AAAA CAAA CTGG GAAA AAAA CAAA CTGG Fragments CTAG CTAG concordant CTAG GAAA CTAG GAAA CTAG GAAA CAAA CAAA CTAG GAAA discordant single end anchored Alignment deFuse slides complements of Andrew McPherson RNA-Seq bioinformatics.ca Overview of deFuse strategy Input: All alignments to spliced and unspliced gene sequences are considered 1. Generate fragment length distribution 2. Cluster discordant alignments 3. Resolve ambiguous alignments 4. Fusion boundary resolution using targeted split read analysis 5. Corroborate spanning and split read evidence RNA-Seq bioinformatics.ca Clustering of discordant pairs Fusion boundary region (yellow): Region in which the fusion boundary is expected to exist Overlapping fusion boundary region condition: Transcript X Transcript Y Transcript X Transcript Y Fragment length difference: Fragment length difference can be calculated as abs(dX + dY) dX Transcript X RNA-Seq dY Transcript Y bioinformatics.ca Resolving fusion sequence using split read alignment a. Calculate combined fusion boundary region of each cluster Transcript X SX SY Transcript Y b. Find candidate discordant and single end anchored reads with potential alignments to fusion boundary regions Transcript X SX SY Transcript Y c. Align each candidate read to fusion boundary regions • Dynamic programming based local alignment to find split positions • Backtracking not required to find all split positions with maximum score d. Cluster split positions by implied fusion boundary e. Select cluster with the most reads split near the middle RNA-Seq SX SY bioinformatics.ca RNA-seq as an annotation and gene discovery tool • We often will not know what we are looking for until we have found it! • There is interest in discovering the remaining elusive genes/transcripts/splicing events • Improved annotations are needed before we can properly and fully utilize RNA-seq data – e.g. how can we detect a gene as differentially expressed if we don’t yet know it is a gene? RNA-Seq bioinformatics.ca “Scripture” pipeline to identify new genes • Noncoding genes are much more difficult to predict (denovo) and to annotate • lincRNA genes are often spliced but are poorly annotated (discovered only a few years ago) • Spliced aligner (such as TopHat) helps identify spliced mRNAs • Alignments are arranged into quantitative transcript models Image from Guttman et al, 2010. PMID: 20436462 RNA-Seq bioinformatics.ca Software Summary BWA, BWA* Novoalign, MOSAIK Bowtie, TopHat SpliceMap, RNA-Mate, QPALMA ABySS, Velvet Scripture Cufflinks ALEXA-Seq Trans-ABySS (uses BLAT) Image from Haas & Zody, 2010 *Modified exon-junction ‘aware’ BWA, soon to be added to BWA release RNA-Seq bioinformatics.ca Key Acknowledgements • Andrew McPherson, PhD Student with Cenk Sahinalp: for slides, early access, and support in running deFuse • Malachi Griffith, PhD (Post Doc with Marco Marra): for creating the VM and helping run the ALEXA-Seq pipeline • Rodrigo Goya, PhD student with Marco Marra: for hacking BWA, writing SNVmix and general technical support • All the BC Genome Sciences Centre staff and affiliates • CIHR and MSFHR for scholarship funding RNA-Seq bioinformatics.ca We are on a Coffee Break & Networking Session RNA-Seq bioinformatics.ca