Transcript Slide 1

Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module #: Title of Module
2
Module 3
RNA-seq
Ryan Morin
On some specific slides I would
prefer you not reproduce the
material. In those cases I will
use this symbol in the right of
the slide.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.
Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at;
http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
RNA-Seq
bioinformatics.ca
What to expect from this module
• A (hopefully-not-too) biased perspective of a cancer
researcher on the applications/proper analysis of RNAseq data, examples using Illumina data
• What RNA-seq data is (can be) used for
– Depends very heavily on what you are interested in
• Software options for processing RNA-seq data
– Many tools available, choice will depend heavily on interests/needs
– Benefits and pitfalls of individual analysis strategies
• Examples of analyzing data for various applications
– Expression, Alternative Splicing, SNVs, Fusion transcripts
RNA-Seq
bioinformatics.ca
What this module will not cover
• microRNA sequencing (miRNA-seq) data
– For a good example of cancer miRNA-seq data, I recommend a
recent paper by Daniela Witten et al, BMC Biol, 2010 PMID:
20459774.
• Strand-specific RNA-seq data
• Colourspace alignment or any other issues unique to
colourspace RNA-seq data
• The plethora of short-read aligners that are available exception: The few that are specifically tailored for RNAseq
• Things that I have forgotten to include…
RNA-Seq
bioinformatics.ca
What this module does cover
• Options and methods for successful alignment of RNAseq data
• Identification of Single Nucleotide Variants (SNVs) and
caveats specific to RNA-seq
• Analyzing aligned RNA-seq data for differential gene
expression using the R language
• Analyzing paired RNA-seq libraries for alternative
expression/splicing
• Discovery of chimeric transcripts from RNA-seq
RNA-Seq
bioinformatics.ca
Why sequence (cancer) transcriptomes?
• Annotation & Discovery
– New features of the transcriptome including genes, exons, splicing
events, ncRNAs, lincRNAs
– New transcript structures derived from chromosomal aberrations or
structural polymorphisms
– In cancer, novel events may reveal cancer drivers and/or may be
“biomarkers”
• Measurement
– Expression of individual genes/loci and the potential to quantitatively
discriminate isoforms using junction reads and coverage of individual
exons, introns etc.
– Presence, genotype, and expression of alleles (may be relevant to
disease, may reflect underlying epigenetic changes)
RNA-Seq
bioinformatics.ca
How RNA-seq data is generated
Isolate Transcript RNA
AAAAAA
AAAAAA
AAAAAA
AAAAAA
Reverse Transcription
Fragment cDNA
Size Selection
Illumina Sequencing of each end
CAGG
CAAA
GGAG
AAAA
*based on Illumina approach
**strand-specific RNA-seq protocols exist for both Illumina and SOLiD
Slide complements of Andrew McPherson
RNA-Seq
CTGG
GAAA
bioinformatics.ca
Numerous possible analysis strategies
• There is no one ‘correct’
way to analyze RNA-seq
data (though there are
some incorrect ways)
• Two major branches
– Direct alignment of reads
(spliced or unspliced) to
genome or transcriptome
– Assembly of reads followed
by alignment*
or transcriptome
Image from Haas & Zody, 2010
*Assembly is the only option when working with a creature with no genome sequence,
alignment of contigs may be to ESTs, cDNAs etc
RNA-Seq
bioinformatics.ca
Background: Aligning genomic DNA
sequence reads
• The first task after obtaining your reads (and any QC) is to determine
the corresponding site in the reference genome from which they each
derive
– Termed genome alignment (also known as ‘mapping’)
• Many next-gen aligners available, two main approaches
– Store ‘indexed’ genome in memory and map each read successively
(memory footprint proportional to genome size)
– Store indexed reads in memory and scan across reference genome
(memory footprint proportional to read number)
• Pairing information is used to select optimal placement of reads as a
pair
– may allow more mismatches, gapped alignment etc.
RNA-Seq
bioinformatics.ca
Common alignment conventions for
next-gen data
• Multi-map reads (i.e. reads that can be placed in multiple
locations with equal score) have only one location
reported
– Some aligners allow all alignments to be reported (or up to
some maximal number)
• Format/details of alignment output vary
– Many groups/applications are beginning to standardize towards
SAM format
• Efficient representation of alignment
• Fast retrieval, compatible with next-gen data viewers (e.g. IGV)
– BAM file is the starting point for most analyses
RNA-Seq
bioinformatics.ca
Drawbacks for each strategy
• Alignment to genome
– Computationally expensive
– It is never a good idea to simply align RNA-seq data to the genome Need a
spliced aligner or a surrogate (such as including exon-exon junction
sequences in ‘genome’)
• Alignment to transcriptome
– Reads deriving from non-genic structures may be ‘forcibly’ (and erroneously)
aligned to genes
• Incorrect gene expression values
• False positive SNVs
• Many other potential problems
• Assembly
– Low expression = difficult/impossible to assemble
– Misassemblies/fragmented contigs due to repeats
– Requires vast amounts of memory
RNA-Seq
bioinformatics.ca
Benefits of each approach
• Alignment to genome
– Allow reads from unannotated loci, introns et cetera to align to their
correct locations… potential for new biological insights
• Alignment to transcriptome
– Computationally inexpensive
– Spliced (exon junction) reads map correctly
– Pairing distance and junction reads may help distinguish individual
isoforms (informative/unique regions of transcripts)
• Assembly
– Can provide a more long-range view of transcripts
– Allows detection of chimeric transcripts and resolution of ‘breakpoints’
– May not necessarily need a genome
RNA-Seq
bioinformatics.ca
Why not treat RNA-seq reads like genomic
DNA sequence?
• Reads crossing one or more exon-exon junction will not align
properly to the genome (longer reads, bigger issue)
• Reads deriving from a fragment that spans multiple exons will
have a large pairing distance
– skews any estimation of fragment length
– Incompatible with any aligner requiring a sane pairing distance distribution
or hard limit on pairing distance
– More bias when correcting/compensating for gene length
How it should look
if aligned properly
large
pairing
distance
(expected)
How it will look
if aligned improperly
(only to genome)
RNA-Seq
bioinformatics.ca
Misalignments can cause many problems
and introduce bias
• Example of junction reads being forcibly aligned to exons
• Better alignments provide more signal and less noise
– Reduce false positive SNV calls
– Better resolution/discrimination of splicing events, expression
Correct alignments
Same data, aligned
with MAQ to the same
reference sequence
Systematic alignment
artefacts introduce false
positive SNV calls (boo!)
RNA-Seq
Truncating mutation in a
tumour suppressor gene
(yay!)
bioinformatics.ca
Solution 1: Include exon junction
sequences in genome
• A common strategy is to append all exon junction
sequences to reference genome (read length-1 from
either side of junction)
– Exact implementation varies (e.g. what to do for shorter exons?)
– For MAQ, concatenating them in a single block at the end of the
corresponding chromosome worked to a degree (considered the same
chromosome)
– Some aligners (e.g. BWA, MAQ)
that use pairing distance
Genome
to find optimal alignment
Transcript model
may force incorrect alignments
Exon junction
RNA-Seq
bioinformatics.ca
BWA can force incorrect alignments to
meet pairing criteria
• Even if a better junction alignment is available for a mate, BWA
may choose the suboptimal alignment that pairs two reads
correctly (was not designed for RNA-seq data)
– Can force a junction read to align to a nearby exon rather than the junction
sequence (which is considered very distant in the internal representation of the
genome)
– May force pairing between two junctions that are considered nearby even if
completely unrelated
>chr1
AGAGCACAT…
>junction1,geneA
FASTA ACTGACCTANNN
>junction2,geneB
CCGGATTAANNN
>junction3,geneC
GATTACAAANNN
RNA-Seq
BWA internal representation (concatenated genome)
AGAGCACAT…ACTGACCTANNNCCGGATTAANNNGATTACAAANNN
Internally, two junctions (here, from different genes)
may be considered a ‘fragment length’ apart
Note: A string of N’s should be included to prevent alignments
between neighboring sequences
bioinformatics.ca
Remedy for the pairing conundrum
• Modified BWA to understand what a junction sequence is
and where it belongs in the genome (Implementation by
Rodrgio Goya, BCGSC)
• When assessing pairing, BWA considers the parent
genomic position of either half of the junction rather
than artificial position in concatenated genome
• This is a work in progress, but be on the lookout for when
it is added to an upcoming BWA release
RNA-Seq
bioinformatics.ca
Solution 2: Align to genome and
transcriptome separately
• Reads are aligned to the genome and transcriptome as
separate processes
• Alignments to transcriptome must be re-positioned onto
their corresponding genomic location (representing
introns as large gaps)
• Compare the alignments for each read and choose best
alignment (either genomic or cDNA) based on
mismatches, pairing etc.
BEFORE: cDNA alignment (SAM format, columns 1-6)
SOLEXA12_1:7:94:1231:901
163
ENST00000320356 461
30
AFTER: ‘repositioned’ genomic alignment
SOLEXA12_1:7:94:1231:901
163
chr7
148157868 30
RNA-Seq
75M
Intron
6M2785N69M
bioinformatics.ca
Solution 3: One of a handful of spliced
aligners
– G-Mo.R-Se (Denoeud et al, 2008. PMID: 19087247)
– QPALMA (de Bona et al, 2008. PMID: 18689821)
– TopHat (Trapnell et al, 2009. PMID: 19289445)*
•
•
•
•
•
Maps read first to genome (using bowtie)
Builds database of possible exon junctions near putative exons
Uses pairing information to expand list of possible juncions
Aligns unaligned reads to putative exon junctions
Portions of reads mapping to discrete locations allow discovery of non-canonical
splice junctions
– SpliceMap (Fai Au et al, 2010. PMID: 20371516)*
• Maps reads in halves (>=50nt reads only ) and extends partial alignments
• Similar strategy to more recent versions of TopHat
– RNA-mate (Cloonan et al, 2010. PMID: 19648138)
• Iterative read trimming/re-mapping, Colourspace friendly
Other: BLAT, BFAST, GSNAP
*Output in SAM format available
RNA-Seq
bioinformatics.ca
TopHat with longer “short reads”
• New features in TopHat (not in paper) add additional
functionality with longer reads (and paired reads)
– Split portions of reads are mapped independently as 25mers
– Read coverage within potential introns no-longer restricts the
consideration of possible splice junctions
• This is important as many libraries have slight to extreme coverage of
introns
– Mates are used to guide where splice site could be
Un-sequenced portion
(length approximately known = L)
Read 1
Read 2
Unmapped
portion of read 1
Mapped portions of read 1
RNA-Seq
Mapped portions of read2
bioinformatics.ca
Solution 4: Assembly prior to/instead of
alignment
• A few de-novo assemblers have been shown to successfully
assemble RNA-seq data
– ABySS (Simpson et al, 2009. PMID: 19251739; Birol et al, 2009. PMID:
19528083), Velvet (Zerbino & Birney, 2008. PMID: 18349386)
• Pairing information can be leveraged to position contigs that are
adjacent (i.e. in the same transcript)
• After assembly, contigs are usually aligned to the reference
genome for further analysis (e.g. using BLAT)
– Novel splice junctions
– Chimeric transcripts from genomic rearrangements
– Un-annotated regions of genes (e.g. exons) or entirely novel genes
RNA-Seq
bioinformatics.ca
Identifying SNVs from alignments
(contrast with DNA sequencing)
• The same tools can be used to call variants from RNA-seq
alignments, pick your favourite
– e.g. SNVMix is a BAM-friendly SNV caller designed specifically
for identifying somatic mutations in cancer*
• Multiple caveats specific to RNA sequencing
– Allelic balance will not necessarily be the case (skew between
two alleles is commonly observed)
– SNVs are more likely to be false positives (better/more carefully
produced alignments should minimize this)
– Real (true positive) SNVs may be RNA edits (i.e. not in DNA)
* Rodrigo Goya, Sohrab Shah, http://compbio.bccrc.ca. (Goya et al. 2010. PMID:20130035)
RNA-Seq
bioinformatics.ca
Examples of pesky SNVs
Recurrent RNA edit in COG3
Junction reads misaligned
Skewed expression
Image from: Shah et al, 2010. PMID: 19812674
RNA-Seq
bioinformatics.ca
Getting expression data from alignments
• Expression values can be tabulated for individual gene
loci, transcripts, exons and splice junctions
• Gene expression values typically reported in RPKMFPKM
– Number of reads (actually fragments, for paired reads) per kb of exonic
bases per million reads in the library
– Compensates for variable library size and over-representation of reads from
longer transcripts
• Various software available
– ERANGE (Mortazavi et al, 2008. PMID: 18516045)
– Cufflinks (Trapnell et al, 2010. PMID: 20436464), probably supplants
ERANGE
– DEGseq package (Wang et al, 2010. PMID: 19855105)
– ALEXA-Seq (Griffith et al, in revision)
RNA-Seq
bioinformatics.ca
Analyzing RNA-seq data with R
• Three main packages exist that are specifically intended for
deep digital gene expression (DGE)
– DEGseq (Wang et al, 2010. PMID: 19855105)
• Specifically designed for RNA-seq data
• Contains implementations of other methods described for analyzing RNA-seq data
• Works on data with replicates or without (i.e. library pairs)
– edgeR (Robinson & Smythe, 2007. PMID: 19910308)
• Applicable for RNA-seq and other types of digital gene expression data
• Attempts to compensate for “overdispersion” unique to digital expression data
• Works on data with replicates or without (i.e. library pairs)
– goseq (Young et al, 2010. PMID: 20132535)
• Identifying over- or under-represented functional classes
• Takes list of differential genes as input, corrects for gene length bias
RNA-Seq
bioinformatics.ca
“Gene” expression is the tip of the
iceberg
• Abundance information from junction reads and
coverage of individual exons and introns is ignored by
many currently available approaches
– Information of expression of individual isoforms
– Retained introns and extended exons will be completely missed
– Ideally, read pairing distance could also be used to determine
the most likely isoform(s) present
RNA-Seq
bioinformatics.ca
ALEXA-Seq: ALternative EXpression
Analysis with RNA-Seq
• A software/database suite that enables detection and
quantitation of genes, exons and alternative splicing
events (known and some novel)
• Based on Ensembl gene annotations, can be applied to
any species in Ensembl
• Can detect may types of alternative splicing events
• Exon skips, retained introns, alternative exon
boundaries
• Major application is paired analysis of two libraries
(tissue A vs B, treated vs untreated etc)
RNA-Seq
bioinformatics.ca
ALEXA-seq preparation
RNA-Seq
bioinformatics.ca
ALEXA-seq Analysis Pipeline
RNA-Seq
bioinformatics.ca
www.alexaplatform.org
RNA-Seq
bioinformatics.ca
Multiple cancer datasets can be browsed
RNA-Seq
bioinformatics.ca
Differential alternative expression in an
‘evolved’ cancer cell line
RNA-Seq
bioinformatics.ca
Significant AE of UMPS in a drugresistant cancer cell line
Significant
events are
presented in
the context of
the gene model
Differentially abundant
exon junctions and exons
are highlighted
RNA-Seq
Griffith et al, 2008. PMID: 18235430
bioinformatics.ca
Identifying Genomic Rearrangements in
Cancer
gene X
gene Y
rearrangement
IGH
BCL2 IGH
BCL2 IGH
BCL2
IGH
BCL2
IGH
fusion X-Y
IGH
• Translocations, inversions and
deletions can bring two gene
loci in close proximity
• Transcriptional apparatus may
co-transcribe exons from each
gene resulting in a fusion
transcript
• Shown is a common
translocation found in
lymphoma, detectable by
RNA-seq
Image courtesy of Celie Goya Veyret
RNA-Seq
bioinformatics.ca
Fusion transcripts can be identified by
alignment or assembly
• deFuse* (Andrew McPherson, unpublished):
– Alignment to transcriptome
– Identification of putative fusions
– Targeted assembly of breakpoint/fusion sequence
• Trans-ABySS** (Gordon Robertson, unpublished)
– Full de novo assembly of entire RNA-seq data set (more depth, better
result)
– Assembly is run at many values of K (de Bruijn graph parameter)
– Contigs are pooled and aligned to genome to identify ‘split’ contigs
*deFuse available at http://compbio.bccrc.ca/?page_id=275
** Trans-ABySS available at http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss
RNA-Seq
bioinformatics.ca
deFuse identifies chimeric transcripts using
transriptome alignment
chimeric transcripts
RNA-Seq
non-chimeric
reads
spanning
reads
split
reads
CTAG
CAAA
CTAG
CTGG
GGAG
CTAG
GGAG
CTAG
Reads
CAAA
GAAA
AAAA
CAAA
CTGG
GAAA
AAAA
CAAA
CTGG
Fragments
CTAG
CTAG
concordant
CTAG
GAAA
CTAG
GAAA
CTAG
GAAA
CAAA
CAAA
CTAG
GAAA
discordant
single end
anchored
Alignment
deFuse slides complements of Andrew McPherson
RNA-Seq
bioinformatics.ca
Overview of deFuse strategy
Input: All alignments to spliced and unspliced gene
sequences are considered
1. Generate fragment length distribution
2. Cluster discordant alignments
3. Resolve ambiguous alignments
4. Fusion boundary resolution using targeted split read
analysis
5. Corroborate spanning and split read evidence
RNA-Seq
bioinformatics.ca
Clustering of discordant pairs
Fusion boundary region (yellow):
Region in which the fusion boundary is expected to exist
Overlapping fusion boundary region condition:
Transcript X
Transcript Y
Transcript X
Transcript Y
Fragment length difference:
Fragment length difference can be calculated as abs(dX + dY)
dX
Transcript X
RNA-Seq
dY
Transcript Y
bioinformatics.ca
Resolving fusion sequence using split
read alignment
a. Calculate combined fusion boundary region of each cluster
Transcript X
SX
SY
Transcript Y
b. Find candidate discordant and single end anchored reads with
potential alignments to fusion boundary regions
Transcript X
SX
SY
Transcript Y
c. Align each candidate read to fusion boundary regions
• Dynamic programming based local alignment to find split positions
• Backtracking not required to find all split positions with maximum score
d. Cluster split positions by implied fusion
boundary
e. Select cluster with the most reads
split near the middle
RNA-Seq
SX
SY
bioinformatics.ca
RNA-seq as an annotation and gene
discovery tool
• We often will not know what we are looking for until we
have found it!
• There is interest in discovering the remaining elusive
genes/transcripts/splicing events
• Improved annotations are needed before we can
properly and fully utilize RNA-seq data
– e.g. how can we detect a gene as differentially expressed if we
don’t yet know it is a gene?
RNA-Seq
bioinformatics.ca
“Scripture” pipeline to identify new
genes
• Noncoding genes are much
more difficult to predict (denovo) and to annotate
• lincRNA genes are often
spliced but are poorly
annotated (discovered only a
few years ago)
• Spliced aligner (such as
TopHat) helps identify spliced
mRNAs
• Alignments are arranged into
quantitative transcript models
Image from Guttman et al, 2010. PMID: 20436462
RNA-Seq
bioinformatics.ca
Software Summary
BWA, BWA*
Novoalign,
MOSAIK
Bowtie, TopHat
SpliceMap,
RNA-Mate,
QPALMA
ABySS,
Velvet
Scripture
Cufflinks
ALEXA-Seq
Trans-ABySS
(uses BLAT)
Image from Haas & Zody, 2010
*Modified exon-junction ‘aware’ BWA, soon to be added to BWA release
RNA-Seq
bioinformatics.ca
Key Acknowledgements
• Andrew McPherson, PhD Student with Cenk Sahinalp: for
slides, early access, and support in running deFuse
• Malachi Griffith, PhD (Post Doc with Marco Marra): for
creating the VM and helping run the ALEXA-Seq pipeline
• Rodrigo Goya, PhD student with Marco Marra: for hacking
BWA, writing SNVmix and general technical support
• All the BC Genome Sciences Centre staff and affiliates
• CIHR and MSFHR for scholarship funding
RNA-Seq
bioinformatics.ca
We are on a Coffee Break &
Networking Session
RNA-Seq
bioinformatics.ca