Estimation of alternative splicing isoform frequencies from RNA

Download Report

Transcript Estimation of alternative splicing isoform frequencies from RNA

Estimation of alternative splicing isoform frequencies from RNA-Seq data

Marius Nicolae

Computer Science and Engineering Department University of Connecticut Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky

Outline • • • • Introduction EM Algorithm Experimental results Conclusions and future work

[Griffith and Marra 07] Alternative Splicing

RNA-Seq Make cDNA & shatter into fragments Sequence fragment ends Map reads Gene Expression (GE) A B C Isoform Expression (IE) D E Isoform Discovery (ID) C A A D B C E

Gene Expression Challenges • Read ambiguity (multireads) A B C D E • What is the gene length?

Previous approaches to GE • • • • Ignore multireads [Mortazavi et al. 08] – Fractionally allocate multireads based on unique read estimates [Pasaniuc et al. 10] – EM algorithm for solving ambiguities Gene length: sum of lengths of exons that appear in at least one isoform  Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]

Read Ambiguity in IE A A B C C D E

Previous approaches to IE • • • • • [Jiang&Wong 09] – Poisson model + importance sampling, single reads [ Richard et al. 10] — EM Algorithm based on Poisson model, single reads in exons [Li et al. 10] – EM Algorithm, single reads [Feng et al. 10] – Convex quadratic program, pairs used only for ID [Trapnell et al. 10] – Extends Jiang’s model to paired reads – Fragment length distribution

Our contribution • EM Algorithm for IE – Single and/or paired reads – Fragment length distribution – Strand information – Base quality scores – Hexamer bias correction – Annotated repeats correction

Read-Isoform Compatibility

w r

,

i w r

,

i

 

a O a Q a F a

Fragment length distribution • Paired reads

F a (i)

i j

F a (j)

Fragment length distribution • Single reads i j

F a (i) F a (j)

E-step M-step IsoEM algorithm

Speed improvements • Collapse identical reads into read classes Reads LCA(i3,i4) Isoforms i1 i2 i3 i4 i5 i6

Speed improvements • Collapse identical reads into read classes i2 i4 i1 i3 i5 i6 Isoforms • Run EM on connected components, in parallel

Simulation setup • • • Human genome UCSC known isoforms 100000 25000 20000 10000 15000 1000 10000 100 5000 10 0 10 100 1000

Isoform length

10000 100000 1 0 5 10 15 20 25 30 35 40 45 50 55

Number of isoforms

GNFAtlas2 gene expression levels – Uniform/geometric expression of gene isoforms Normally distributed fragment lengths – Mean 250, std. dev. 25

Accuracy measures • • • Error Fraction (EF t ) – Percentage of isoforms (or genes) with relative error larger than given threshold t Median Percent Error (MPE) – Threshold t for which EF is 50% r 2

Error Fraction Curves - Isoforms • 30M single reads of length 25 100 90 40 30 20 10 80 70 60 50 0 0 0,2 0,4 0,6

Relative error threshold

0,8 1 Uniq Rescue UniqLN Cufflinks RSEM IsoEM

Error Fraction Curves - Genes • 30M single reads of length 25 100 90 80 70 60 50 40 30 20 10 0 0 0,2 0,4 0,6

Relative error threshold

0,8 1 Uniq Rescue GeneEM Cufflinks RSEM IsoEM

MPE and EF 15 by Gene Frequency • 30M single reads of length 25

Read Length Effect • 0,978 0,976 Fixed sequencing throughput (750Mb) 25 20 0,974 0,972 15 0,97 0,968 10 0,966 0,964 0,962 25 35 45 55 65

Read length

75 Paired reads Single reads 85 95 5 0 25 35 45 55 65

Read length

75 Paired reads Single reads 85 95

Effect of Pairs & Strand Information • 1-60M 75bp reads 0,985 0,98 0,975 0,97 0,965 0,96 0,955 0,95 0,945 0 RandomStrand-Pairs CodingStrand-pairs RandomStrand-Single CodingStrand-single 10 000 000 20 000 000 30 000 000

# reads

40 000 000 50 000 000 60 000 000

Validation on Human RNA-Seq Data • • ≈8 million 27bp reads from two cell lines [Sultan et al. 10] 47 genes measured by qPCR [Richard et al. 10]

Cufflinks

3 2,5 2 1,5 1 1 1,5 2 2,5

Log(qPCR)

3 R² = 0,4604 3,5 4

IsoEM

3 2,5 2 1,5 1 1 1,5 2 2,5

Log(qPCR)

3 R² = 0,5505 3,5 4

Runtime scalability • Scalability experiments conducted on a Dell PowerEdge R900 – Four 6-core E7450Xeon processors at 2.4Ghz, 128Gb of internal memory 160 140 120 100 RandomStrand-Pairs CodingStrand-Pairs RandomStrand-Single CodingStrand-Single 80 60 40 20 0 0 20 30

Conclusions & Future Work • Presented EM algorithm for estimating isoform/gene expression levels – Integrates fragment length distribution, base qualities, pair and strand info – Java implementation available at http://dna.engr.uconn.edu/software/IsoEM/ • Ongoing work – Comparison of RNA-Seq with DGE – – Isoform discovery Reconstruction & frequency estimation for virus quasispecies

Acknowledgments  NSF awards 0546457 & 0916948 to IM and 0916401 to AZ