An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul*, Adrian Caciula*, Nicholas Mancuso*, Ion Mandoiu** and Alexander Zelikovsky* *Georgia.
Download ReportTranscript An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul*, Adrian Caciula*, Nicholas Mancuso*, Ion Mandoiu** and Alexander Zelikovsky* *Georgia.
An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul*, Adrian Caciula*, Nicholas Mancuso*, Ion Mandoiu** and Alexander Zelikovsky* *Georgia State University, **University of Connecticut Classification and Existing Approaches In this work, we propose a novel statistical “genome guided” method called “Transcriptome Reconstruction using Integer Programming” (TRIP) that incorporates fragment length distribution into novel transcript reconstruction from paired-end RNA-Seq reads. To reconstruct novel transcripts, we create a splice graph based an exact annotation of exon boundaries and RNA-Seq reads. The exact annotation of exons can be obtained from annotation databases (e.g., Ensembl) or can be inferred from aligned RNASeq reads. A splice graph is a directed acyclic graph (DAG), whose vertices represent exons and edges represent splicing events. We enumerate all maximal paths in the splice graph using a depth-first-search (DFS) algorithm. These paths correspond to putative transcripts and are the input for the TRIP algorithm. Splice Graph Simulation Setup GIR : Genome independent reconstruction (de novo) – k-mer graph – Scripture(2010) • Reports “all” transcripts Cufflinks(2010), IsoLasso(2011), SLIDE(2012) • Reports a minimal set of transcripts – pseudo-exons single reads Trinity(2011), Velvet(2008), TransABySS(2008) • Euler/de Brujin k-mer graph 25000 20000 15000 10000 5000 0 10 1 AGR : Annotation-guided reconstruction – Explicitly use existing annotation during assembly – Human genome UCSC annotations Genome GGR : Genome-guided reconstruction (ab initio) – Exon identification – Genome-guided assembly – exons Number of transcripts ABSTRACT 2 3 6 5 4 7 8 9 t2 How to filter? INTRODUCTION t1 Transcript length Accuracy measures Sensitivity: - portion of the annotated transcripts being captured by assembly method TP Sens . TP FN 500 Roche/454 FLX Titanium 400-600 million reads/run 400bp avg. length Mean : 500; Std. dev. 50 1 2 3 200 200 200 Garber, M. et al. Nat. Biotechnol. June 2011 t1 : 1 2 2 3 3 4 4 5 5 6 6 t2 : 3 4 5 6 t4 : 1 1 2 3 3 4 4 5 5 7 7 RNA-Seq Sequence fragment ends Map reads A Transcriptome Reconstruction A B A C D E B C D transcript/Transcript Expression E Gene Expression C Transcriptome Reconstruction Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data. Construct Splice Graph - G(V,E) – V : exons – E: splicing events Candidate transcripts – depth-first-search (DFS) Filter candidate transcripts – fragment length distribution (FLD) – integer programming Cufflinks 1 TRIP splice graph 0.8 0.6 0.4 0.2 0 1 min y(t ) 2 3 4 5 6 7 # of transcripts per gene 8 Flowchart for TRIP: (a) Positive Predictive Value (PPV) 7 (1) 1.2 y(t ) x( p),p 1 tT ( p ) (2) p x( p ) N reads n( s1 ) 7 IP formulation TRIP Map the RNA-Seq reads to genome 1.2 1 std. dev. Transcriptome Reconstruction using Integer Programming Make cDNA & shatter into fragments 100x coverage, 2x100bp pe reads annotations for genes and exon boundaries tT 7 Exon 2 and 6 are “distant” exons : how to phase them? [Griffith and Marra 07] 200 Naïve IP formulation Constraints: t3 : 200 Preliminary Results (in silico) Objective: 1 3 Variables: – T(p) - set of candidate transcripts on which pe read p can be mapped within 1 std. dev. – y(t) -1 if a candidate transcript t is selected, 0 otherwise – x(p) - 1 if the pe read p is selected to be mapped within 1 std. dev. – n(s1) - expected portion of reads mapped within 1 std. dev. Alternative Splicing 1 Mean : 500; Std. dev. 50 1 TP PPV TP FP PPV Ion Proton Sequencer Illumina HiSeq 2000 Up to 6 billion PE reads/run 35-100bp read length 300 Challenges and Solutions Read length is currently much shorter then transcripts length Statistical reconstruction method - fragment length distribution PPV - portion of annotated transcript among assembled transcripts Sensitivity SOLiD 4/5500 1.4-2.4 billion PE reads/run 35-50bp read length http://www.economist.com/node/16349358 10000 100000 t3 Select the smallest set of putative transcripts that yields a good statistical fit between – empirically determined during library preparation – implied by “mapping” read pairs Advances in Next Generation Sequencing 1000 GNFAtlas2 gene expression levels – Uniform/geometric expression of gene transcripts Normally distributed fragment lengths – Mean 500, std. dev. 50 RABT(2011) • Simulate reads from annotated transcripts GGR vs. GIR 100 Constraints: tT (1) Cufflinks splice graph 1 1,2,3,4 std. dev. i (2) N reads (n( si ) ) p xi ( p) N reads (n( si ) ) Genome TRIP 0.4 0 y(t ) x ( p), p, i 1,4 tTi ( p ) 0.6 0.2 min y(t ) Objective: 0.8 (3)i xi ( p ) 1, p, i 1,4 Variables: – Ti(p) - set of candidate transcripts on which pe read p can be mapped with fragment length between i-1 and i std. dev. , i=1,2,3,4 – y(t) -1 if a candidate transcript t is selected, 0 otherwise – xi(p) - 1 if the pe read p is selected to be mapped with fragment length between i-1 and i std. dev. , i=1,2,3,4 – n(si) - expected portion of reads mapped with fragment length between i-1 and i std. dev. , i=1,2,3,4 2 3 4 5 6 7 # of transcripts per gene 8 Flowchart for TRIP: (b) Sensitivity CONCLUSIONS AND FUTURE WORK TRIP Algorithm for novel transcript reconstruction – fragments length distribution Ongoing work – Comparison to other transcriptome reconstructions methods – Comparison on real datasets • Solid pe reads • Illumina 100x2 pe reads ACKNOWLEDGEMENTS • NSF award IIS-0916401 • NSF award IIS-0916948 • Agriculture and Food Research Initiative Competitive Grant no. 201167016-30331 from the USDA National Institute of Food and Agriculture • Second Century Initiative Bioinformatics University Doctoral Fellowship, Georgia State University