An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul*, Adrian Caciula*, Nicholas Mancuso*, Ion Mandoiu** and Alexander Zelikovsky* *Georgia.

Download Report

Transcript An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul*, Adrian Caciula*, Nicholas Mancuso*, Ion Mandoiu** and Alexander Zelikovsky* *Georgia.

An Integer Programming Approach to Novel Transcript
Reconstruction from Paired-End RNA-Seq Reads
Serghei Mangul*, Adrian Caciula*, Nicholas Mancuso*, Ion Mandoiu** and
Alexander Zelikovsky*
*Georgia State University, **University of Connecticut
Classification and Existing Approaches
In this work, we propose a novel statistical
“genome guided” method called “Transcriptome
Reconstruction using Integer Programming”
(TRIP) that incorporates fragment length
distribution into novel transcript reconstruction
from paired-end RNA-Seq reads. To reconstruct
novel transcripts, we create a splice graph based
an exact annotation of exon boundaries and
RNA-Seq reads. The exact annotation of exons
can be obtained from annotation databases (e.g.,
Ensembl) or can be inferred from aligned RNASeq reads. A splice graph is a directed acyclic
graph (DAG), whose vertices represent exons
and edges represent splicing events. We
enumerate all maximal paths in the splice graph
using a depth-first-search (DFS) algorithm. These
paths correspond to putative transcripts and are
the input for the TRIP algorithm.
Splice Graph
Simulation Setup
GIR : Genome independent reconstruction (de novo)
– k-mer graph
–
Scripture(2010)
•
Reports “all” transcripts
Cufflinks(2010), IsoLasso(2011), SLIDE(2012)
•
Reports a minimal set of transcripts
–
pseudo-exons
single reads
Trinity(2011), Velvet(2008), TransABySS(2008)
•
Euler/de Brujin k-mer graph
25000
20000
15000
10000
5000
0
10
1
AGR : Annotation-guided reconstruction
– Explicitly use existing annotation during
assembly
–
Human genome UCSC annotations
Genome
GGR : Genome-guided reconstruction (ab initio)
– Exon identification
– Genome-guided assembly
–
exons
Number of transcripts
ABSTRACT
2
3
6
5
4
7
8
9
t2
How to filter?
INTRODUCTION
t1
Transcript length
Accuracy measures
Sensitivity:
- portion of the annotated transcripts being
captured by assembly method
TP
Sens . 
TP  FN
500
Roche/454 FLX Titanium
400-600 million reads/run
400bp avg. length
Mean : 500; Std. dev. 50
1
2
3
200
200
200
Garber, M. et al. Nat. Biotechnol.
June 2011
t1 :
1
2
2
3
3
4
4
5
5
6
6
t2 :
3
4
5
6
t4 :
1
1
2
3
3
4
4
5
5
7
7
RNA-Seq
Sequence fragment ends
Map reads
A
Transcriptome
Reconstruction
A
B
A
C
D
E
B
C
D
transcript/Transcript Expression
E
Gene
Expression
C
Transcriptome Reconstruction
Given partial or incomplete information about
something, use that information to make an
informed guess about the missing or unknown
data.
 Construct Splice Graph - G(V,E)
– V : exons
– E: splicing events
 Candidate transcripts
– depth-first-search (DFS)
 Filter candidate transcripts
– fragment length
distribution (FLD)
– integer programming
Cufflinks
1
TRIP
splice graph
0.8
0.6
0.4
0.2
0
1
min  y(t )
2
3
4
5
6
7
# of transcripts per gene
8
Flowchart for TRIP: (a) Positive Predictive Value (PPV)
7
(1)
1.2
 y(t ) x( p),p
1
tT ( p )
(2) p x( p )  N reads n( s1 )
7
IP formulation
TRIP
 Map the RNA-Seq reads to
genome
1.2
1 std. dev.
Transcriptome Reconstruction using Integer Programming
Make cDNA & shatter into fragments
100x coverage, 2x100bp pe reads
annotations for genes and exon boundaries
tT
7
Exon 2 and 6 are “distant” exons : how to phase them?
[Griffith and Marra 07]
200
Naïve IP formulation
Constraints:
t3 :
200
Preliminary Results (in silico)
Objective:
1
3
Variables:
– T(p) - set of candidate transcripts on which pe read
p can be mapped within 1 std. dev.
– y(t) -1 if a candidate transcript t is selected, 0
otherwise
– x(p) - 1 if the pe read p is selected to be mapped
within 1 std. dev.
– n(s1) - expected portion of reads mapped within 1
std. dev.
Alternative Splicing
1
Mean : 500; Std. dev. 50
1
TP
PPV 
TP  FP
PPV
Ion Proton Sequencer
Illumina HiSeq 2000
Up to 6 billion PE reads/run
35-100bp read length
300
Challenges and Solutions
 Read length is currently much shorter then
transcripts length
 Statistical reconstruction method
- fragment length distribution
PPV
- portion of annotated transcript among
assembled transcripts
Sensitivity
SOLiD 4/5500
1.4-2.4 billion PE reads/run
35-50bp read length
http://www.economist.com/node/16349358
10000 100000
t3
Select the smallest set of putative transcripts that yields
a good statistical fit between
– empirically determined during library
preparation
– implied by “mapping” read pairs
Advances in Next Generation Sequencing
1000
 GNFAtlas2 gene expression levels
– Uniform/geometric expression of gene
transcripts
 Normally distributed fragment lengths
–
Mean 500, std. dev. 50
RABT(2011)
•
Simulate reads from annotated transcripts
GGR vs. GIR
100
Constraints:
tT
(1)
Cufflinks
splice graph
1
1,2,3,4 std. dev.
i
(2) N reads (n( si )   )   p xi ( p)  N reads (n( si )   )
Genome
TRIP
0.4
0
 y(t ) x ( p), p, i  1,4
tTi ( p )
0.6
0.2
min  y(t )
Objective:
0.8
(3)i xi ( p )  1, p, i  1,4
Variables:
– Ti(p) - set of candidate transcripts on which pe read
p can be mapped with fragment length between i-1
and i std. dev. , i=1,2,3,4
– y(t) -1 if a candidate transcript t is selected, 0
otherwise
– xi(p) - 1 if the pe read p is selected to be mapped
with fragment length between i-1 and i std. dev. ,
i=1,2,3,4
– n(si) - expected portion of reads mapped with
fragment length between i-1 and i std. dev. ,
i=1,2,3,4
2
3
4
5
6
7
# of transcripts per gene
8
Flowchart for TRIP: (b) Sensitivity
CONCLUSIONS AND FUTURE WORK
 TRIP Algorithm for novel transcript reconstruction
– fragments length distribution
 Ongoing work
– Comparison to other transcriptome
reconstructions methods
– Comparison on real datasets
• Solid pe reads
• Illumina 100x2 pe reads
ACKNOWLEDGEMENTS
• NSF award IIS-0916401
• NSF award IIS-0916948
• Agriculture and Food Research Initiative Competitive
Grant no. 201167016-30331 from the USDA National
Institute of Food and Agriculture
• Second Century Initiative Bioinformatics University
Doctoral Fellowship, Georgia State University