
Inference of Allele Specific Isoform
Expression (ASIE) Levels from RNASeq Data
Sahar Al Seesi and Ion Măndoiu
Computer Science and Engineering
CANGS 2012
Outline
• Problem definition
• Challenges and limitations of current
approaches
• ASIE pipeline
– SNVQ
– ReFHap
– Diploid IsoEM
• Results
Gene/Isoform Expression Estimation
[Figure: RNA-Seq workflow. Make cDNA & shatter into fragments; sequence fragment ends; map reads. Reads mapped to exons A to E are used to estimate Gene Expression (GE) for each gene and Isoform Expression (IE) for each isoform.]
Allele Specific Gene/Isoform Expression Estimation
[Figure: the same workflow applied to the two haplotypes H0 and H1. Make cDNA & shatter into fragments; sequence fragment ends; map reads. Reads over exons A to E are assigned to H0 or H1 to estimate Allele Specific Gene Expression (GE) and Allele Specific Isoform Expression (IE).]
Challenges and limitations of current
approaches
• Need for diploid transcriptome
• Existing studies rely on simple allele coverage
analysis at heterozygous SNP sites
– Not isoform specific
– Read mapping bias towards the reference allele
– Use less information, so estimates are less robust
Pipeline for ASIE from RNA-Seq Reads
Hybrid Approach Based on Merging Alignments
[Diagram: mRNA reads are mapped both to the transcript library (giving transcript-mapped reads) and to the genome (giving genome-mapped reads); read merging combines the two sets into the final mapped reads.]
Merging Rules for Short Reads
Genome       Transcripts   Agree?   Hard Merge
Unique       Unique        Yes      Keep
Unique       Unique        No       Throw
Unique       Multiple      No       Throw
Unique       Not mapped    No       Keep
Multiple     Unique        No       Throw
Multiple     Multiple      No       Throw
Multiple     Not mapped    No       Throw
Not mapped   Unique        No       Keep
Not mapped   Multiple      No       Throw
Not mapped   Not mapped    Yes      Throw
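The Hard Merge column above can be written as a small decision function. This is only a sketch of the table: the labels "unique", "multiple", and "unmapped" and the function name hard_merge are hypothetical names chosen for illustration, and the agree flag is assumed to mean that the two unique alignments point to the same genomic location.

```python
def hard_merge(genome: str, transcripts: str, agree: bool = False) -> bool:
    """Return True to keep a short read, False to throw it away.

    genome / transcripts: "unique", "multiple", or "unmapped" (hypothetical labels).
    agree: whether the two unique alignments point to the same location.
    """
    # Two unique alignments are kept only when they agree.
    if genome == "unique" and transcripts == "unique":
        return agree
    # A unique hit in one mapping and no hit in the other is kept.
    if {genome, transcripts} == {"unique", "unmapped"}:
        return True
    # Any multiple mapping, or no mapping at all, is thrown away.
    return False


print(hard_merge("unique", "unique", agree=True))  # True
print(hard_merge("unique", "multiple"))            # False
```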
Merging Local Alignments of ION Reads:
HardMerge at Base-Level
• Input: SAM files with alignments from genome and transcriptome
mapping
• The following alignments are filtered out
– Any local alignments of length <= 15 bases
– All alignments of a read that has alignments on different chromosomes
or different strands
• Key idea: a read base mapped to multiple locations is discarded
• Output alignments are generated from contiguous stretches of non-ambiguously mapped bases, based on the unique genomic location
of these bases
– Subject to the above filtering criteria
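The base-level procedure above might look roughly like the following for a single read. This is a simplified sketch, not the actual HardMerge code: it assumes each local alignment has already been converted to a gapless block in genome coordinates, that genomic position grows with the read offset, and that a stretch breaks wherever the genomic position jumps. The Block type and the function name are hypothetical.

```python
from collections import defaultdict
from typing import List, NamedTuple

MIN_LEN = 16  # local alignments of length <= 15 bases are filtered out


class Block(NamedTuple):
    """One gapless local alignment in genome coordinates (hypothetical type)."""
    chrom: str
    strand: str
    read_start: int    # 0-based offset of the first aligned read base
    genome_start: int  # 0-based genomic position of that base
    length: int


def hard_merge_bases(blocks: List[Block]) -> List[Block]:
    """Merge the genome- and transcriptome-derived alignments of one read."""
    blocks = [b for b in blocks if b.length >= MIN_LEN]
    # Drop the read if nothing is left or it hits several chromosomes/strands.
    if not blocks or len({(b.chrom, b.strand) for b in blocks}) > 1:
        return []
    chrom, strand = blocks[0].chrom, blocks[0].strand

    # Collect every genomic position each read base is mapped to.
    hits = defaultdict(set)
    for b in blocks:
        for k in range(b.length):
            hits[b.read_start + k].add(b.genome_start + k)

    # Key idea: a read base mapped to multiple locations is discarded.
    unique = {r: next(iter(g)) for r, g in hits.items() if len(g) == 1}

    # Emit contiguous stretches (consecutive in both read and genome).
    out, run = [], []
    for r in sorted(unique):
        if run and (r != run[-1] + 1 or unique[r] != unique[run[-1]] + 1):
            out.append(Block(chrom, strand, run[0], unique[run[0]], len(run)))
            run = []
        run.append(r)
    if run:
        out.append(Block(chrom, strand, run[0], unique[run[0]], len(run)))
    # The output is subject to the same length filter.
    return [b for b in out if b.length >= MIN_LEN]
```

For example, if a 50-base read aligns fully at position 1000 in the genome mapping while the transcriptome mapping places its last 20 bases elsewhere on the same chromosome, only the first 30 bases survive as one output alignment.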
HardMerge Example
Input alignments in genome coordinates:
Filter multiple local alignments/sub-alignments
Output alignment:
SNV Detection and Genotyping
J. Duitama, P.K. Srivastava, and I.I. Mandoiu, Towards
Accurate Detection and Genotyping of Expressed Variants from
Whole Transcriptome Sequencing Data, BMC Genomics
13(Suppl 2):S6, 2012
• A reliable hybrid mapping strategy
• Bayesian model for SNV detection based on
quality scores
SNVQ Model
• Calculate conditional probabilities by multiplying
contributions of individual reads
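As an illustration of multiplying per-read contributions, here is a generic quality-aware Bayesian genotype calculation for one locus. It is a sketch under stated assumptions and not necessarily the exact SNVQ model: each read is assumed to come from either allele of a diploid genotype with probability 1/2, the error probability is derived from the Phred score as 10^(-Q/10), an erroneous base is assumed equally likely to be any of the other three bases, and the prior is uniform unless supplied. The function name is hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_posteriors(bases, quals, priors=None):
    """Posterior over the 10 diploid genotypes given observed bases and Phred scores."""
    genotypes = list(combinations_with_replacement(BASES, 2))
    if priors is None:
        # Assumption: uniform prior over genotypes.
        priors = {g: 1.0 / len(genotypes) for g in genotypes}

    log_post = {}
    for g in genotypes:
        total = math.log(priors[g])
        for base, q in zip(bases, quals):
            # Error probability from the quality score, capped at a random guess.
            eps = min(10 ** (-q / 10), 0.75)
            # Each read contributes one factor; the read is drawn from
            # either allele of the genotype with probability 1/2.
            p = sum(0.5 * ((1 - eps) if base == allele else eps / 3) for allele in g)
            total += math.log(p)
        log_post[g] = total

    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    unnorm = {g: math.exp(v - top) for g, v in log_post.items()}
    z = sum(unnorm.values())
    return {g: v / z for g, v in unnorm.items()}


post = genotype_posteriors("AAAAGGGG", [30] * 8)
print(max(post, key=post.get))  # ('A', 'G')
```

Working in log space keeps the product of many small per-read factors from underflowing at high coverage.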
Accuracy per Coverage Bins
ReFHap
J. Duitama, T. Huebsch, G. McEwen, E. Suk, and M.R.
Hoehe, ReFHap: A Reliable and Fast Algorithm for Single
Individual Haplotyping, Proc. 1st ACM Intl. Conf. on
Bioinformatics and Computational Biology, pp. 160-169, 2010
• Problem Formulation
– Alleles for each locus are encoded with 0 and 1
– Fragment: Aligned read showing co-occurrence of two or
more alleles in the same chromosome copy
Locus   1  2  3  4  5  6  7  8  9  ...
f       -  0  1  1  -  1  -  0  0  ...
Problem Formulation
• Input: Matrix M of m fragments covering n loci
Locus   1  2  3  4  5  ...  n
f1      1  1  0  -  1  ...  -
f2      -  0  1  0  0  ...  1
f3      -  0  0  0  1  ...  -
...
fm      -  -  -  -  1  ...  0
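A fragment matrix like this can be held as strings over 0, 1, and - (locus not covered). The sketch below scores every pair of fragments by mismatches minus matches over shared loci, which is the kind of edge weight a max-cut style haplotyping heuristic could use: a positive score suggests the two fragments come from different chromosome copies. The scoring rule and function names are assumptions for illustration, not necessarily what ReFHap implements, and the toy fragments are loosely based on the first five loci of f1 to f3.

```python
from itertools import combinations
from typing import Dict, Tuple


def fragment_score(f1: str, f2: str) -> int:
    """Mismatches minus matches over loci covered by both fragments."""
    score = 0
    for a, b in zip(f1, f2):
        if a == "-" or b == "-":
            continue  # locus not covered by one of the fragments
        score += 1 if a != b else -1
    return score


def fragment_graph(fragments: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Weighted edges between every pair of fragments that share a locus."""
    edges = {}
    for (n1, f1), (n2, f2) in combinations(fragments.items(), 2):
        if any(a != "-" and b != "-" for a, b in zip(f1, f2)):
            edges[(n1, n2)] = fragment_score(f1, f2)
    return edges


M = {"f1": "110-1", "f2": "-0100", "f3": "-0001"}
print(fragment_graph(M))
# {('f1', 'f2'): 3, ('f1', 'f3'): -1, ('f2', 'f3'): 0}
```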
ReFHap vs HapCUT
IsoEM: Isoform Expression Level Estimation
• Expectation-Maximization algorithm
• Unified probabilistic model incorporating
– Single and/or paired reads
– Fragment length distribution
– Strand information
– Base quality scores
– Repeat and hexamer bias correction
Read-isoform compatibility
$w_{r,i} = \sum_a O_a Q_a F_a$
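Given read-isoform compatibility weights, an Expectation-Maximization loop can be sketched as below. This is a simplified illustration and not the IsoEM implementation: it takes the weights as given, estimates only the fraction of reads coming from each isoform, and leaves out length normalization and the bias corrections listed earlier. The function name and input layout (one dictionary of isoform index to weight per read) are hypothetical.

```python
from typing import Dict, List


def em_isoform_fractions(weights: List[Dict[int, float]],
                         n_isoforms: int,
                         n_iter: int = 100) -> List[float]:
    """Estimate the fraction of reads from each isoform.

    weights[r][i] is the compatibility weight of read r with isoform i;
    isoforms missing from weights[r] are treated as incompatible.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms.
        for w in weights:
            z = sum(theta[i] * w_ri for i, w_ri in w.items())
            if z == 0:
                continue  # read compatible with nothing under current theta
            for i, w_ri in w.items():
                counts[i] += theta[i] * w_ri / z
        # M-step: renormalize the expected read counts.
        total = sum(counts)
        if total == 0:
            break
        theta = [c / total for c in counts]
    return theta


reads = [{0: 1.0}] * 3 + [{1: 1.0}] + [{0: 1.0, 1: 1.0}] * 2
print([round(t, 3) for t in em_isoform_fractions(reads, 2)])  # [0.75, 0.25]
```

In the toy input, three reads are unique to isoform 0, one is unique to isoform 1, and two are ambiguous; the ambiguous reads end up split 3:1 in line with the unique evidence.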
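Fragment length distribution
• Paired reads
[Figure: a read pair aligned to isoforms i and j over exons A, B, and C, with fragment length probabilities F_a(i) and F_a(j).]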
IsoEM vs. Cufflinks 1.0.3 on ION reads
[Bar chart: R² for IsoEM/Cufflinks estimates vs. qPCR, y-axis from 0 to 0.8. Bars: IsoEM HBR, Cufflinks HBR, IsoEM UHR, Cufflinks UHR.]
Simplified Pipeline for ASIE in F1 Hybrids
[Pipeline diagram. Inputs: short reads (FASTA), the reference transcriptome (exons A, B, C; isoforms ABC and AC), and the parental genome sequences (two copies of chrX shown). Steps: generate isoform sequences from the parental genomes to build the diploid transcriptome; align the short reads to the diploid transcriptome to obtain the allele specific read mapping; run IsoEM to obtain allele specific expression levels for each copy of isoforms ABC and AC.]
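The "generate isoform sequences" step can be sketched as cutting the annotated exons out of each parental genome and concatenating them, once per parent. The sketch assumes forward-strand isoforms and parental genomes that share the same coordinates (SNVs only, no indels). The exon sequences are the ones shown in the diagram; the intron sequence, the SNV in exon B, and all function names are made up for illustration.

```python
from typing import Dict, List, Tuple

Exons = List[Tuple[int, int]]  # 0-based, half-open exon intervals


def isoform_sequence(chrom_seq: str, exons: Exons) -> str:
    """Concatenate the exon sequences of one isoform."""
    return "".join(chrom_seq[start:end] for start, end in exons)


def diploid_transcriptome(parents: Dict[str, Dict[str, str]],
                          annotation: Dict[str, Tuple[str, Exons]]) -> Dict[str, str]:
    """One copy of every isoform per parental genome, keyed as 'isoform|parent'."""
    out = {}
    for parent, genome in parents.items():
        for iso, (chrom, exons) in annotation.items():
            out[f"{iso}|{parent}"] = isoform_sequence(genome[chrom], exons)
    return out


# Toy data: exons from the diagram, made-up intron and SNV.
A, B, C = "AAAAATGTTGAGCC", "ATAAATACCATCAC", "TTTGAAGTATTC"
INTRON = "GTAAGT"         # made-up intron sequence
B_ALT = "ATAAATGCCATCAC"  # exon B with a made-up A>G SNV
h0 = A + INTRON + B + INTRON + C
h1 = A + INTRON + B_ALT + INTRON + C

a = (0, len(A))
b = (a[1] + len(INTRON), a[1] + len(INTRON) + len(B))
c = (b[1] + len(INTRON), b[1] + len(INTRON) + len(C))
annotation = {"ABC": ("chrX", [a, b, c]), "AC": ("chrX", [a, c])}

transcripts = diploid_transcriptome({"H0": {"chrX": h0}, "H1": {"chrX": h1}}, annotation)
for name, seq in transcripts.items():
    print(name, seq)
```

In this toy case the two copies of isoform AC are identical, so reads from AC cannot be assigned to a parent, while the two copies of ABC differ at the SNV in exon B.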
Whole Brain RNA-Seq Data - Sanger Institute Mouse Genomes Project
Strain   SNPs         Private SNPs
C57BL    9,844        1,488
BALBc    3,920,925    29,973
A/J      4,198,324    44,837
CAST     17,673,726   5,368,019
SPRET    35,441,735   23,455,525
Strain/Hybrid   Number of read pairs   Number of mapped read pairs   Percentage of mapped pairs
C57BL           57,187,342             21,756,070                    38.044
BALBc           62,465,347             28,358,653                    45.399
A/J             46,993,887             22,449,227                    47.771
CAST            54,569,423             22,307,194                    40.879
SPRET           57,411,555             19,016,949                    33.124
C57BLxBALBc     114,374,684            47,682,108                    41.689
C57BLxAJ        93,987,774             35,353,398                    37.615
C57BLxCAST      109,138,846            43,134,951                    39.523
C57BLxSPRET     114,374,684            40,780,806                    35.655
Hybrid (C57BLxStrain)   C57BL IE Pearson   Strain IE Pearson   C57BL GE Pearson   Strain GE Pearson
C57BLxBALBc             0.705              0.675               0.706              0.675
C57BLxAJ                0.855              0.902               0.856              0.903
C57BLxCAST              0.872              0.824               0.924              0.882
C57BLxSPRET             0.952              0.726               0.951              0.725
Correlation between FPKM values, for each strain, inferred
from the separate strain RNA-Seq reads vs. the pooled reads
of the two strains (synthetic hybrid)
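For reference, a Pearson correlation between two lists of FPKM values can be computed as in the sketch below. The function name and the toy inputs are illustrative only and are not taken from the data above; whether the reported values were computed on raw or log-transformed FPKM is not stated here.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either sequence is constant.
    return sxy / math.sqrt(sxx * syy)


# Toy values: estimates from separate reads vs. from the pooled reads.
separate = [1.0, 2.0, 3.0, 4.0]
pooled = [2.0, 4.0, 6.0, 8.0]
r = pearson(separate, pooled)
print(round(r, 3), round(r * r, 3))  # 1.0 1.0
```

Squaring r gives the R² of a simple linear fit between the two sets of estimates.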
Allele Specific Isoform Expression for
Synthetic Hybrid C57BLxAJ
R² = 0.73
R² = 0.81
Correlation between FPKM values, for each strain, inferred from the separate
strain RNA-Seq reads vs. the pooled reads of the two strains (synthetic hybrid)
Allele Specific Isoform Expression for
Synthetic Hybrid C57BLxCAST
R² = 0.76
R² = 0.68
Correlation between FPKM values, for each strain, inferred from the separate
strain RNA-Seq reads vs. the pooled reads of the two strains (synthetic hybrid)
Allele Specific Expression
on Drosophila RNA-Seq data from
[McManus et al. 10]
[Figure: two log-scale scatter plots (axes from 1E-09 to 0.1): D.Mel. vs. D.Mel. in parental pool, and D.Sec. vs. D.Sec. in parental pool. Reported fits: R² = 0.8922 and R² = 0.9333.]
Allele Specific Expression for Mouse
RNA-Seq Data from [Gregg et al. 2010]
Conclusion
• Proposed novel RNA-Seq analysis pipeline
– Reconstructs diploid transcriptome
– Not affected by mapping bias towards reference
allele
– Estimation of allele specific expression levels of
isoforms
– Robust estimation based on all reads
What’s Next?
• Test whole pipeline
• Use read coverage information at SNVs along
with max cut sizes in ReFHap to phase isolated
SNPs
• Incorporate flowgram data, when available, in
SNV detection
• Deploy on Galaxy
• Develop ASIE plugin for Ion Torrent
Acknowledgments
• Ion Mandoiu (UConn)
• Jorge Duitama (KU Leuven)
• Marius Nicolae (UConn)
• Alex Zelikovsky (GSU)
• Serghei Mangul (GSU)
• Adrian Caciula (GSU)
• Dumitru Brinza (Life Tech)
• Pramod Srivastava (UCHC)