Computational Methods for Analysis of Single Cell RNA-Seq Data Ion Măndoiu Computer Science & Engineering Department University of Connecticut [email protected].
Download ReportTranscript Computational Methods for Analysis of Single Cell RNA-Seq Data Ion Măndoiu Computer Science & Engineering Department University of Connecticut [email protected].
Computational Methods for Analysis of Single Cell RNA-Seq Data Ion Măndoiu
Computer Science & Engineering Department University of Connecticut [email protected]
Outline
• • • Intro to RNA-Seq – – Next-generation sequencing technologies RNA-Seq applications – Analysis challenges for single cell data Typical analysis pipeline for single-cell RNA-Seq – – Primary analysis: reads QC, mapping, and quantification Secondary analysis: cells QC, normalization, clustering, and differential expression – Tertiary analysis: functional annotation Conclusions
2
nd
Gen. Sequencing: Illumina
2
nd
Gen. Sequencing: Illumina
2
nd
Gen. Sequencing: ION Torrent
• • ION Torrent uses a high-density array of micro-machined wells to perform this biochemical process in a massively parallel way • Each well holds a different DNA template generated by emulsion PCR. Beneath the wells is an ion-sensitive layer and beneath that a proprietary ION sensor The sequencer sequentially floods the chip with one nucleotide after another; in each cycle the voltage change recorded at a well is proportional to the number of incorporated bases
3
rd
Gen. Sequencing: PacBio SMRT
6
3
rd
Gen. Sequencing: PacBio SMRT
3
rd
Gen. Sequencing: Oxford Nanopore
http://www.technologyreview.com/article/427677/nanopore-sequencing/
Standard (Bulk) RNA-Seq
AAAAAA AAAAAA AAAAAA Reverse transcribe into cDNA & shatter into fragments Sequence fragment ends Map reads A B C D E Transcriptome reconstruction Gene expression quantification A A D B C E C Isoform expression quantification
Alternative Splicing
Pal S. et all , Genome Research, June 2011
Transcriptome Reconstruction
Common Approaches
• • • De novo (genome independent reconstruction) – Trinity, Oases, TransABySS • de Brujin k-mer graph Genome guided – Scripture • Reports “all” transcripts – Cufflinks, IsoLasso, SLIDE • Minimize set of transcripts explaining reads Annotation guided – RABT • Simulate reads from annotated transcripts
Genome-Guided Transcriptome Reconstruction – Multiple Solutions
1 2 3 4 5 6 7 t t 1 : 1 2 : 1 t 3 : 1 t 4 : 1 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 7 7 7 7
•
Which Solution is Most Likely?
TRIP: select smallest set of transcripts with good statistical fit between fragment length distribution
– empirically determined during library preparation – implied by “mapping” read pairs 500 1 200 2 200 3 200 300 1 200 3 200
TRIP Results
Sens
TP TP
FN PPV
TP TP
FP FScore
2
PPV PPV
Sens
Sens
• 100x coverage, 2x100bp pe reads; annotations for genes
Why Single Cell RNA-Seq?
Macaulay and Voet, PLOS Genetics, 2014
Challenges
•
Low RNA input + low RT efficiency
– Especially problematic for low expression genes
Macaulay and Voet, PLOS Genetics, 2014
Challenges
• •
Stochastic effects (e.g., transcriptional bursting) hard to distinguish from regulated transcriptional heterogeneity PCR amplification bias results in distortion of transcript abundances
SMARTer RNA-Seq Protocol
Correcting PCR Bias using UMIs (STRT-C1)
Islam et al. http://www.nature.com/nmeth/journal/v11/n2/full/nmeth.2772.html
Outline
• • • Intro to RNA-Seq – RNA-Seq applications – Analysis challenges for single cell data Typical analysis pipeline for single-cell RNA-Seq – Primary analysis: reads QC, mapping, and quantification – Secondary analysis: cells QC, normalization, clustering, and differential expression – Tertiary analysis: functional annotation Conclusions
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis 2,5 1 0,5 2 1,5 Lane 1 Lane 2 Lane 3 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73
Read position
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis • • Tools to analyze and preprocess fastq files FASTX ( http://hannonlab.cshl.edu/fastx_toolkit/ ) – Charts quality statistics – Filters sequences based on quality – Trims sequences based on quality – Collapses identical sequences into a single sequence PRINSEQ ( http://prinseq.sourceforge.net/ ) – Generates read length and quality statistics – Filters reads based on length, quality, GC content and other criteria – Trims reads based on length/position or quality scores
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis
RNA-Seq read mapping strategies:
– Ungapped mapping (with mismatches) to genome • Cannot align reads spanning exon-junctions – Local alignment (Smith-Waterman) to genome • Very slow – Spliced alignment to genome • Computationally harder than ungapped alignment, but much faster than local alignment – Mapping on transcript libraries • Fastest, but cannot align reads from un-annotated transcripts – Mapping on exon-exon junction libraries • Cannot align reads overlapping un-annotated exons – Hybrid approaches
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression
Comparison of spliced read mapping tools
Functional analysis Kim et al. http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3317.html
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis •
Cannot use raw read counts (why not?)
Islam et al. http://www.nature.com/nmeth/journal/v11/n2/full/nmeth.2772.html
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis • • • • CPM = count per million – Ignores multireads underestimates expression of genes in large families – Does not normalize for gene length cannot compare CPMs b/w genes – Comparing CPMs between samples assumes similar transcriptome size RPKM/FPKM = reads/fragments per kilobase per million – [Mortazavi et al. 08] Fractionally allocates multireads based on unique read estimates – – Length for multi-isoform genes?
Comparing FPKM between samples assumes similar (weighted) transcriptome size TPM: transcripts per million – Still relative measure of expression, but comparable between samples – Most accurate estimation methods use multireads and isoform level estimation UMI counts – Absolute measure of expression?
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Gene ambiguous reads A A B C C D E Isoform ambiguous reads
Expectation-maximization approach (IsoEM, RSEM)
w r
,
i
i j A A B C C
F a (i) F a (j)
Reads QC Read mapping Quantification
EM Algorithm
0.5
0.5
1 Cells QC 0.5
0.5
0.5
0.5
1 1 Normalization Clustering 0.5
2.5
0.5
1 1.5
Differential expression Functional analysis 0.2
0.2
0.2
0.2
0.2
1. Start with random transcript frequencies 2. Fractionally allocate reads to transcripts 3. Compute
expected
#reads for each transcript
Read mapping Quantification Reads QC
EM Algorithm
Cells QC Normalization Clustering 0.5
2.5
0.5
1 1.5
Differential expression Functional analysis 0.5/6 2.5/6 0.5/6 1/6 1.5/6 1. Start with random transcript frequencies 2. Fractionally allocate reads to transcripts 3. Compute
expected
#reads for each transcript 4. Update transcript frequencies using
maximum likelihood
estimates
Read mapping Quantification Reads QC
EM Algorithm
Cells QC Normalization Clustering Differential expression Functional analysis 0.5/6 2.5/6 0.5/6 1/6 1.5/6 1. Start with random transcript frequencies 2. Fractionally allocate reads to transcripts 3. Compute
expected
#reads for each transcript 4. Update transcript frequencies using
maximum likelihood
estimates 5.
Repeat steps 2-4 until convergence
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Detected genes/cell -- main population Detected genes/cell -- bi-modal distribution Detected genes/cell -- minor population
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Batch effects can be larger than biological effects, but can be corrected by normalization procedures CPM & TPM datasets pre-quantile normalization CPM & TPM datasets post-quantile normalization
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis •
Quantile normalization (Irizarry et al 2002)
Shifts CPM/FPKM/TPM values for each cell to match a reference distribution (e.g., distribution of means) - Highest value gets matched to highest value in reference - 2 nd highest gets mapped to 2 nd highest value in reference - And so on Distribution of TPMs Reference distribution
Reads QC Read mapping Quantification Cells QC Normalization
Principal Component Analysis
• Linear transformation of the data: Clustering Differential expression Functional analysis • – – 1 st component = direction of max. variance 2 nd component = orthogonal on 1 st , max. residual variance Used for dimensionality reduction (ignore high components) – – Visualization for exploratory analysis Feature selection
What makes a good clustering?
•
Homogeneity
: Elements within a cluster are close to each other •
Separation
: Elements in different clusters are further apart from each other Bad clustering Good clustering
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression
Many clustering algorithms!
Algorithm
K-means Fuzzy c-means Clustering (FCM) Hierarchical Clustering (HCS) EM Clustering SNN-Cliq
Parameters
K = Number of clusters K = number of clusters d = Degree of fuzziness Metric = euclidean, seuclidean, cityblock, minkowski, chebychev, cosine, correlation, spearman Method = average, centroid, complete, median, single K = Number of clusters S = Number of initial seeds I = Number of iteration n = Size of the nearest neighbor list r = Density threshold of quasi-cliques m = Threshold on the overlapping rate for merging. Functional analysis
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis
K-Means Clustering
• Goal: find K clusters minimizing the mean squared distance from data points to corresponding cluster centroids
Reads QC Read mapping Quantification
K-Means Clustering
5 Cells QC Normalization Clustering 4 k 1 Differential expression Functional analysis 3 2 1 0 0 k 2 k 3 1 2 3
expression in condition 1
4 5
Reads QC Read mapping Quantification
K-Means Clustering
5 Cells QC Normalization Clustering 4 k 1 Differential expression Functional analysis 3 2 1 0 0 k 2 k 3 1 2 3
expression in condition 1
4 5
Reads QC Read mapping Quantification
K-Means Clustering
5 Cells QC Normalization Clustering k 1 4 Differential expression Functional analysis 3 2 1 k 2 k 3 0 0 1 2 3
expression in condition 1
4 5
Reads QC Read mapping Quantification
K-Means Clustering
5 Cells QC Normalization Clustering Differential expression Functional analysis k 1 4 3 2 1 0 0 k 2 k 3 1 2 3
expression in condition 1
4 5
Reads QC Read mapping
Rand Index (RI)
Quantification Cells QC Normalization RI= (TP+TN)/(TP+FP+FN+TN) Clustering Differential expression Functional analysis
Accuracy measures
Purity
𝑃𝑢𝑟𝑖𝑡𝑦 = 1 𝑁 𝑖 𝑣 𝑖 ∩ 𝑢 𝑗 U: set of ground truth classes; V: set of the computed clusters; N:total # of objects in dataset
Adjusted Rand Index (AR)
𝐴𝑅𝐼 = 𝑁 2 𝑇𝑃 + 𝑇𝑁 − [(𝑇𝑃 + 𝐹𝑃)(𝑇𝑃 + 𝐹𝑁) + (𝐹𝑁 + 𝑇𝑁)(𝐹𝑃 + 𝑇𝑁)] 𝑁 2 2 − [(𝑇𝑃 + 𝐹𝑃)(𝑇𝑃 + 𝐹𝑁) + (𝐹𝑁 + 𝑇𝑁)(𝐹𝑃 + 𝑇𝑁)]
F 1 Score
F 1 Score= 2 ×TP/(2×TP+FP+FN)
Mirkin’s index (MI) Hubert’s index (HI) Corr
It counts the number of disagreements in data pairs between two clustering. It is the ratio of the number of disagreeing pairs to the total number of pairs. Lower value of Mirkin’s index indicates better clustering.
HI = RI – MI Maximum weighted Pearson correlation between average expression value of each class at ground truth and computed cluster
Reads QC Read mapping Quantification
Accuracy comparison
(Pollen et al. 2014, MiSeq) Cells QC Normalization Clustering Differential expression Functional analysis
Reads QC Read mapping Quantification
Accuracy comparison
(Pollen et al. 2014, HiSeq) Cells QC Normalization Clustering Differential expression Functional analysis
Reads QC Read mapping Quantification
Accuracy comparison
(Zeisel et al. 2015) Cells QC Normalization Clustering Differential expression Functional analysis
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Tests for differential gene expression must take both fold change and statistical significance into account DE * * FC = 2 FC = 2 FC = 1.5
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis • • Many reliable DE methods for data with replicates – edgeR [Robinson et al., 2010] – DESeq [Anders et al., 2010] When no/few replicates available bootstrapping provides a robust alternative
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Sensitivity results on Illumina MCF-7 data with varying number of replicates and minimum fold change 1.5
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis
Gene expression table Enrichment Table Spindle Apoptosis 0.00001
0.00025
ENRICHMENT TEST
Experimental Data Interpretation & Hypotheses Gene-set Databases A priori knowledge + existing experimental data
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis http://david.abcc.ncifcrf.gov/
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis http://www.genemania.org/
Outline
• • • Intro to RNA-Seq – RNA-Seq applications – Analysis challenges for single cell data Typical analysis pipeline for single-cell RNA-Seq – Primary analysis: reads QC, mapping, and quantification – Secondary analysis: cells QC, normalization, clustering, and differential expression – Tertiary analysis: functional annotation Conclusions
Conclusions
• • • • The range of single-cell applications continues to expand, fueled by advances in microfluidics technology and library prep protocols • ATAC-Seq, GT-Seq, Methyl-Seq, … Primary analysis is compute intensive • Requires server/cluster/cloud + linux + scripting • Galaxy framework ( https://usegalaxy.org/ ) provides web based interface to many tools Most secondary/tertiary analyses can be done on PC/Mac using R environment (some programming) • Many can be done using web-based tools and user-friendly apps (we’ll use JMP)
Conclusions
• • Development of single-cell specific analysis methods critical for fully realizing the potential of the technology • Allele specific expression • Biomarker selection • Cell type assignment • • Lineage reconstruction Characterization of heterogeneity Joint analysis of bulk and single cell data still needed to get unbiased cell type frequencies • Can also identify and characterize cell types missed by current capture protocols
Single cells or
AND
computational deconvolution
Acknowledgements
Sahar Al Seesi Marius Nicolae Elham Sherafat Craig Nelson Adrian Caciula Serghei Mangul Yvette Temate Tiagueu Alex Zelikovsky Edward Hemphill James Lindsay