Computational Methods for Analysis of Single Cell RNA-Seq Data Ion Măndoiu Computer Science & Engineering Department University of Connecticut [email protected].

Transcript Computational Methods for Analysis of Single Cell RNA-Seq Data Ion Măndoiu Computer Science & Engineering Department University of Connecticut [email protected].

Computational Methods for Analysis of Single Cell RNA-Seq Data Ion Măndoiu

Computer Science & Engineering Department University of Connecticut [email protected]

Outline

• • • Intro to RNA-Seq – – Next-generation sequencing technologies RNA-Seq applications – Analysis challenges for single cell data Typical analysis pipeline for single-cell RNA-Seq – – Primary analysis: reads QC, mapping, and quantification Secondary analysis: cells QC, normalization, clustering, and differential expression – Tertiary analysis: functional annotation Conclusions

2 Gen. Sequencing: Illumina

2 Gen. Sequencing: ION Torrent

• • ION Torrent uses a high-density array of micro-machined wells to perform this biochemical process in a massively parallel way • Each well holds a different DNA template generated by emulsion PCR. Beneath the wells is an ion-sensitive layer and beneath that a proprietary ION sensor The sequencer sequentially floods the chip with one nucleotide after another; in each cycle the voltage change recorded at a well is proportional to the number of incorporated bases

3 Gen. Sequencing: PacBio SMRT

3 Gen. Sequencing: Oxford Nanopore

http://www.technologyreview.com/article/427677/nanopore-sequencing/

Standard (Bulk) RNA-Seq

AAAAAA AAAAAA AAAAAA Reverse transcribe into cDNA & shatter into fragments Sequence fragment ends Map reads A B C D E Transcriptome reconstruction Gene expression quantification A A D B C E C Isoform expression quantification

Alternative Splicing

Pal S. et all , Genome Research, June 2011

Transcriptome Reconstruction

Common Approaches

• • • De novo (genome independent reconstruction) – Trinity, Oases, TransABySS • de Brujin k-mer graph Genome guided – Scripture • Reports “all” transcripts – Cufflinks, IsoLasso, SLIDE • Minimize set of transcripts explaining reads Annotation guided – RABT • Simulate reads from annotated transcripts

Genome-Guided Transcriptome Reconstruction – Multiple Solutions

1 2 3 4 5 6 7 t t 1 : 1 2 : 1 t 3 : 1 t 4 : 1 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 7 7 7 7

•

Which Solution is Most Likely?

TRIP: select smallest set of transcripts with good statistical fit between fragment length distribution

– empirically determined during library preparation – implied by “mapping” read pairs 500 1 200 2 200 3 200 300 1 200 3 200

TRIP Results

Sens



TP TP



FN PPV



TP TP



FP FScore

 2 

PPV PPV



Sens



Sens

• 100x coverage, 2x100bp pe reads; annotations for genes

Why Single Cell RNA-Seq?

Macaulay and Voet, PLOS Genetics, 2014

Challenges

•

Low RNA input + low RT efficiency

– Especially problematic for low expression genes

Macaulay and Voet, PLOS Genetics, 2014

Challenges

• •

Stochastic effects (e.g., transcriptional bursting) hard to distinguish from regulated transcriptional heterogeneity PCR amplification bias results in distortion of transcript abundances

SMARTer RNA-Seq Protocol

Correcting PCR Bias using UMIs (STRT-C1)

Islam et al. http://www.nature.com/nmeth/journal/v11/n2/full/nmeth.2772.html

Outline

• • • Intro to RNA-Seq – RNA-Seq applications – Analysis challenges for single cell data Typical analysis pipeline for single-cell RNA-Seq – Primary analysis: reads QC, mapping, and quantification – Secondary analysis: cells QC, normalization, clustering, and differential expression – Tertiary analysis: functional annotation Conclusions

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis 2,5 1 0,5 2 1,5 Lane 1 Lane 2 Lane 3 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73

Read position

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis • • Tools to analyze and preprocess fastq files FASTX ( http://hannonlab.cshl.edu/fastx_toolkit/ ) – Charts quality statistics – Filters sequences based on quality – Trims sequences based on quality – Collapses identical sequences into a single sequence PRINSEQ ( http://prinseq.sourceforge.net/ ) – Generates read length and quality statistics – Filters reads based on length, quality, GC content and other criteria – Trims reads based on length/position or quality scores

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis

RNA-Seq read mapping strategies:

– Ungapped mapping (with mismatches) to genome • Cannot align reads spanning exon-junctions – Local alignment (Smith-Waterman) to genome • Very slow – Spliced alignment to genome • Computationally harder than ungapped alignment, but much faster than local alignment – Mapping on transcript libraries • Fastest, but cannot align reads from un-annotated transcripts – Mapping on exon-exon junction libraries • Cannot align reads overlapping un-annotated exons – Hybrid approaches

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression

Comparison of spliced read mapping tools

Functional analysis Kim et al. http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3317.html

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis •

Cannot use raw read counts (why not?)

Islam et al. http://www.nature.com/nmeth/journal/v11/n2/full/nmeth.2772.html

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis • • • • CPM = count per million – Ignores multireads  underestimates expression of genes in large families – Does not normalize for gene length  cannot compare CPMs b/w genes – Comparing CPMs between samples assumes similar transcriptome size RPKM/FPKM = reads/fragments per kilobase per million – [Mortazavi et al. 08] Fractionally allocates multireads based on unique read estimates – – Length for multi-isoform genes?

Comparing FPKM between samples assumes similar (weighted) transcriptome size TPM: transcripts per million – Still relative measure of expression, but comparable between samples – Most accurate estimation methods use multireads and isoform level estimation UMI counts – Absolute measure of expression?

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Gene ambiguous reads A A B C C D E Isoform ambiguous reads

Expectation-maximization approach (IsoEM, RSEM)

w r

i j A A B C C

F a (i) F a (j)

Reads QC Read mapping Quantification

EM Algorithm

0.5

1 Cells QC 0.5

0.5

1 1 Normalization Clustering 0.5

2.5

0.5

1 1.5

Differential expression Functional analysis 0.2

0.2

1. Start with random transcript frequencies 2. Fractionally allocate reads to transcripts 3. Compute

expected

#reads for each transcript

Read mapping Quantification Reads QC

EM Algorithm

Cells QC Normalization Clustering 0.5

2.5

0.5

1 1.5

Differential expression Functional analysis 0.5/6 2.5/6 0.5/6 1/6 1.5/6 1. Start with random transcript frequencies 2. Fractionally allocate reads to transcripts 3. Compute

expected

#reads for each transcript 4. Update transcript frequencies using

maximum likelihood

estimates

Read mapping Quantification Reads QC

EM Algorithm

Cells QC Normalization Clustering Differential expression Functional analysis 0.5/6 2.5/6 0.5/6 1/6 1.5/6 1. Start with random transcript frequencies 2. Fractionally allocate reads to transcripts 3. Compute

expected

#reads for each transcript 4. Update transcript frequencies using

maximum likelihood

estimates 5.

Repeat steps 2-4 until convergence

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Detected genes/cell -- main population Detected genes/cell -- bi-modal distribution Detected genes/cell -- minor population

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Batch effects can be larger than biological effects, but can be corrected by normalization procedures CPM & TPM datasets pre-quantile normalization CPM & TPM datasets post-quantile normalization

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis •

Quantile normalization (Irizarry et al 2002)

Shifts CPM/FPKM/TPM values for each cell to match a reference distribution (e.g., distribution of means) - Highest value gets matched to highest value in reference - 2 nd highest gets mapped to 2 nd highest value in reference - And so on Distribution of TPMs Reference distribution

Reads QC Read mapping Quantification Cells QC Normalization

Principal Component Analysis

• Linear transformation of the data: Clustering Differential expression Functional analysis • – – 1 st component = direction of max. variance 2 nd component = orthogonal on 1 st , max. residual variance Used for dimensionality reduction (ignore high components) – – Visualization for exploratory analysis Feature selection

What makes a good clustering?

•

Homogeneity

: Elements within a cluster are close to each other •

Separation

: Elements in different clusters are further apart from each other Bad clustering Good clustering

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression

Many clustering algorithms!

Algorithm

K-means Fuzzy c-means Clustering (FCM) Hierarchical Clustering (HCS) EM Clustering SNN-Cliq

Parameters

K = Number of clusters K = number of clusters d = Degree of fuzziness Metric = euclidean, seuclidean, cityblock, minkowski, chebychev, cosine, correlation, spearman Method = average, centroid, complete, median, single K = Number of clusters S = Number of initial seeds I = Number of iteration n = Size of the nearest neighbor list r = Density threshold of quasi-cliques m = Threshold on the overlapping rate for merging. Functional analysis

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis

K-Means Clustering

• Goal: find K clusters minimizing the mean squared distance from data points to corresponding cluster centroids

Reads QC Read mapping Quantification

K-Means Clustering

5 Cells QC Normalization Clustering 4 k 1 Differential expression Functional analysis 3 2 1 0 0 k 2 k 3 1 2 3

expression in condition 1

4 5

Reads QC Read mapping Quantification

K-Means Clustering

5 Cells QC Normalization Clustering 4 k 1 Differential expression Functional analysis 3 2 1 0 0 k 2 k 3 1 2 3

expression in condition 1

4 5

Reads QC Read mapping Quantification

K-Means Clustering

5 Cells QC Normalization Clustering k 1 4 Differential expression Functional analysis 3 2 1 k 2 k 3 0 0 1 2 3

expression in condition 1

4 5

Reads QC Read mapping Quantification

K-Means Clustering

5 Cells QC Normalization Clustering Differential expression Functional analysis k 1 4 3 2 1 0 0 k 2 k 3 1 2 3

expression in condition 1

4 5

Reads QC Read mapping

Rand Index (RI)

Quantification Cells QC Normalization RI= (TP+TN)/(TP+FP+FN+TN) Clustering Differential expression Functional analysis

Accuracy measures

Purity

𝑃𝑢𝑟𝑖𝑡𝑦 = 1 𝑁 𝑖 𝑣 𝑖 ∩ 𝑢 𝑗 U: set of ground truth classes; V: set of the computed clusters; N:total # of objects in dataset

Adjusted Rand Index (AR)

𝐴𝑅𝐼 = 𝑁 2 𝑇𝑃 + 𝑇𝑁 − [(𝑇𝑃 + 𝐹𝑃)(𝑇𝑃 + 𝐹𝑁) + (𝐹𝑁 + 𝑇𝑁)(𝐹𝑃 + 𝑇𝑁)] 𝑁 2 2 − [(𝑇𝑃 + 𝐹𝑃)(𝑇𝑃 + 𝐹𝑁) + (𝐹𝑁 + 𝑇𝑁)(𝐹𝑃 + 𝑇𝑁)]

F 1 Score

F 1 Score= 2 ×TP/(2×TP+FP+FN)

Mirkin’s index (MI) Hubert’s index (HI) Corr

It counts the number of disagreements in data pairs between two clustering. It is the ratio of the number of disagreeing pairs to the total number of pairs. Lower value of Mirkin’s index indicates better clustering.

HI = RI – MI Maximum weighted Pearson correlation between average expression value of each class at ground truth and computed cluster

Reads QC Read mapping Quantification

Accuracy comparison

(Pollen et al. 2014, MiSeq) Cells QC Normalization Clustering Differential expression Functional analysis

Reads QC Read mapping Quantification

Accuracy comparison

(Pollen et al. 2014, HiSeq) Cells QC Normalization Clustering Differential expression Functional analysis

Reads QC Read mapping Quantification

Accuracy comparison

(Zeisel et al. 2015) Cells QC Normalization Clustering Differential expression Functional analysis

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Tests for differential gene expression must take both fold change and statistical significance into account DE * * FC = 2 FC = 2 FC = 1.5

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis • • Many reliable DE methods for data with replicates – edgeR [Robinson et al., 2010] – DESeq [Anders et al., 2010] When no/few replicates available bootstrapping provides a robust alternative

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Sensitivity results on Illumina MCF-7 data with varying number of replicates and minimum fold change 1.5

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis

Gene expression table Enrichment Table Spindle Apoptosis 0.00001

0.00025

ENRICHMENT TEST

Experimental Data Interpretation & Hypotheses Gene-set Databases A priori knowledge + existing experimental data

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis http://david.abcc.ncifcrf.gov/

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis http://www.genemania.org/

Outline

Conclusions

• • • • The range of single-cell applications continues to expand, fueled by advances in microfluidics technology and library prep protocols • ATAC-Seq, GT-Seq, Methyl-Seq, … Primary analysis is compute intensive • Requires server/cluster/cloud + linux + scripting • Galaxy framework ( https://usegalaxy.org/ ) provides web based interface to many tools Most secondary/tertiary analyses can be done on PC/Mac using R environment (some programming) • Many can be done using web-based tools and user-friendly apps (we’ll use JMP)

Conclusions

• • Development of single-cell specific analysis methods critical for fully realizing the potential of the technology • Allele specific expression • Biomarker selection • Cell type assignment • • Lineage reconstruction Characterization of heterogeneity Joint analysis of bulk and single cell data still needed to get unbiased cell type frequencies • Can also identify and characterize cell types missed by current capture protocols

Single cells or

AND

computational deconvolution

Acknowledgements

Sahar Al Seesi Marius Nicolae Elham Sherafat Craig Nelson Adrian Caciula Serghei Mangul Yvette Temate Tiagueu Alex Zelikovsky Edward Hemphill James Lindsay

Computational Methods for Analysis of Single Cell RNA-Seq Data Ion Măndoiu Computer Science & Engineering Department University of Connecticut [email protected].

Transcript Computational Methods for Analysis of Single Cell RNA-Seq Data Ion Măndoiu Computer Science & Engineering Department University of Connecticut [email protected].

Computational Methods for Analysis of Single Cell RNA-Seq Data Ion Măndoiu

Outline

2

Gen. Sequencing: Illumina

2

Gen. Sequencing: Illumina

2

Gen. Sequencing: ION Torrent

3

Gen. Sequencing: PacBio SMRT

3

Gen. Sequencing: PacBio SMRT

3

Gen. Sequencing: Oxford Nanopore

Standard (Bulk) RNA-Seq

Alternative Splicing

Transcriptome Reconstruction

Common Approaches

Genome-Guided Transcriptome Reconstruction – Multiple Solutions

Which Solution is Most Likely?

TRIP: select smallest set of transcripts with good statistical fit between fragment length distribution

TRIP Results

Why Single Cell RNA-Seq?

Challenges

Low RNA input + low RT efficiency

Challenges

Stochastic effects (e.g., transcriptional bursting) hard to distinguish from regulated transcriptional heterogeneity PCR amplification bias results in distortion of transcript abundances

SMARTer RNA-Seq Protocol

Correcting PCR Bias using UMIs (STRT-C1)

Outline

RNA-Seq read mapping strategies:

Comparison of spliced read mapping tools

Cannot use raw read counts (why not?)

Expectation-maximization approach (IsoEM, RSEM)

EM Algorithm

EM Algorithm

EM Algorithm

Quantile normalization (Irizarry et al 2002)

Principal Component Analysis

What makes a good clustering?

Many clustering algorithms!

K-Means Clustering

K-Means Clustering

K-Means Clustering

K-Means Clustering

K-Means Clustering

Accuracy measures

Accuracy comparison

Accuracy comparison

Accuracy comparison

Outline

Conclusions

Conclusions

Single cells or

computational deconvolution

Acknowledgements

Directory