Transcript Document
Statistical Issues in the Clustering of Gene Expression Data
Darlene Goldstein, ISREC, EPFL
5 June 2003

Classification
• Historically, objects are classified into groups
  – periodic table of the elements (chemistry)
  – taxonomy (zoology, botany)
• Why classify?
  – organizational convenience, convenient summary
  – prediction
  – explanation
• Note: these aims do not necessarily lead to the same classification; e.g. SIZE of object in a hardware store vs. TYPE/USE of object

Classification, cont.
• Classification divides objects into groups based on a set of values
• Unlike a theory, a classification is neither true nor false, and should be judged largely on the usefulness of its results (Everitt)
• However, a classification (clustering) may be useful for suggesting a theory, which could then be tested

Numerical methods
• To provide objectivity (put the same objects through the same method and you get the same classification)
  – this is in contrast to classification by expert judgment
• To provide stability
  – would like the classification to be 'robust' to a wide variety of additions of objects or of characteristics

Cluster analysis
• Addresses the problem: given n objects, each described by p variables (or features), derive a useful division into a number of classes
• Usually want a partition of the objects
  – but there is also 'fuzzy' clustering
  – could also take an exploratory perspective
• 'Unsupervised learning'

Difficulties in defining 'cluster'

Pre-processed cDNA Gene Expression Data
• On p genes for n slides: p is O(10,000), n is O(10-100) but growing

            slide 1   slide 2   slide 3   slide 4   slide 5   ...
  Gene 1      0.46     -0.10      0.15     -0.45     -0.06
  Gene 2      0.30      0.49      0.74     -1.03      1.06
  Gene 3      0.80      0.24      0.04     -0.79      1.35
  Gene 4      1.51      0.06      0.10     -0.56      1.09
  Gene 5      0.90      0.46      0.20     -0.32     -1.09
  ...          ...       ...       ...       ...       ...

• Gene expression level of gene 5 in slide 4 = log2(red intensity / green intensity)
• These values are conventionally displayed on a red (> 0), yellow (0), green (< 0) scale

Clustering Gene Expression Data
• Can cluster genes (rows), e.g. to (attempt to) identify groups of co-regulated genes
• Can cluster samples (columns), e.g. to identify tumors based on their expression profiles
• Can cluster both rows and columns at the same time

Clustering Gene Expression Data
• Leads to readily interpretable figures
• Can be helpful for identifying patterns in time or space
• Useful (essential?) when seeking new subclasses of samples
• Can be used for exploratory purposes

Similarity
• Similarity sij indicates the strength of the relationship between two objects i and j
• Usually 0 ≤ sij ≤ 1
• Correlation-based similarity ranges from -1 to 1
• Use of correlation-based similarity is quite common in gene expression studies, but is in general contentious...

Problems using correlation
[Figure: expression profiles of 3 objects measured on 5 variables]

Dissimilarity and Distance
• Associated with a similarity measure sij bounded by 0 and 1 is a dissimilarity dij = 1 - sij
• Distance measures have the metric property (triangle inequality: dij + djk ≥ dik)
• Many examples: Euclidean ('as the crow flies'), Manhattan ('city block'), etc.
• The choice of distance measure has a large effect on performance (see the sketch below)
• The behavior of a distance measure is related to the scale of measurement
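The effect of the choice of dissimilarity is easy to see on a small example. The following is a minimal Python sketch, not part of the talk; the toy log-ratio matrix and gene labels are invented for illustration.

# Compare Euclidean, Manhattan and correlation-based dissimilarities
# on a toy genes-by-slides matrix of log2(red/green) ratios.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([
    [0.5, 1.0, 1.5, 2.0, 2.5],   # gene A: increasing profile
    [5.5, 6.0, 6.5, 7.0, 7.5],   # gene B: same shape, much higher level
    [2.5, 2.0, 1.5, 1.0, 0.5],   # gene C: decreasing profile
])

# Euclidean ('as the crow flies') and Manhattan ('city block') distances
print(squareform(pdist(X, metric="euclidean")))
print(squareform(pdist(X, metric="cityblock")))

# Correlation-based dissimilarity d = 1 - r: genes A and B become
# identical (d = 0) because only the shape of the profile matters,
# while gene C is maximally dissimilar (d = 2).
print(squareform(pdist(X, metric="correlation")))

Under Euclidean or Manhattan distance, genes A and C are closest; under 1 - r, genes A and B are indistinguishable. This is exactly the kind of behavior behind the 'Problems using correlation' slide above.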
Partitioning Methods
• Partition the objects into a prespecified number of groups K
• Iteratively reallocate objects to clusters until some criterion is met (e.g. minimize within-cluster sums of squares)
• Examples: k-means, self-organizing maps (SOM), partitioning around medoids (PAM), model-based clustering

Hierarchical Clustering
• Produces a dendrogram
• Avoids prespecification of the number of clusters K
• The tree can be built in two distinct ways:
  – bottom-up: agglomerative clustering
  – top-down: divisive clustering

Agglomerative Methods
• Start with n mRNA-sample (or G gene) clusters
• At each step, merge the two closest clusters using a measure of between-cluster dissimilarity which reflects the shape of the clusters
• Examples of between-cluster dissimilarities:
  – Unweighted Pair Group Method with Arithmetic Mean (UPGMA): average of pairwise dissimilarities
  – Single-link (nearest neighbour, NN): minimum of pairwise dissimilarities
  – Complete-link (furthest neighbour, FN): maximum of pairwise dissimilarities

Divisive Methods
• Start with only one cluster
• At each step, split a cluster into two parts
• Advantage: obtains the main structure of the data (i.e. focuses on the upper levels of the dendrogram)
• Disadvantage: computational difficulties when considering all possible divisions into two groups

Partitioning vs. Hierarchical
• Partitioning
  – Advantage: provides clusters that (approximately) satisfy some optimality criterion
  – Disadvantages: need to specify K initially; long computation time
• Hierarchical
  – Advantage: fast computation (agglomerative)
  – Disadvantage: rigid; cannot correct later for erroneous decisions made earlier

Generic Clustering Tasks
• Estimating the number of clusters
• Assigning each object to a cluster
• Assessing strength/confidence of cluster assignments for individual objects
• Assessing cluster homogeneity

Bittner et al.
• It has been proposed (by many) that a cancer taxonomy can be identified from gene expression experiments.

Dataset description
• 31 melanomas (from a variety of tissues/cell lines)
• 7 controls
• 8150 cDNAs
• 6971 unique genes
• 3613 genes 'strongly detected'

How many clusters are present?
[Figure: average linkage dendrogram, melanoma only, cut at 1 - r = 0.54; labels 'unclustered' and 'cluster']

Issues in Clustering
• Pre-processing (Image analysis and Normalization)
• Which genes (variables) are used
• Which samples are used
• Which distance measure is used
• Which algorithm is applied
• How to decide the number of clusters K

Issues in Clustering
• Pre-processing (Image analysis and Normalization)
• Which genes (variables) are used
• Which samples are used
• Which distance measure is used
• Which algorithm is applied
• How to decide the number of clusters K

Filtering Genes
• All genes (i.e. don't filter any)
• At least k (or a proportion p) of the samples must have expression values larger than some specified amount, A
• Genes showing 'sufficient' variation (see the sketch after this slide):
  – a gap of size A in the central portion of the data
  – an interquartile range of at least B
  – 'large' SD, CV, ...
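A minimal Python sketch of the filtering rules just listed. Assumptions: X is a genes-by-samples array of pre-processed log-ratios; the thresholds A, k, B and the top-300-by-SD cut-off are placeholders (the last one echoing the figure that follows), not values prescribed by the talk.

# Boolean masks for three of the gene filters listed above.
import numpy as np

def filter_genes(X, A=1.0, k=3, B=1.5, n_top=300):
    """X: genes x samples array of log2 ratios. Returns three masks."""
    # at least k samples with expression value larger than A
    abs_mask = (X > A).sum(axis=1) >= k
    # interquartile range of at least B
    iqr = np.percentile(X, 75, axis=1) - np.percentile(X, 25, axis=1)
    iqr_mask = iqr >= B
    # 'large' SD: keep the n_top genes with the largest standard deviation
    sd_mask = np.zeros(X.shape[0], dtype=bool)
    sd_mask[np.argsort(X.std(axis=1))[::-1][:n_top]] = True
    return abs_mask, iqr_mask, sd_mask

# Example on random data standing in for a real expression matrix
X = np.random.default_rng(0).normal(size=(3613, 31))
abs_mask, iqr_mask, sd_mask = filter_genes(X)
print(abs_mask.sum(), iqr_mask.sum(), sd_mask.sum())

In practice the masks would be combined (e.g. with logical AND/OR) and the reduced matrix X[mask] passed to the clustering algorithm.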
[Figure: average linkage dendrogram, melanoma only, using the top 300 genes by SD; labels 'unclustered' and 'cluster']

Issues in Clustering
• Pre-processing (Image analysis and Normalization)
• Which genes (variables) are used
• Which samples are used
• Which distance measure is used
• Which algorithm is applied
• How to decide the number of clusters K

[Figure: average linkage dendrogram, melanoma only; labels 'unclustered' and 'cluster']
[Figure: average linkage dendrogram, melanoma & controls; labels 'unclustered', 'cluster' and 'control']

Issues in Clustering
• Pre-processing
• Which genes (variables) are used
• Which samples are used
• Which distance measure is used
• Which algorithm is applied
• How to decide the number of clusters K

[Figures: dendrograms using complete linkage (FN), single linkage (NN), and Ward's method (information loss)]

Issues in Clustering
• Pre-processing
• Which genes (variables) are used
• Which samples are used
• Which distance measure is used
• Which algorithm is applied
• How to decide the number of clusters K

[Figures: divisive clustering, melanoma only; divisive clustering, melanoma & controls]

Partitioning methods
K-means and PAM, 2 groups
[Tables: cross-tabulations of sample assignments under Bittner et al.'s published clustering, K-means and PAM (columns: Bittner, K-means, PAM, # samples), giving the number of samples in each combination of cluster labels]

Issues in Clustering
• Pre-processing
• Which genes (variables) are used
• Which samples are used
• Which distance measure is used
• Which algorithm is applied
• How to decide the number of clusters K

How many clusters K?
• Many suggestions for how to decide this!
• Milligan and Cooper (Psychometrika 50:159-179, 1985) studied 30 methods
• A number of newer methods, including the gap statistic (GAP; Tibshirani et al.) and clest (Fridlyand and Dudoit)
• Applying several methods yielded estimates from K = 2 (largest cluster has 27 members) to K = 8 (largest cluster has 19 members)
• (A small sketch of one way to score candidate values of K appears at the end of this transcript.)

[Figure: average linkage dendrogram, melanoma only, cut to give K = 2 and K = 8; labels 'unclustered' and 'cluster']

Summary
• Buyer beware – results of cluster analysis should be treated with GREAT CAUTION and ATTENTION TO SPECIFICS, because...
• Many things can vary in a cluster analysis
• If covariates/group labels are known, then clustering is usually inefficient

Acknowledgements
IPAM Group, UCLA: Debashis Ghosh, Erin Conlon, Dirk Holste, Steve Horvath, Lei Li, Henry Lu, Eduardo Neves, Marcia Salzano, Xianghong Zhao
Others: Jose Correa, Sandrine Dudoit, Jane Fridlyand, William Lemon, Terry Speed, Fred Wright
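As a companion to the 'How many clusters K?' slide, here is a minimal Python sketch, not from the talk, of one simple way to score candidate values of K: the average silhouette width of a K-means partition. The gap statistic and clest mentioned in the talk are more elaborate, resampling-based procedures; the random data below are a stand-in for the melanoma matrix, not the real data.

# Score K = 2..8 by average silhouette width of a K-means partition.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(31, 300))   # toy: 31 samples x 300 filtered genes

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
# In practice one inspects the whole curve of scores (or uses GAP/clest)
# rather than mechanically picking the maximum.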