
BINF636
Clustering and Classification
Jeff Solka Ph.D.
Fall 2008
Gene Expression Data
X_{G \times I} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1I} \\
x_{21} & x_{22} & \cdots & x_{2I} \\
\vdots & \vdots & \ddots & \vdots \\
x_{G1} & x_{G2} & \cdots & x_{GI}
\end{pmatrix}

Rows index the G genes, columns index the I samples, and x_{gi} = expression for gene g in sample i.
The Pervasive Notion of Distance
• We have to be able to measure similarity or
dissimilarity in order to perform clustering,
dimensionality reduction, visualization, and
discriminant analysis.
• How we measure distance can have a profound
effect on the performance of these algorithms.
Distance Measures and Clustering
• Most of the common clustering methods, such as k-means, partitioning around medoids (PAM), and hierarchical clustering, depend on the calculation of distance or an interpoint distance matrix.
• Some clustering methods such as those based
on spectral decomposition have a less clear
dependence on the distance measure.
Distance Measures and
Discriminant Analysis
• Many supervised learning procedures (a.k.a.
discriminant analysis procedures) also depend on
the concept of a distance.
– nearest neighbors
– k-nearest neighbors
– Mixture-models
Two Main Classes of Distances
• Consider two gene expression profiles as
expressed across I samples. Each of these can
be considered as points in RI space. We can
calculate the distance between these two
points.
• Alternatively we can view the gene expression
profiles as being manifestations of samples
from two different probability distributions.
A General Framework for
Distances Between Points
• Consider two m-vectors x = (x1, …, xm) and y = (y1,
…, ym). Define a generalized distance of the form
d(x, y) = F\big(d_1(x_1, y_1), \ldots, d_m(x_m, y_m)\big)

where the d_k are themselves distances for each of the k = 1, \ldots, m features.
• We call this a pairwise distance function as the
pairing of features within observations is preserved.
Minkowski Metric
• Special case of our generalized metric
d(x, y) = F\big(d_1(x_1, y_1), \ldots, d_m(x_m, y_m)\big)

z_k = d_k(x_k, y_k) = |x_k - y_k| \quad \text{and} \quad F(z_1, \ldots, z_m) = \Big(\sum_{k=1}^{m} z_k^{\lambda}\Big)^{1/\lambda}

Manhattan metric: \lambda = 1; Euclidean metric: \lambda = 2.
Euclidean and Manhattan Metric
Euclidean metric:

d_{euc}(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}

Manhattan metric:

d_{man}(x, y) = \sum_{i=1}^{m} |x_i - y_i|
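As a quick check of these two formulas, here is a minimal R sketch (the profiles x and y are made-up data; base R's dist() supplies both metrics):

# Two made-up expression profiles in R^5
x <- c(1.0, 2.5, 0.3, 4.2, 1.1)
y <- c(0.8, 2.0, 1.3, 3.9, 0.7)

dist(rbind(x, y), method = "euclidean")  # sqrt(sum((x - y)^2))
dist(rbind(x, y), method = "manhattan")  # sum(abs(x - y))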
Correlation-based Distance
Measures
• Championed for use within the microarray literature by Eisen.
• Types
  – Pearson’s sample correlation distance.
  – Eisen’s cosine correlation distance.
  – Spearman sample correlation distance.
  – Kendall’s τ sample correlation distance.
Pearson Sample Correlation
Distance (COR)
d_{cor}(x, y) = 1 - r(x, y) = 1 - \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{m} (y_i - \bar{y})^2}}

where \bar{x} and \bar{y} are the mean coordinates of x and y respectively.
Eisen Cosine Correlation Distance
(EISEN)
d_{eisen}(x, y) = 1 - \frac{x'y}{\|x\| \, \|y\|} = 1 - \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2} \, \sqrt{\sum_{i=1}^{m} y_i^2}}

This is a special case of the Pearson correlation distance with \bar{x} = \bar{y} = 0.
Spearman Sample Correlation
Distance (SPEAR)
d_{spear}(x, y) = 1 - \frac{\sum_{i=1}^{m} (x'_i - \bar{x}')(y'_i - \bar{y}')}{\sqrt{\sum_{i=1}^{m} (x'_i - \bar{x}')^2} \, \sqrt{\sum_{i=1}^{m} (y'_i - \bar{y}')^2}}

where x'_i = \text{rank}(x_i) and y'_i = \text{rank}(y_i).
Kendall’s τ Sample Correlation Distance (TAU)

d_{tau}(x, y) = 1 - \tau(x, y) = 1 - \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} C_{x_{ij}} C_{y_{ij}}}{m(m-1)}

where C_{x_{ij}} = \text{sign}(x_i - x_j) and C_{y_{ij}} = \text{sign}(y_i - y_j).
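To make the four correlation-based distances concrete, here is a minimal R sketch (x and y are made-up profiles; base R's cor() provides the Pearson, Spearman, and Kendall coefficients, and the Eisen distance is coded directly):

x <- c(1.0, 2.5, 0.3, 4.2, 1.1)
y <- c(0.9, 2.2, 0.8, 3.7, 1.4)

d.cor   <- 1 - cor(x, y, method = "pearson")           # COR
d.eisen <- 1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))  # EISEN (uncentered)
d.spear <- 1 - cor(x, y, method = "spearman")          # SPEAR (rank-based)
d.tau   <- 1 - cor(x, y, method = "kendall")           # TAU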
Some Observations - I
• Since we are subtracting the correlation
measures from 1, things that are perfectly
positively correlated (correlation measure of 1)
will have a distance close to 0 and things that
are perfectly negatively correlated (correlation
measure of -1) will have a distance close to 2.
• Correlation measures in general are invariant to
location and scale transformations and tend to
group together genes whose expression values
are linearly related.
BINF636 CLUSTERING AND CLASSIFICATION
19
Some Observations - II
• The parametric methods (COR and EISEN) tend to be more negatively affected by the presence of outliers than the non-parametric methods (SPEAR and TAU).
• Under the assumption that we have standardized the data so that x and y are m-vectors with zero mean and unit length, there is a simple relationship between the Pearson correlation coefficient r(x, y) and the Euclidean distance:

d_{euc}(x, y) = \sqrt{2m\big(1 - r(x, y)\big)}
Mahalanobis Distance
d_{mah}(x, y) = \sqrt{(x - y)' \, \Sigma^{-1} \, (x - y)}

• This allows data directional variability to come into play when calculating distances.
• How do we estimate \Sigma?
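A minimal R sketch of the Mahalanobis distance, in which Σ is estimated by the sample covariance of made-up data (note that base R's mahalanobis() returns the squared distance):

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)          # made-up 2-d data
S <- cov(X)                                # estimate Sigma by the sample covariance
x <- c(1, 2); y <- c(0, 0)
sqrt(mahalanobis(x, center = y, cov = S))  # Mahalanobis distance between x and y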
Distances and Transformations
• Assume that g is an invertible, possibly non-linear transformation g: x → x′.
• This transformation induces a new metric d′ via

d(x, y) = d\big(g^{-1}(x'), g^{-1}(y')\big) =: d'(x', y')
Distances and Scales
• Original scanned fluorescence intensities
• Logarithmically transformed data
• Data transformed by the general logarithm
Experiment-specific Distances
Between Genes
• One might like to use additional experimental design information in determining how one calculates distances between the genes.
• One might wish to use smoothed estimates or other sorts of statistical fits and measure distances between these.
• In time course data, distances that honor the time order of the data are appropriate.
Standardizing Genes
x_{gi} \leftarrow \frac{x_{gi} - \text{center}(x_{g\cdot})}{\text{scale}(x_{g\cdot})}

where center(x_{g·}) is a measure of the center of the distribution of the set of values x_{gi}, i = 1, \ldots, I (mean, median), and scale(x_{g·}) is a measure of scale (standard deviation, interquartile range, MAD).
Standardizing Arrays (Samples)
x_{gi} \leftarrow \frac{x_{gi} - \text{center}(x_{\cdot i})}{\text{scale}(x_{\cdot i})}

where center(x_{·i}) is a measure of the center of the distribution of the set of values x_{gi}, g = 1, \ldots, G (mean, median), and scale(x_{·i}) is a measure of scale (standard deviation, interquartile range, MAD).
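Both standardizations are one-liners in R; a minimal sketch on a made-up G × I matrix (genes in rows, samples in columns), using the median/MAD variant:

set.seed(2)
X <- matrix(rnorm(50), nrow = 10)  # made-up 10-gene x 5-sample matrix

# Standardize genes: center and scale each row across the I samples
X.genes <- t(apply(X, 1, function(v) (v - median(v)) / mad(v)))

# Standardize arrays: center and scale each column across the G genes
X.arrays <- apply(X, 2, function(v) (v - median(v)) / mad(v))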
Scaling and Its Implication to
Data Analysis - I
• Types of gene expression data
– Relative (cDNA)
– Absolute (Affymetrix)
• x_{gi} is the expression of gene g on sample i as measured on a log scale.
• Let y_{gi} = x_{gi} − x_{gA}; patient A is our reference.
• The distance between patient samples:

d(y_{\cdot i}, y_{\cdot j}) = \sum_{g=1}^{G} d_g(y_{gi}, y_{gj}) = \sum_{g=1}^{G} d_g(x_{gi} - x_{gA}, \, x_{gj} - x_{gA})

where the sum of course is over all genes and x_{gA} is the expression of gene g on patient A.
Scaling and Its Implication to
Data Analysis - II
If d(x, y) is a function of x − y alone, then d(y_{\cdot i}, y_{\cdot j}) = d(x_{\cdot i}, x_{\cdot j}), and it does not matter if we look at relative (the y's) or absolute (the x's) expression measures.
Scaling and Its Implication to
Data Analysis - III
The distance between two genes is given by

d(y_{g\cdot}, y_{h\cdot}) = \sum_{i=1}^{I} d_i(y_{gi}, y_{hi}) = \sum_{i=1}^{I} d_i(x_{gi} - x_{gA}, \, x_{hi} - x_{hA})

where x_{gA} is the expression of gene g on patient A and x_{hA} is the expression for gene h on patient A. If d(x, y) has the property that d(x + c, y) = d(x, y) for any constant c, then the distance measure is the same for the absolute and relative expression measures.
Summary of Effects of Scaling
on Distance Measures
• Minkowski distances
  – Distance between samples is the same for relative and absolute measures.
  – Distance between genes is not the same for relative and absolute measures.
• Pearson correlation-based distance
  – Distance between genes is the same for relative and absolute measures.
  – Distance between samples is not the same for relative and absolute measures.
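These claims are easy to check numerically; a minimal R sketch on a made-up log-scale matrix, taking patient A to be the first column:

set.seed(3)
X <- matrix(rnorm(40), nrow = 8)  # 8 genes x 5 samples, log scale
Y <- X - X[, 1]                   # relative expression: subtract patient A

# Manhattan distance between samples 2 and 3: identical for X and Y
sum(abs(X[, 2] - X[, 3])); sum(abs(Y[, 2] - Y[, 3]))

# Pearson correlation distance between genes 1 and 2: identical for X and Y
1 - cor(X[1, ], X[2, ]); 1 - cor(Y[1, ], Y[2, ])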
What is Cluster Analysis?
• Given a collection of n objects, each of which is described by a set of p characteristics or variables, derive a useful division into a number of classes.
• Both the number of classes and the properties
of the classes are to be determined.
(Everitt 1993)
Why Do This?
• Organize
• Prediction
• Etiology (Causes)
How Do We Measure Quality?
• Multiple Clusters
– Male, Female
– Low, Middle, Upper Income
• Neither True Nor False
• Measured by Utility
Difficulties In Clustering
• Cluster structure may be manifest in a multitude of ways.
• Large data sets and high dimensionality complicate matters.
Clustering Prerequisites
• Method to measure the distance between observations
and clusters
– Similarity
– Dissimilarity
– This was discussed previously
• Method of normalizing the data
– We discussed this previously
• Method of reducing the dimensionality of the data
– We discussed this previously
The Number of Groups Problem
• How do we decide on the appropriate number of clusters?
• Duda, Hart and Stork (2001)
  – Form Je(2)/Je(1), where Je(M) is the sum of squared error criterion for the M-cluster model. The distribution of this ratio is usually not known.

J_e(1) = \sum_{x \in D} \|x - m\|^2, \qquad J_e(2) = \sum_{i=1}^{2} \sum_{x \in D_i} \|x - m_i\|^2
Optimization Methods
• Minimizing or Maximizing Some Criteria
• Does Not Necessarily Form Hierarchical
Clusters
Clustering Criteria
The Sum of Squared Error Criteria
m_i = \frac{1}{n_i} \sum_{x \in D_i} x

J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - m_i\|^2
Spoofing of the Sum of Squares
Error Criterion
Related Criteria
• With a little manipulation we obtain

J_e = \frac{1}{2} \sum_{i=1}^{c} n_i \bar{s}_i, \qquad \bar{s}_i = \frac{1}{n_i^2} \sum_{x \in D_i} \sum_{x' \in D_i} \|x - x'\|^2

• Instead of using average squared distances between points in a cluster as indicated above, we could perhaps use the median or maximum distance.
• Each of these will produce its own variant.
Scatter Criteria
m_i = \frac{1}{n_i} \sum_{x \in D_i} x, \qquad m = \frac{1}{n} \sum_{x \in D} x

S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t

S_W = \sum_{i=1}^{c} S_i, \qquad S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t

S_T = \sum_{x \in D} (x - m)(x - m)^t, \qquad S_T = S_W + S_B
Relationship of the Scattering
Criteria
S_W measures the within-cluster scatter and S_B the between-cluster scatter. Since S_T = S_W + S_B depends only on the data, a partition that decreases S_W necessarily increases S_B.
Measuring the Size of Matrices
• So we wish to minimize SW while maximizing SB
• We will measure the size of a matrix by using its trace or determinant.
• These are equivalent in the case of univariate
data
Interpreting the Trace Criteria
\text{tr}(S_W) = \sum_{i=1}^{c} \text{tr}(S_i) = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - m_i\|^2 = J_e

\text{tr}(S_T) = \text{tr}(S_W) + \text{tr}(S_B)
The Determinant Criteria
• S_B will be singular if the number of clusters is less than or equal to the dimensionality.

J_d = |S_W| = \Big| \sum_{i=1}^{c} S_i \Big|

• Partitions based on J_e may change under linear transformations of the data.
• This is not the case with J_d.
Other Invariant Criteria
• It can be shown that the eigenvalues λ_i of S_W^{-1} S_B are invariant under nonsingular linear transformations.
• We might choose to maximize

\text{tr}(S_W^{-1} S_B) = \sum_{i=1}^{d} \lambda_i \qquad \text{or} \qquad \frac{|S_W|}{|S_T|} = \prod_{i=1}^{d} \frac{1}{1 + \lambda_i}
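A minimal R sketch computing the scatter matrices and the two invariant criteria for a given partition (made-up data; the labels come from an example k-means run):

set.seed(4)
X  <- matrix(rnorm(300), ncol = 2)      # made-up 2-d data
cl <- kmeans(X, centers = 3)$cluster    # an example partition

m  <- colMeans(X)                       # grand mean
Sw <- Sb <- matrix(0, 2, 2)
for (i in unique(cl)) {
  Xi <- X[cl == i, , drop = FALSE]
  mi <- colMeans(Xi)
  Sw <- Sw + crossprod(sweep(Xi, 2, mi))         # add S_i, the cluster scatter
  Sb <- Sb + nrow(Xi) * ((mi - m) %o% (mi - m))  # between-cluster term
}
sum(diag(solve(Sw) %*% Sb))  # tr(Sw^-1 Sb) = sum of the lambda_i
det(Sw) / det(Sw + Sb)       # |Sw|/|St| = prod of 1/(1 + lambda_i)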
k-means Clustering
1. Begin: initialize n, k, m1, m2, …, mk
2. Do: classify the n samples according to the nearest mi; recompute the mi
3. Until: no change in the mi
4. Return m1, m2, …, mk
5. End

• Complexity of the algorithm is O(ndkT)
  – T is the number of iterations
  – T is typically << n
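In R this is base kmeans(); a minimal sketch on made-up data (tot.withinss is exactly the Je criterion from earlier):

set.seed(5)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))  # two made-up groups
fit <- kmeans(X, centers = 2)
fit$centers       # the final means m1, m2
fit$tot.withinss  # the sum of squared error criterion Je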
Example Mean Trajectories
Optimizing the Clustering
Criterion
• N(n, g) = the number of partitions of n individuals into g groups
  – N(15, 3) = 2,375,101
  – N(20, 4) = 45,232,115,901
  – N(25, 8) = 690,223,721,118,368,580
  – N(100, 5) ≈ 10^68
• Note that 3.15 × 10^17 is the estimated age of the universe in seconds.
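These counts are Stirling numbers of the second kind; a minimal recursive R sketch (exact only while the result fits in a double):

# N(n, g) = g*N(n-1, g) + N(n-1, g-1)
stirling2 <- function(n, g) {
  if (g == 1 || g == n) return(1)
  if (g < 1 || g > n) return(0)
  g * stirling2(n - 1, g) + stirling2(n - 1, g - 1)
}
stirling2(15, 3)  # 2375101, matching the table above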
Hill Climbing Algorithms
1 - Form an initial partition into the required number of groups.
2 - Calculate the change in the clustering criterion produced by moving each individual from its own cluster to another cluster.
3 - Make the change which leads to the greatest improvement in the value of the clustering criterion.
4 - Repeat steps (2) and (3) until no move of a single individual causes the clustering criterion to improve.

• Guarantees a local, not global, optimum.
How Do We Choose c
• Randomly “classify” points to generate the mi’s
• Randomly generate mi’s
• Base location of the c solution on the c-1
solution
• Base location of the c solution on a hierarchical
solution
Alternative Methods
• Simulated Annealing
• Genetic Algorithms
• Quantum Computing
Hierarchical Cluster Analysis
• 1 Cluster to n Clusters
• Agglomerative Methods
– Fusion of n Data Points into Groups
• Divisive Methods
– Separate the n Data Points Into Finer
Groupings
Dendrograms
[Figure: dendrogram on five objects (1)–(5) with fusion levels 0–4; read left to right it is agglomerative — (1,2), then (4,5), then (3,4,5), then (1,2,3,4,5) — and read right to left it is divisive.]
Agglomerative Algorithm
(Bottom Up or Clumping)
Start: clusters C1, C2, ..., Cn, each with one data point.
1 - Find the nearest pair Ci, Cj; merge Ci and Cj, delete Cj, and decrement the cluster count by 1.
2 - If the number of clusters is greater than 1, go back to step 1.
Inter-cluster Dissimilarity
Choices
• Furthest Neighbor (Complete Linkage)
• Nearest Neighbor (Single Linkage)
• Group Average
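All three choices are available through the method argument of R's hclust(); a minimal sketch on made-up data:

set.seed(6)
d <- dist(matrix(rnorm(40), ncol = 2))  # interpoint distances for made-up data

hc.single   <- hclust(d, method = "single")    # nearest neighbor
hc.complete <- hclust(d, method = "complete")  # furthest neighbor
hc.average  <- hclust(d, method = "average")   # group average
plot(hc.complete)                              # draw the dendrogram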
Single Linkage
(Nearest Neighbor) Clustering
• Distance Between Groups is Defined as That of
the Closest Pair of Individuals Where We
Consider 1 Individual From Each Group
• This method may be adequate when the clusters
are fairly well separated Gaussians but it is
subject to problems with chaining
Example of Single Linkage
Clustering
Initial interpoint distance matrix:

        1     2     3     4     5
  1    0.0
  2    2.0   0.0
  3    6.0   5.0   0.0
  4   10.0   9.0   4.0   0.0
  5    9.0   8.0   5.0   3.0   0.0

The smallest entry is d12 = 2.0, so individuals 1 and 2 are fused first. Under single linkage the distances from the new cluster are d(12)3 = min[d13, d23] = 5.0, d(12)4 = min[d14, d24] = 9.0, and d(12)5 = min[d15, d25] = 8.0:

       (1 2)    3     4     5
 (1 2)  0.0
   3    5.0   0.0
   4    9.0   4.0   0.0
   5    8.0   5.0   3.0   0.0
Complete Linkage Clustering
(Furthest Neighbor)
• Distance Between Groups is Defined as That of the Most Distant Pair of Individuals
Complete Linkage Example
Initial interpoint distance matrix:

        1     2     3     4     5
  1    0.0
  2    2.0   0.0
  3    6.0   5.0   0.0
  4   10.0   9.0   4.0   0.0
  5    9.0   8.0   5.0   3.0   0.0

(1, 2) is the first cluster. Under complete linkage the merged distances are
d(12)3 = max[d13, d23] = d13 = 6.0
d(12)4 = max[d14, d24] = d14 = 10.0
d(12)5 = max[d15, d25] = d15 = 9.0
The smallest entry in the updated matrix is now d45 = 3.0, so the next fusion joins individuals 4 and 5.
Group Average Clustering
• Distance between clusters is the average of the
distance between all pairs of individuals
between the 2 groups
• A compromise between single linkage and
complete linkage
Centroid Clusters
• We use the centroid of a group once it is formed.
[Figure: when two clusters are joined they are replaced by their centroid, and subsequent distances are computed to that centroid.]
Problems With Hierarchical
Clustering
• Well, it really gives us a continuum of different clusterings of the data.
• As stated previously, there are specific artifacts of the various methods.
Dendrogram
Data Color Histogram or Data
Image
Orderings of the data matrix were first discussed by Bertin. Wegman coined the term “data color histogram” in 1990. Mike Minnotte and Webster West subsequently coined the term “data image” in 1998.
Data Image Reveals Obfuscated
Cluster Structure
[Figure panels: subset of the pairs plot; the data image sorted on observations; sorted on observations and features.]

90 observations in R^100 drawn from a standard normal distribution. The first and second sets of 30 rows were shifted by 20 in their first and second dimensions respectively. This data matrix was then multiplied by a 100 × 100 matrix of Gaussian noise.
The Data Image in the Gene Expression
Community
• Extracted from
Example Dataset
Complete Linkage Clustering
Single Linkage Clustering
Average Linkage Clustering
Pruning Our Tree
cutree(tree, k = NULL, h = NULL)

Arguments:
  tree  a tree as produced by hclust. cutree() only expects a list with
        components merge, height, and labels, of appropriate content each.
  k     an integer scalar or vector with the desired number of groups.
  h     numeric scalar or vector with heights where the tree should be cut.

At least one of k or h must be specified; k overrides h if both are given.

Value:
  cutree returns a vector with group memberships if k or h are scalar;
  otherwise a matrix with group memberships is returned where each column
  corresponds to the elements of k or h, respectively (which are also used
  as column names).
Example Pruning
> # x.dist is the interpoint distance matrix for the example dataset
> x.cl2 <- cutree(hclust(x.dist), k = 2)
> x.cl2[1:10]
 [1] 1 1 1 1 1 1 1 1 1 1
> x.cl2[190:200]
 [1] 2 2 2 2 2 2 2 2 2 2 2
Identifying the Number of
Clusters
• As indicated previously, we really have no way of identifying the true cluster structure unless we have divine intervention.
• In the next several slides we present some
well-known methods
Method of Mojena
• Select the number of groups based on the first stage j of the dendrogram that satisfies

\alpha_{j+1} > \bar{\alpha} + k s_{\alpha}

• The α0, α1, α2, ..., αn−1 are the fusion levels corresponding to stages with n, n−1, …, 1 clusters. \bar{\alpha} and s_{\alpha} are the mean and unbiased standard deviation of these fusion levels, and k is a constant.
• Mojena (1977): 2.75 < k < 3.5
• Milligan and Cooper (1985): k = 1.25
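A minimal R sketch of Mojena's rule applied to an hclust fit (made-up data; the fusion levels are the height component of the fit):

set.seed(7)
X  <- matrix(rnorm(60), ncol = 2)  # made-up data
hc <- hclust(dist(X))

a <- hc$height                          # fusion levels alpha_1, ..., alpha_{n-1}
k <- 1.25                               # Milligan and Cooper's recommended constant
j <- which(a > mean(a) + k * sd(a))[1]  # first stage exceeding the threshold
if (!is.na(j)) nrow(X) - j + 1 else 1   # suggested number of groups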
Hartigan’s k-means theory
When deciding on the number of clusters,
Hartigan (1975, pp 90-91) suggests the
following rough rule of thumb. If k is the
result of kmeans with k groups and kplus1 is
the result with k+1 groups, then it is
justifiable to add the extra group when
(sum(k$withinss)/sum(kplus1$withinss)-1)*(nrow(x)-k-1)
is greater than 10.
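A minimal R sketch that scans candidate values of k with this rule (made-up data; hartigan() is a hypothetical helper written for this example):

set.seed(8)
x <- rbind(matrix(rnorm(100, 0), ncol = 2),
           matrix(rnorm(100, 4), ncol = 2))  # two made-up groups

# Hartigan's statistic for moving from k to k+1 groups
hartigan <- function(x, k) {
  fk  <- kmeans(x, centers = k, nstart = 10)
  fk1 <- kmeans(x, centers = k + 1, nstart = 10)
  (sum(fk$withinss) / sum(fk1$withinss) - 1) * (nrow(x) - k - 1)
}
sapply(1:5, function(k) hartigan(x, k))  # add a group while the value exceeds 10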
kmeans Applied to our Data Set
The 3 term kmeans solution
The 4 term kmeans Solution
Determination of the Number of Clusters Using the
Hartigan Criteria
MIXTURE-BASED CLUSTERING
f(x) = \sum_{i=1}^{g} \pi_i f_i(x, \theta), \qquad f_i(x, \theta) = N(\mu_i, \Sigma_i)
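One way to fit such a normal mixture in R is the CRAN package mclust; a minimal sketch, assuming mclust is installed (the data are made up):

library(mclust)
set.seed(9)
X <- rbind(matrix(rnorm(100, 0), ncol = 2),
           matrix(rnorm(100, 3), ncol = 2))
fit <- Mclust(X, G = 2)    # two-component normal mixture
fit$parameters$mean        # estimated component means mu_i
table(fit$classification)  # cluster memberships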
HOW DO WE CHOOSE g?
• Human Intervention
• Divine Intervention
• Likelihood Ratio Test Statistic
– Wolfe’s Method
– Bootstrap
– AIC, BIC, MDL
• Adaptive Mixtures Based Methods
– Pruning
– SHIP (AKMM)
Akaike's Information criteria
(AIC)
• AIC(g) = −2 log L(g) + 2N(g), where L(g) is the maximized likelihood and N(g) is the number of free parameters in the model of size g.
• We choose g in order to minimize the AIC criterion.
• This criterion is subject to the same regularity conditions as −2 log λ.
MIXTURE VISUALIZATION 2-d
MODEL-BASED CLUSTERING
• This technique takes a density function
approach.
• Uses finite mixture densities as models for
cluster analysis.
• Each component density characterizes a cluster.
Minimal Spanning Tree-Based
Clustering
Diansheng Guo, Donna Peuquet, and Mark Gahegan (2002), Opening the black box: interactive hierarchical clustering for multivariate spatial patterns, Proceedings of the Tenth ACM International Symposium on Advances in Geographic Information Systems, McLean, Virginia, USA.
What is Pattern Recognition?
• From Devroye, Györfi and Lugosi:
– Pattern recognition or discrimination is about
guessing or predicting the unknown nature of
an observation, a discrete quantity such as
black or white, one or zero, sick or healthy,
real or fake.
• From Duda, Hart and Stork:
– The act of taking in raw data and taking an
action based on the “category” of the
pattern.
Isn’t This Just Statistics?
• Short answer: yes.
• Breiman (Statistical Science, 2001) suggests there are two cultures within statistical modeling: Stochastic Modelers and Algorithmic Modelers.
Algorithmic Modeling
• Pattern recognition (classification) is concerned with
predicting class membership of an observation.
• This can be done from the perspective of (traditional
statistical) data models.
• Often, the data is high dimensional, complex, and of
unknown distributional origin.
• Thus, pattern recognition often falls into the “algorithmic
modeling” camp.
• The measure of performance is whether it accurately
predicts the class, not how well it models the
distribution.
• Empirical evaluations often are more compelling than
asymptotic theorems.
Pattern Recognition Flowchart
Pattern Recognition Concerns
• Feature extraction and distance calculation
• Development of automated algorithms for classification.
• Classifier performance evaluation.
• Latent or hidden class discovery based on extracted feature analysis.
• Theoretical considerations.
Linear and Quadratic
Discriminant Analysis in Action
Nearest Neighbor Classifier
SVM Training Cartoon
CART Analysis of the Fisher Iris
Data
Random Forests
• Create a large number of trees based on random samples
of our dataset.
• Use a bootstrap sample for each random sample.
• Variables used to create the splits are a random subsample of all of the features.
• All trees are grown fully.
• Majority vote determines membership of a new
observation.
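A minimal sketch in R, assuming the CRAN package randomForest is installed (the built-in iris data stands in for an expression matrix):

library(randomForest)
set.seed(10)
fit <- randomForest(Species ~ ., data = iris, ntree = 500)
fit                        # out-of-bag error estimate and confusion matrix
predict(fit, iris[1:5, ])  # majority vote over the trees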
Boosting and Bagging
Boosting
Evaluating Classifiers
Resubstitution
Cross Validation
Leave-k-Out
Cross-Validation Notes
Test Set
Some Classifier Results on the Golub
ALL vs AML Dataset
References - I
• Richard O. Duda, Peter E. Hart, David G. Stork (2001), Pattern Classification, 2nd Edition.
• Eisen MB, Spellman PT, Brown PO and Botstein D (1998). Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A 95, 14863-8.
• Brian S. Everitt, Sabine Landau, Morven Leese (2001), Cluster Analysis, 4th Edition, Arnold.
• Gasch AP and Eisen MB (2002). Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology 3(11), 1-22.
• Gad Getz, Erel Levine, and Eytan Domany (2000). Coupled two-way clustering analysis of gene microarray data. PNAS, vol. 97, no. 22, pp. 12079-12084.
• Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D and Brown P (2000). 'Gene Shaving' as a Method for Identifying Distinct Sets of Genes with Similar Expression Patterns. GenomeBiology.com 1.
References - II
• A. K. Jain, M. N. Murty, P. J. Flynn (1999). Data clustering: a review. ACM Computing Surveys (CSUR), Volume 31, Issue 3.
• John Quackenbush (2001). Computational analysis of microarray data. Nature Reviews Genetics, Volume 2, pp. 418-427.
• Ying Xu, Victor Olman, and Dong Xu (2002). Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics 18: 536-545.
References - III
• Hastie, Tibshirani, Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2001.
• Devroye, Györfi, Lugosi, A Probabilistic Theory of Pattern Recognition, 1996.
• Ripley, Pattern Recognition and Neural Networks, 1996.
• Fukunaga, Introduction to Statistical Pattern Recognition, 1990.