
Mercer Kernel-Based Clustering in Feature Space
For Math6397 Prof. Azencott
Erte Pan
Wireless Eng. Group
Advisor: Dr. Han
Department of Electrical and Computer Engineering
University of Houston, Houston, TX.
Author: Mark Girolami
Published in IEEE Transactions on Neural Networks, Vol. 13, No. 3, May 2002
Citations so far: 593
Content
• Problem Statement
• Data-space Clustering
• Feature-space Clustering
• Stochastic Optimization
• Nonparametric Clustering
• Results and Discussion
• References
Problem Statement
• Many data analysis and machine learning tasks involve classifying clouds of data points or predicting labels for incoming data points.
• Machine Learning: enable computers to learn without being explicitly programmed.
  • Unsupervised Learning
  • Supervised Learning
Data-space Clustering
• Clustering: unsupervised partitioning of data observations into self-similar regions.
• Traditional clustering methods:
  • Centroid-based clustering
  • Hierarchical clustering
  • Distribution-based clustering…
Data-space Clustering
• Problem formulation:
  • N data vectors in D-dimensional space: $x_n,\; n = 1, 2, \dots, N$, with $x_n \in \mathbb{R}^D$
  • Given K cluster centers, the within-cluster scatter matrix is defined as:
    $S_W = \frac{1}{N}\sum_{k=1}^{K}\sum_{n=1}^{N} z_{kn}\,(x_n - m_k)(x_n - m_k)^{T}$
  • where the binary variable $z_{kn}$ indicates the membership of data point $x_n$ to cluster k, and
    $m_k = \frac{1}{N_k}\sum_{n=1}^{N} z_{kn}\, x_n, \qquad N_k = \sum_{n=1}^{N} z_{kn}$
Data-space Clustering
• Data-space clustering criterion: sum-of-squares, a measure of compactness.
• K-means, mean shift and so forth…
• The partition of the data set is obtained by solving the optimization problem:
  $\hat{Z} = \arg\min_{Z} \mathrm{Tr}(S_W)$
• NP-hard problem… heuristic algorithms such as Lloyd's algorithm (sketched below):
  • Initialize centroids for a given number of clusters K
  • Assign each data point to the "nearest" mean (Voronoi diagram)
  • Update the centroid of each cluster
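A minimal NumPy sketch of the Lloyd iteration described above; the function name, the random initialization scheme, and the convergence test are my own choices rather than anything prescribed in the paper:

import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: X is (N, D), K is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K distinct data points at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to the nearest centroid (Voronoi cell).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        labels = d2.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

Each pass alternates the Voronoi assignment step and the centroid update until the centroids stop moving.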
Data-space Clustering
• Drawbacks of data-space clustering:
  • linear separation boundaries.
  • prefers clusters of similar size.
  • weights every dimension equally.
  • the number of clusters, K, has to be fixed in advance.
  • may get stuck in a local minimum.
  • sensitive to initialization and outliers.
• Feature-space clustering is proposed to address those problems, hopefully…
Feature-space Clustering
• Same story as in Kernel PCA that everyone can recite…
  $\Phi : \mathbb{R}^D \to F, \qquad x \in X$
Feature-space Clustering
• Computation in feature space, utilizing the kernel trick:
  $\mathrm{Tr}(S_W^{\Phi}) = \mathrm{Tr}\Big\{\frac{1}{N}\sum_{k=1}^{K}\sum_{n=1}^{N} z_{kn}\,\big(\Phi(x_n) - m_k^{\Phi}\big)\big(\Phi(x_n) - m_k^{\Phi}\big)^{T}\Big\} = \frac{1}{N}\sum_{k=1}^{K}\sum_{n=1}^{N} z_{kn}\,\big(\Phi(x_n) - m_k^{\Phi}\big)^{T}\big(\Phi(x_n) - m_k^{\Phi}\big)$
• Using the Mercer kernels, the Gram matrix is:
  $K_{ji} = K_{ij} = k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$
• Denote the term:
  $y_{kn} = K_{nn} - \frac{2}{N_k}\sum_{j=1}^{N} z_{kj} K_{nj} + \frac{1}{N_k^{2}}\sum_{i=1}^{N}\sum_{l=1}^{N} z_{ki} z_{kl} K_{il}$
• then (see the sketch below):
  $\mathrm{Tr}(S_W^{\Phi}) = \frac{1}{N}\sum_{k=1}^{K}\sum_{n=1}^{N} z_{kn}\, y_{kn}$
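To make the kernel-trick computation concrete, here is a small NumPy sketch of my own (with hypothetical names) that evaluates Tr(S_W^Phi) directly from a Gram matrix K and a binary membership matrix Z of shape (K, N), following the y_kn expansion above:

import numpy as np

def feature_space_trace(K_gram, Z):
    """Tr(S_W^Phi) computed purely from the Gram matrix K_gram and the
    binary membership matrix Z of shape (K, N), via the y_kn expansion."""
    N = K_gram.shape[0]
    Nk = Z.sum(axis=1)                                   # cluster sizes N_k, shape (K,)
    diag = np.diag(K_gram)                               # K_nn terms, shape (N,)
    cross = Z @ K_gram                                   # sum_j z_kj K_nj, shape (K, N)
    quad = np.einsum('ki,kl,il->k', Z, Z, K_gram)        # sum_{i,l} z_ki z_kl K_il, shape (K,)
    y = diag[None, :] - 2.0 * cross / Nk[:, None] + (quad / Nk ** 2)[:, None]
    return (Z * y).sum() / N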
Feature-space Clustering
• Denote the following terms:
  $\gamma_k = \frac{N_k}{N}, \qquad R(x \mid C_k) = \frac{1}{N_k^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N} z_{ki} z_{kj} K_{ij}$
• Then straightforward manipulation of the equations yields:
  $\mathrm{Tr}(S_W^{\Phi}) = \frac{1}{N}\sum_{k=1}^{K}\sum_{n=1}^{N} z_{kn} K_{nn} - \sum_{k=1}^{K} \gamma_k R(x \mid C_k)$
• If the Radial Basis Function (RBF) kernel is used:
  $k(x_i, x_j) = \exp\{-(1/c)\,\|x_i - x_j\|^{2}\}$
• then the first term reduces to unity, thus:
  $\mathrm{Tr}(S_W^{\Phi}) = 1 - \sum_{k=1}^{K} \gamma_k R(x \mid C_k)$
• $R(x \mid C_k)$ captures the quadratic sum of the kernel elements allocated to the k-th cluster (see the sketch below).
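The compactness form of the objective can likewise be sketched in a few lines of NumPy; rbf_gram, cluster_quadratic_sums, and rbf_trace are hypothetical helper names of mine, and the last function assumes an RBF kernel so that K_nn = 1:

import numpy as np

def rbf_gram(X, c):
    """Gram matrix K_ij = exp(-(1/c) * ||x_i - x_j||^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / c)

def cluster_quadratic_sums(K_gram, Z):
    """R(x | C_k) = (1/N_k^2) * sum_{i,j} z_ki z_kj K_ij, one value per cluster."""
    Nk = Z.sum(axis=1)
    return np.einsum('ki,kj,ij->k', Z, Z, K_gram) / Nk ** 2

def rbf_trace(K_gram, Z):
    """Tr(S_W^Phi) = 1 - sum_k gamma_k R(x | C_k); valid when K_nn = 1 (RBF kernel)."""
    gamma = Z.sum(axis=1) / Z.shape[1]                  # gamma_k = N_k / N
    return 1.0 - (gamma * cluster_quadratic_sums(K_gram, Z)).sum()

For an RBF Gram matrix this returns the same value as the y_kn-based computation sketched earlier, which makes a handy sanity check.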
Feature-space Clustering
• For the RBF kernel, the following approximation holds due to the convolution theorem for Gaussians (why?); it is checked numerically below:
  $\int_x p(x)^{2}\,dx \approx \frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij}$
• This being the case, then:
  $R(x \mid C_k) = \frac{1}{N_k^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N} z_{ki} z_{kj} K_{ij} \approx \int_{x \in C_k} p(x \mid C_k)^{2}\,dx$
• This makes sense for the clustering later on, because the integral is a measure of the compactness of the cluster.
• Connection to probability and statistics; validation of the kernel model. (What if the kernels are not RBF? Does this prove they are not valid?)
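The kernel-sum approximation can be checked numerically. Strictly speaking, the Parzen-window argument gives the approximation only up to a kernel-dependent normalization constant (1/sqrt(pi*c) in one dimension for the RBF kernel with parameter c); the small experiment below is entirely my own and compares the normalized kernel sum against the closed-form Gaussian value 1/(2*sigma*sqrt(pi)) quoted on the next slide:

import numpy as np

rng = np.random.default_rng(0)
sigma, N, c = 1.0, 2000, 0.1            # c is the RBF width parameter; Parzen width h = sqrt(c)/2
x = rng.normal(0.0, sigma, size=N)      # 1-D Gaussian sample

# RBF Gram matrix K_ij = exp(-(1/c) * (x_i - x_j)^2)
K = np.exp(-np.subtract.outer(x, x) ** 2 / c)

estimate = K.sum() / (N ** 2 * np.sqrt(np.pi * c))    # normalized Parzen estimate of the integral
exact = 1.0 / (2.0 * sigma * np.sqrt(np.pi))          # closed form for a Gaussian density

print(f"kernel-sum estimate: {estimate:.4f}, exact value: {exact:.4f}")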
Feature-space Clustering
• Making sense of the integral $\int_x p(x)^{2}\,dx$:
  $\int_x p(x)^{2}\,dx = E\{p(x)\}$
• Utilizing the Cauchy inequality in statistics:
  $E\{p(x)\cdot 1\} \le \sqrt{E\{p(x)^{2}\}\,E\{1\}}$
• The equality holds when $p(x) = a \cdot 1$, which means the more "uniformly" distributed the data, the more compact the cluster.
• Example:
  • Gaussian: $\int_x p(x)^{2}\,dx = \frac{1}{2\sigma\sqrt{\pi}}$ (worked out below)
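For completeness, the Gaussian example quoted above works out as follows (one-dimensional case with mean μ and standard deviation σ; this short derivation is mine, added for clarity):

\[
p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\;\Longrightarrow\;
\int_x p(x)^2\,dx
= \frac{1}{2\pi\sigma^2}\int_x e^{-\frac{(x-\mu)^2}{\sigma^2}}\,dx
= \frac{\sigma\sqrt{\pi}}{2\pi\sigma^2}
= \frac{1}{2\sigma\sqrt{\pi}}.
\]

A smaller σ, i.e. a tighter cluster, yields a larger value of the integral, which is exactly why $R(x \mid C_k)$ can serve as a compactness measure.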
Feature-space Clustering
• The integral represented by $R(x \mid C_k)$ is the feature-space counterpart of the Euclidean compactness measure defined by the sum-of-squares term.
• Now the optimization problem in feature space becomes:
  $\hat{Z} = \arg\min_{Z} \mathrm{Tr}(S_W^{\Phi}) = \arg\max_{Z} \sum_{k=1}^{K} \gamma_k R(x \mid C_k)$
• Lemma: if the binary restriction on $z_{kn}$ is relaxed to $0 \le z_{kn} \le 1$, the optimum above is still achieved with a binary Z matrix.
• Interpretation: the optimal partitioning of the data occurs only when the partition indicators are 0 or 1.
• This validates the use of stochastic methods for the optimization.
Stochastic Optimization
• Define
  $D_{kj} = 1 - \frac{1}{N_k}\sum_{l=1}^{N} z_{kl} K_{jl}$
  as the penalty associated with assigning the j-th data point to the k-th cluster in feature space.
• Due to the nature of the RBF kernel, $k(x_i, x_j) = \exp\{-(1/c)\,\|x_i - x_j\|^{2}\}$, the range of each element of K is (0, 1].
• The second term of the penalty can be viewed as an estimate of the conditional probability of the j-th data point given the k-th cluster.
• The original objective of the optimization problem is manipulated into (see the sketch below):
  $\mathrm{Tr}(S_W^{\Phi}) = 1 - \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{K} z_{kj} \sum_{l=1}^{N} \frac{z_{kl}}{N_k} K_{jl} = \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{K} z_{kj} D_{kj}$
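A sketch of the penalty matrix and the penalty form of the objective, again with my own helper names, assuming a binary membership matrix Z of shape (K, N) and an RBF Gram matrix (so that K_nn = 1):

import numpy as np

def assignment_penalties(K_gram, Z):
    """D_kj = 1 - (1/N_k) * sum_l z_kl K_jl : penalty of assigning point j to cluster k."""
    Nk = Z.sum(axis=1, keepdims=True)        # (K, 1)
    return 1.0 - (Z @ K_gram) / Nk           # (K, N); K_gram is symmetric

def trace_from_penalties(K_gram, Z):
    """Tr(S_W^Phi) = (1/N) * sum_j sum_k z_kj D_kj (RBF kernel assumed, K_nn = 1)."""
    D = assignment_penalties(K_gram, Z)
    return (Z * D).sum() / Z.shape[1]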
Stochastic Optimization
• Analogous to the stochastic optimization in data space:
  $\mathrm{Tr}(S_W) = \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} z_{kn} E_{kn}$
• where $E_{kn}$ is the sum-of-squares distance term.
• Solved in the fashion of the Expectation-Maximization algorithm (sketched below):
  • The cluster indicator $z_{kn}$ is replaced by its expectation, computed with a softmax function:
    $\langle z_{kn} \rangle = \frac{\exp(-\beta E_{kn}^{new})}{\sum_{k'=1}^{K} \exp(-\beta E_{k'n}^{new})}$
  • each $E_{kn} = \|x_n - m_k\|^{2}$ is then updated using the newly estimated expectation values of the indicators $\langle z_{kn} \rangle$.
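One data-space pass of this EM-style scheme might look as follows (my own sketch; beta is the softmax sharpness, i.e. an inverse-temperature parameter):

import numpy as np

def soft_kmeans_step(X, M, beta):
    """One EM-style pass in data space: expectations of z_kn via a softmax over
    the squared distances E_kn = ||x_n - m_k||^2, then re-estimation of the means."""
    E = ((X[None, :, :] - M[:, None, :]) ** 2).sum(axis=2)       # (K, N)
    logits = -beta * E
    Zs = np.exp(logits - logits.max(axis=0, keepdims=True))      # stabilised softmax
    Zs /= Zs.sum(axis=0, keepdims=True)                          # <z_kn>, columns sum to 1
    M_new = (Zs @ X) / Zs.sum(axis=1, keepdims=True)             # soft cluster means
    return Zs, M_new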
Stochastic Optimization
• Similarly, the stochastic optimization in feature space:
  $\langle z_{kn} \rangle = \frac{\exp(-\beta y_{kn}^{new})}{\sum_{k'=1}^{K} \exp(-\beta y_{k'n}^{new})} = \frac{\pi_k \exp(-2\beta D_{kn}^{new})}{\sum_{k'=1}^{K} \pi_{k'} \exp(-2\beta D_{k'n}^{new})}$
• where:
  $\pi_k = \exp\{-\beta R(x \mid C_k)\}, \qquad D_{kn}^{new} = 1 - \frac{1}{N_k}\sum_{l=1}^{N} \langle z_{kl} \rangle K_{nl}$
• note that $\pi_k$ indicates the compactness of the k-th cluster. (One update pass is sketched below.)
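A corresponding feature-space pass, using only the Gram matrix and the current soft memberships (again my own sketch with hypothetical names):

import numpy as np

def kernel_soft_update(K_gram, Zs, beta):
    """One stochastic-optimization pass in feature space (sketch):
    takes the soft memberships <z_kn> from the previous pass, shape (K, N)."""
    Nk = Zs.sum(axis=1, keepdims=True)                              # soft cluster sizes (K, 1)
    D = 1.0 - (Zs @ K_gram) / Nk                                    # penalties D_kn, shape (K, N)
    R = np.einsum('ki,kj,ij->k', Zs, Zs, K_gram) / Nk[:, 0] ** 2    # R(x | C_k), shape (K,)
    log_pi = -beta * R                                              # pi_k = exp(-beta * R(x | C_k))
    logits = log_pi[:, None] - 2.0 * beta * D
    Zs_new = np.exp(logits - logits.max(axis=0, keepdims=True))
    Zs_new /= Zs_new.sum(axis=0, keepdims=True)                     # softmax over clusters
    return Zs_new

Iterating kernel_soft_update while gradually increasing beta mimics the annealing schedule discussed on the next slides.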
Stochastic Search
• Stochastic methods for optimization
• Different optimization criteria in the traditional method and in the stochastic method:
  • Traditional: error criterion. The BP (back-propagation) method strictly follows the gradient-descent direction; any direction that enlarges the error is NOT acceptable. Easy to get stuck in local minima.
  • BM (Boltzmann machine): associates the system with an "energy". Simulated annealing allows the energy to grow with a certain probability.
Simulated Annealing
• Simulated Annealing:
  1. Create an initial solution Z (the global state of the system); initialize the temperature T >> 1.
  2. Repeat until T reaches its lower bound:
     • Repeat until thermal equilibrium is reached at the current T:
       • Generate a random transition from Z to Z'
       • Let ΔE = E(Z') − E(Z)
       • if ΔE < 0 then Z = Z'
       • else if exp[−ΔE/T] > rand(0,1) then Z = Z' (this term allows "thermal disturbance", which facilitates finding the global minimum)
     • Reduce the temperature T according to the cooling schedule.
  3. Return Z. (A Python rendering follows below.)
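A generic Python rendering of this pseudocode; energy and random_transition are user-supplied placeholders of mine (for the clustering problem, energy(Z) would be Tr(S_W^Phi) and a transition would move one point to another cluster):

import math
import random

def simulated_annealing(Z, energy, random_transition,
                        T=10.0, T_min=1e-3, cooling=0.95, sweeps_per_T=100):
    """Generic simulated-annealing skeleton following the pseudocode above."""
    E_cur = energy(Z)
    while T > T_min:
        for _ in range(sweeps_per_T):        # crude stand-in for "thermal equilibrium"
            Z_new = random_transition(Z)     # propose a random transition Z -> Z'
            dE = energy(Z_new) - E_cur
            # Accept downhill moves always; accept uphill moves with probability
            # exp(-dE/T) -- the "thermal disturbance" that helps escape local minima.
            if dE < 0 or math.exp(-dE / T) > random.random():
                Z, E_cur = Z_new, E_cur + dE
        T *= cooling                         # cooling schedule
    return Z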
Nonparametric Clustering
• Nonparametric: no assumptions on the number of clusters.
• Observations:
  • the kernel matrix will have a block-diagonal structure when there are definite clusters within the data.
  • the eigenvectors of a permuted matrix are the permutations of those of the original matrix; therefore, an indication of the number of clusters may be obtained from the eigen-decomposition of the kernel matrix.
• Recall the approximation:
  $\int_x p(x)^{2}\,dx \approx \frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij}$
Nonparametric Clustering
• Moreover,
  $\int_x p(x)^{2}\,dx \approx \frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij} = \frac{1}{N^{2}}\,\mathbf{1}_N^{T} K \mathbf{1}_N$
• Eigen-decomposition of K gives:
  $K = U \Lambda U^{T}$
• Thus we have:
  $\mathbf{1}_N^{T} K \mathbf{1}_N = \mathbf{1}_N^{T} \Big\{\sum_{i=1}^{N} \lambda_i u_i u_i^{T}\Big\} \mathbf{1}_N = \sum_{i=1}^{N} \lambda_i \{\mathbf{1}_N^{T} u_i\}^{2}$
• This indicates that if there are K distinct clusters within the data samples, then there will be K dominant terms $\lambda_i \{\mathbf{1}_N^{T} u_i\}^{2}$ (why?). A numeric sketch follows below.
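A small NumPy sketch of this diagnostic (the 95% threshold used to decide which terms count as "dominant" is an ad-hoc choice of mine, not something specified in the paper):

import numpy as np

def dominant_terms(K_gram, threshold=0.95):
    """Decompose 1^T K 1 = sum_i lambda_i (1^T u_i)^2 and count how many of the
    largest terms are needed to reach `threshold` of the total; this count hints
    at the number of clusters."""
    eigvals, eigvecs = np.linalg.eigh(K_gram)              # K = U Lambda U^T
    ones = np.ones(K_gram.shape[0])
    terms = eigvals * (eigvecs.T @ ones) ** 2              # lambda_i * (1^T u_i)^2
    terms = np.sort(terms)[::-1]
    cumulative = np.cumsum(terms) / terms.sum()
    return int(np.searchsorted(cumulative, threshold) + 1), terms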
Nonparametric Clustering
 Examples on phantom data sets:
Results and Discussion
• Results on three real data sets: the Fisher Iris data, the Wine data set, and the Crabs data.
Results and Discussion
• Conclusions and discussion:
  • the mean vector in feature space may not serve as a representative or prototype of the input-space clusters.
  • the block-diagonal structure of the kernel matrix can be exploited to estimate the number of possible clusters.
  • the choice of kernel will be data specific.
  • the RBF kernel links the sum-of-squares criterion with the probability metric.
  • the parameter of the RBF kernel should be determined by cross-validation or the leave-one-out technique.
  • eigen-decomposition of the N x N kernel matrix scales as O(N^3).
Results and Discussion
• Remarks of my own:
  • the most appealing point is the link between the distance metric and the probability metric.
  • it is unclear why the stochastic optimization is preferred over ordinary optimization methods.
  • no assessment of other types of kernels.
  • it is unclear how to permute the kernel matrix to obtain the block-diagonal structure.
  • the "super technical" term "dominant" for $\lambda_i \{\mathbf{1}_N^{T} u_i\}^{2}$ in the nonparametric part is too vague; it needs some quantification.
References
• "Data clustering and data visualization," in Learning in Graphical Models, 1998.
• "A projection pursuit algorithm for exploratory data analysis," IEEE Trans. Comput., 1974.
• "An algorithm for Euclidean sum-of-squares classification," Biometrics, 1988.
• "Maximum certainty data partitioning," Pattern Recognition, 2000.
• "An expectation maximization approach to nonlinear component analysis," Neural Comput., 2001.
Questions?
Thank you!