Erte Pan - University of Houston
Mercer Kernel-Based Clustering in Feature Space
For Math6397 Prof. Azencott
Erte Pan
Wireless Eng. Group
Advisor: Dr. Han
Department of Electrical and Computer Engineering
University of Houston, Houston, TX.
Author: Mark Girolami
Published in IEEE Transactions on Neural
Networks, vol. 13, no. 3, May 2002
Citations so far: 593
Content
Problem Statement
Data-space Clustering
Feature-space Clustering
Stochastic Optimization
Nonparametric Clustering
Results and Discussion
References
Problem Statement
Many data analysis and machine learning tasks involve classifying
data clouds or predicting incoming data points.
Machine Learning: Enable computers to learn without being
explicitly programmed.
Unsupervised Learning
Supervised Learning
Data-space Clustering
Clustering: Unsupervised partitioning of data observations into self-similar regions.
Traditional clustering methods:
Centroid-based clustering
Hierarchical clustering
Distribution-based clustering…
Data-space Clustering
Problem formulation:
N data vectors in D-dimensional space:
$$x_n \in \mathbb{R}^D, \quad n = 1, 2, \ldots, N$$
Given K cluster centers, the within-cluster scatter matrix is defined
as:
$$S_W = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} (x_n - m_k)(x_n - m_k)^T$$
where the binary variable $z_{kn}$ indicates the membership of data
point $x_n$ to cluster k, and
$$m_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{kn} x_n, \qquad N_k = \sum_{n=1}^{N} z_{kn}$$
Data-space Clustering
Data-space clustering criterion: sum-of-squares; measure of
compactness
K-means, mean shift and so forth…
The partition of the data set is obtained by solving the optimization problem:
$$Z = \arg\min_{Z} \mathrm{Tr}(S_W)$$
NP-hard problem… heuristic algorithms such as Lloyd’s algorithm:
Initialize the centroids for the given number of clusters K
Assign each data point to the “nearest” mean (Voronoi diagram)
Update the centroid of each cluster
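The three steps of Lloyd's algorithm can be sketched in NumPy as follows. This is a minimal illustration, not the code used in the paper: the function name `lloyd_kmeans`, the random choice of initial centroids, the optional `init` argument and the stopping rule are all assumptions made here.

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0, init=None):
    """Lloyd's heuristic for minimizing the sum-of-squares criterion Tr(S_W)."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Assumption: use K distinct data points as the initial centroids.
        init = X[rng.choice(len(X), size=K, replace=False)]
    centroids = np.array(init, dtype=float)
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid in squared Euclidean distance.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return labels, centroids
```

The result depends on the initial centroids, which is one of the drawbacks of data-space clustering.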
Data-space Clustering
Drawbacks of data-space clustering:
produces linear separation boundaries.
prefers clusters of similar size.
weights every dimension equally.
the number of clusters, K, has to be fixed at the beginning.
may get stuck in a local minimum.
sensitive to initialization and outliers.
Feature-space clustering is proposed to address these problems,
hopefully…
Feature-space Clustering
Same story as in Kernel PCA that everyone can recite…
$$\Phi: \mathbb{R}^D \to F, \qquad x \mapsto X = \Phi(x)$$
Feature-space Clustering
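Computation in feature space, utilizing the kernel trick:
$$\mathrm{Tr}(S_W^\Phi) = \mathrm{Tr}\left\{ \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} (\Phi(x_n) - m_k^\Phi)(\Phi(x_n) - m_k^\Phi)^T \right\}$$
$$\mathrm{Tr}(S_W^\Phi) = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} (\Phi(x_n) - m_k^\Phi)^T (\Phi(x_n) - m_k^\Phi)$$
Using Mercer kernels, the Gram matrix is:
$$K_{ji} = K_{ij} = k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$$
Denote the term:
$$y_{kn} = K_{nn} - \frac{2}{N_k} \sum_{j=1}^{N} z_{kj} K_{nj} + \frac{1}{N_k^2} \sum_{i=1}^{N} \sum_{l=1}^{N} z_{ki} z_{kl} K_{il}$$
then:
$$\mathrm{Tr}(S_W^\Phi) = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} y_{kn}$$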
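As a sanity check of the kernel trick, the trace can be computed from the Gram matrix alone, without ever forming the feature vectors. The sketch below is an illustration under assumptions: the function name `feature_space_trace` and the layout of `Z` (one row per cluster, one column per data point, no empty clusters) are choices made here.

```python
import numpy as np

def feature_space_trace(K, Z):
    """Tr(S_W^Phi) from the Gram matrix K (N x N) and indicators Z (n_clusters x N)."""
    N = K.shape[0]
    Nk = Z.sum(axis=1)                                   # cluster sizes N_k
    cross = (Z @ K) / Nk[:, None]                        # (1/N_k) sum_j z_kj K_nj
    quad = np.einsum("ki,il,kl->k", Z, K, Z) / Nk**2     # (1/N_k^2) sum_il z_ki z_kl K_il
    y = np.diag(K)[None, :] - 2.0 * cross + quad[:, None]  # y_kn
    return float((Z * y).sum() / N)
```

With the linear kernel `K = X @ X.T` the feature space is the data space, so the value should match the ordinary sum-of-squares criterion Tr(S_W).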
Feature-space Clustering
Denote the following terms:
$$\gamma_k = \frac{N_k}{N}, \qquad R(x | C_k) = \frac{1}{N_k^2} \sum_{i=1}^{N} \sum_{j=1}^{N} z_{ki} z_{kj} K_{ij}$$
Then straightforward manipulation of the equations yields:
$$\mathrm{Tr}(S_W^\Phi) = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} K_{nn} - \sum_{k=1}^{K} \gamma_k R(x | C_k)$$
If the Radial Basis Function (RBF) kernel is used:
$$k(x_i, x_j) = \exp\{-(1/c) \|x_i - x_j\|^2\}$$
then the first term reduces to unity, thus:
$$\mathrm{Tr}(S_W^\Phi) = 1 - \sum_{k=1}^{K} \gamma_k R(x | C_k)$$
$R(x | C_k)$ captures the quadratic sum of the elements allocated to
the k-th cluster
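The RBF form of the criterion is easy to evaluate numerically. The following is a small sketch, assuming hard indicators stored as a cluster-by-point matrix `Z`; the helper names `rbf_gram`, `cluster_compactness` and `rbf_trace` are hypothetical.

```python
import numpy as np

def rbf_gram(X, c):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / c)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / c)

def cluster_compactness(K, Z):
    """Return gamma_k = N_k / N and R(x|C_k) = (1/N_k^2) sum_ij z_ki z_kj K_ij."""
    Nk = Z.sum(axis=1)
    gamma = Nk / K.shape[0]
    R = np.einsum("ki,ij,kj->k", Z, K, Z) / Nk**2
    return gamma, R

def rbf_trace(K, Z):
    """Tr(S_W^Phi) = 1 - sum_k gamma_k R(x|C_k); valid when K_nn = 1 (RBF kernel)."""
    gamma, R = cluster_compactness(K, Z)
    return float(1.0 - (gamma * R).sum())
```

For a single cluster holding every point, R is just the mean of all Gram matrix entries.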
Feature-space Clustering
For the RBF kernel, the following approximation holds due to the
convolution theorem for Gaussians (why?):
$$\int_x p(x)^2 \, dx \approx \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} K_{ij}$$
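One possible answer to the "why?", sketched under the assumption that $p(x)$ is replaced by a Parzen window estimate with isotropic Gaussian windows of width $\sigma$ (the symbols $\hat{p}$, $\mathcal{N}$ and $\sigma$ are introduced here and do not appear on the slide):

```latex
\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}(x;\, x_i,\, \sigma^2 I)
\\[4pt]
\int_x \hat{p}(x)^2 \, dx
  = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \int_x \mathcal{N}(x;\, x_i,\, \sigma^2 I)\, \mathcal{N}(x;\, x_j,\, \sigma^2 I)\, dx
  = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \mathcal{N}(x_i;\, x_j,\, 2\sigma^2 I)
\\[4pt]
\mathcal{N}(x_i;\, x_j,\, 2\sigma^2 I)
  = (4\pi\sigma^2)^{-D/2} \exp\left\{ -\frac{\|x_i - x_j\|^2}{4\sigma^2} \right\}
```

The integral of a product of two Gaussians is again a Gaussian in the difference of their means, with the variances added. Setting $c = 4\sigma^2$ recovers the RBF kernel up to the constant factor $(\pi c)^{-D/2}$, which does not affect the optimization.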
This being the case, then:
$$R(x | C_k) = \frac{1}{N_k^2} \sum_{i=1}^{N} \sum_{j=1}^{N} z_{ki} z_{kj} K_{ij} \approx \int_{x \in C_k} p(x | C_k)^2 \, dx$$
This makes sense for the clustering later on, because the integral
measures the compactness of the cluster.
Connection to probability and statistics; validation of the kernel
model. (What if the kernel is not RBF? Does this prove such kernels are not valid?)
Feature-space Clustering
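Make sense of the integral $\int_x p(x)^2 \, dx$:
$$\int_x p(x)^2 \, dx = E\{p(x)\}$$
Utilizing the Cauchy inequality in statistics:
$$E\{p(x) \cdot 1\} \le \sqrt{E\{p(x)^2\} \, E\{1\}}$$
Equality holds when $p(x) = a \cdot 1$, which suggests that the more
“uniformly” distributed the data, the more compact the cluster.
Examples:
Gaussians (one-dimensional, standard deviation $\sigma$; a smaller $\sigma$ gives a larger value):
$$\int_x p(x)^2 \, dx = \frac{1}{2 \sigma \sqrt{\pi}}$$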
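The Gaussian example can be checked numerically. The sketch below assumes a one-dimensional zero-mean Gaussian with standard deviation `sigma`, for which the integral of the squared density has the closed form 1 / (2 sigma sqrt(pi)); the helper names are hypothetical.

```python
import numpy as np

def integral_p_squared(pdf, lo, hi, n=200001):
    """Riemann-sum estimate of the integral of p(x)^2 over [lo, hi]."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    return float((pdf(x) ** 2).sum() * dx)

def gaussian_pdf(sigma):
    """Density of a zero-mean one-dimensional Gaussian with standard deviation sigma."""
    return lambda x: np.exp(-x**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))

narrow = integral_p_squared(gaussian_pdf(0.5), -20.0, 20.0)
wide = integral_p_squared(gaussian_pdf(2.0), -20.0, 20.0)
```

The narrower density yields the larger integral, which is the sense in which the integral measures compactness.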
Feature-space Clustering
The integral represented by $R(x | C_k)$ should be contrasted with the
Euclidean compactness measure defined by the sum-of-squares
term.
Now the optimization problem in feature space becomes:
$$Z = \arg\min_{Z} \mathrm{Tr}(S_W^\Phi) = \arg\max_{Z} \sum_{k=1}^{K} \gamma_k R(x | C_k)$$
Lemma: If the binary restriction on $z_{kn}$ is relaxed to $0 \le z_{kn} \le 1$,
then the optimum above is achieved with the matrix Z being binary.
Interpretation: the optimal partitioning of the data occurs only when
the partition indicators are 0 or 1.
This validates the use of stochastic methods in the optimization.
Stochastic Optimization
Define
$$D_{kj} = 1 - \frac{1}{N_k} \sum_{l=1}^{N} z_{kl} K_{jl}$$
as the penalty associated with assigning
the j-th data point to the k-th cluster in feature space.
Due to the nature of the RBF kernel, $k(x_i, x_j) = \exp\{-(1/c) \|x_i - x_j\|^2\}$, the
range of each element of K is (0, 1].
The second term of the penalty can be viewed as an estimate of the
conditional probability of the j-th data point given the k-th cluster.
The original objective of the optimization problem is manipulated into:
$$\mathrm{Tr}(S_W^\Phi) = 1 - \frac{1}{N} \sum_{j=1}^{N} \sum_{k=1}^{K} z_{kj} \sum_{l=1}^{N} \frac{z_{kl}}{N_k} K_{jl}$$
$$\mathrm{Tr}(S_W^\Phi) = \frac{1}{N} \sum_{j=1}^{N} \sum_{k=1}^{K} z_{kj} D_{kj}$$
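The penalty form of the objective can be sketched as below, assuming an RBF Gram matrix `K` and an indicator matrix `Z` with one row per cluster and no empty clusters. The function names are hypothetical.

```python
import numpy as np

def penalty_matrix(K, Z):
    """D_kj = 1 - (1/N_k) sum_l z_kl K_jl, returned with shape (n_clusters, N)."""
    Nk = Z.sum(axis=1)
    # (Z @ K)[k, j] = sum_l z_kl K_lj, which equals sum_l z_kl K_jl since K is symmetric.
    return 1.0 - (Z @ K) / Nk[:, None]

def trace_from_penalties(K, Z):
    """Tr(S_W^Phi) = (1/N) sum_j sum_k z_kj D_kj."""
    return float((Z * penalty_matrix(K, Z)).sum() / K.shape[0])
```

Each penalty lies in [0, 1) because the averaged kernel values lie in (0, 1].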
Stochastic Optimization
Analogous to the stochastic optimization in data space:
$$\mathrm{Tr}(S_W) = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} z_{kn} E_{kn}$$
where $E_{kn}$ is the sum-of-squares distance term.
Solved in the fashion of the Expectation Maximization algorithm:
The cluster indicator $z_{kn}$ is replaced by its
expectation, computed with the softmax function:
$$\langle z_{kn} \rangle = \frac{\exp(-\beta E_{kn}^{new})}{\sum_{k'=1}^{K} \exp(-\beta E_{k'n}^{new})}$$
where $\beta$ controls the softness of the assignments.
Each $E_{kn}^{new} = \|x_n - m_k\|^2$ is then updated using the newly
estimated expectation values of the indicators $z_{kn}$.
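The alternation between soft assignments and mean updates can be sketched as follows. This is an illustration only: a fixed `beta`, a fixed number of iterations, and the names `soft_assign` and `soft_kmeans` are assumptions made here, and no annealing of `beta` is attempted.

```python
import numpy as np

def soft_assign(E, beta):
    """<z_kn> = exp(-beta E_kn) / sum_k' exp(-beta E_k'n); E has shape (K, N)."""
    logits = -beta * E
    logits -= logits.max(axis=0, keepdims=True)   # stabilize the exponentials
    W = np.exp(logits)
    return W / W.sum(axis=0, keepdims=True)

def soft_kmeans(X, K, beta=5.0, n_iter=50, seed=0, init=None):
    rng = np.random.default_rng(seed)
    if init is None:
        # Assumption: use K distinct data points as the initial means.
        init = X[rng.choice(len(X), size=K, replace=False)]
    means = np.array(init, dtype=float)
    for _ in range(n_iter):
        # E_kn = ||x_n - m_k||^2 for the current means.
        E = ((X[None, :, :] - means[:, None, :]) ** 2).sum(axis=2)
        Z = soft_assign(E, beta)
        # Means re-estimated as expectation-weighted averages.
        means = (Z @ X) / Z.sum(axis=1, keepdims=True)
    return Z, means
```

A large `beta` pushes the expectations toward hard 0/1 assignments, and a small `beta` keeps them soft.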
Stochastic Optimization
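Similarly, the stochastic optimization in feature space:
$$\langle z_{kn} \rangle = \frac{\exp(-\beta y_{kn}^{new})}{\sum_{k'=1}^{K} \exp(-\beta y_{k'n}^{new})} = \frac{\alpha_k \exp(-2\beta D_{kn}^{new})}{\sum_{k'=1}^{K} \alpha_{k'} \exp(-2\beta D_{k'n}^{new})}$$
where:
$$\alpha_k = \exp(-\beta R(x | C_k)), \qquad D_{kn}^{new} = 1 - \frac{1}{N_k} \sum_{l=1}^{N} \langle z_{kl} \rangle K_{nl}$$
Note that $\alpha_k$ indicates the compactness of the k-th cluster.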
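One update of the feature-space expectations can be sketched as below. It assumes an RBF Gram matrix (so that K_nn = 1) and soft indicators `Z` whose columns sum to one; the function name `kernel_soft_step` is hypothetical.

```python
import numpy as np

def kernel_soft_step(K, Z, beta):
    """One update of <z_kn> from the current soft indicators Z, shape (n_clusters, N)."""
    Nk = Z.sum(axis=1)
    R = np.einsum("ki,ij,kj->k", Z, K, Z) / Nk**2     # R(x|C_k)
    D = 1.0 - (Z @ K) / Nk[:, None]                    # D_kn
    # log(alpha_k) - 2 beta D_kn, with alpha_k = exp(-beta R(x|C_k))
    logits = -beta * R[:, None] - 2.0 * beta * D
    logits -= logits.max(axis=0, keepdims=True)        # stabilize the exponentials
    W = np.exp(logits)
    return W / W.sum(axis=0, keepdims=True)
```

Since y_kn = 2 D_kn - 1 + R(x|C_k) when K_nn = 1, the softmax over -beta y_kn and the alpha-weighted form give the same expectations; the constant factor exp(beta) cancels in the normalization.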
Stochastic Search
Stochastic methods for optimization
Traditional and stochastic methods use different optimization criteria:
Traditional: error criterion. The backpropagation (BP) method strictly follows the gradient
descent direction. Any direction that increases the error is NOT acceptable, so it easily
gets stuck in local minima.
BM (Boltzmann machine): associates the system with an “energy”. Simulated annealing allows the
energy to increase with a certain probability.
Simulated Annealing
Simulated Annealing:
1. Create initial solution Z (global state of the system)
Initialize temperature T >> 1
2. Repeat until T = T-lower-bound
Repeat until thermal equilibrium is reached at
the current T
• Generate a random transition from Z to Z’
• Let ΔE = E(Z’) - E(Z)
• if ΔE < 0 then Z = Z’
• else if exp[-ΔE/T] > rand(0,1) then Z = Z’
Reduce temperature T according to the
cooling schedule
3. Return Z
The exp[-ΔE/T] term allows “thermal disturbance”, which facilitates
finding the global minimum.
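The annealing loop can be sketched in plain Python as follows. Several details are assumptions made for illustration: a geometric cooling schedule, a fixed number of proposals per temperature standing in for "thermal equilibrium", and returning the best state visited rather than the final one. The `energy` and `neighbor` callables are supplied by the user.

```python
import math
import random

def simulated_annealing(energy, neighbor, z0, t_start=10.0, t_min=1e-3,
                        cooling=0.9, sweeps=50, seed=0):
    rng = random.Random(seed)
    z, e = z0, energy(z0)
    best, best_e = z, e
    t = t_start
    while t > t_min:
        for _ in range(sweeps):            # stand-in for "until thermal equilibrium"
            z_new = neighbor(z, rng)       # random transition Z -> Z'
            e_new = energy(z_new)
            delta = e_new - e
            # Accept downhill moves always, uphill moves with probability exp(-delta / T).
            if delta < 0 or math.exp(-delta / t) > rng.random():
                z, e = z_new, e_new
                if e < best_e:
                    best, best_e = z, e
        t *= cooling                       # geometric cooling (assumption)
    return best, best_e
```

For example, `simulated_annealing(lambda z: (z - 3) ** 2, lambda z, rng: z + rng.choice((-1, 1)), z0=-20)` searches the integers for the minimizer of a simple quadratic energy.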
Nonparametric Clustering
Nonparametric: No assumptions on the number of clusters.
Observations:
the kernel matrix will have a block-diagonal structure when
there are definite clusters within the data.
the eigenvectors of a permuted matrix are the permutations of
the eigenvectors of the original matrix; therefore, an indication of the number
of clusters may be given by the eigen-decomposition of the
kernel matrix.
Recall the approximation:
$$\int_x p(x)^2 \, dx \approx \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} K_{ij}$$
Nonparametric Clustering
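Moreover,
$$\int_x p(x)^2 \, dx \approx \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} K_{ij} = 1_N^T K 1_N$$
where $1_N$ is the $N \times 1$ vector whose elements all equal $1/N$.
Eigen-decomposition of K gives:
$$K = U \Lambda U^T$$
Thus we have:
$$1_N^T K 1_N = 1_N^T \left\{ \sum_{i=1}^{N} \lambda_i u_i u_i^T \right\} 1_N = \sum_{i=1}^{N} \lambda_i \{1_N^T u_i\}^2$$
This indicates that if there are K distinct clusters within the data
samples, then there will be K dominant terms in $\lambda_i \{1_N^T u_i\}^2$ (Why?)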
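The terms of the eigen-expansion can be computed directly from the kernel matrix. This is a minimal sketch, assuming the vector 1_N has all elements equal to 1/N; the function name `eigen_terms` is hypothetical, and how many terms count as "dominant" is left to the reader.

```python
import numpy as np

def eigen_terms(K):
    """Terms lambda_i * (1_N^T u_i)^2 of the expansion, sorted in decreasing order."""
    N = K.shape[0]
    ones = np.full(N, 1.0 / N)        # 1_N with elements 1/N
    lam, U = np.linalg.eigh(K)        # K is symmetric; columns of U are eigenvectors
    terms = lam * (ones @ U) ** 2
    return np.sort(terms)[::-1]
```

The terms sum to the mean of all kernel matrix entries. For tight, well-separated clusters of unequal size, the leading terms are roughly (N_k / N)^2, one per cluster, and the remaining terms are close to zero.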
Nonparametric Clustering
Examples on phantom data sets:
Results and Discussion
Results on phantom 3 data sets: Fisher Iris; Wine data set; Crabs
data.
Results and Discussion
Conclusions and discussions:
the mean vectors in feature space may not serve as
representatives or prototypes of the input-space clusters.
the block-diagonal structure of the kernel matrix can be
exploited in estimating the number of possible clusters.
choice of kernel will be data specific.
the RBF kernels link the sum-of-squares criterion with the
probability metric.
the choice of the parameter of RBF kernel should be
determined by the cross-validation or the leave-one-out
technique.
eigen-decomposition of N x N kernel matrix scales as O(N^3)
Results and Discussion
Remarks of my own:
most appealing point is the link between distance metric and
the probability metric.
unclear why stochastic optimization is preferred
over ordinary optimization methods.
no assessment on other types of kernels.
unclear about how to permute the kernel matrix to get the
block-diagonal structure.
the “super technical” term “dominant” for $\lambda_i \{1_N^T u_i\}^2$ in the
nonparametric part is too vague; it needs some quantification.
References
“Data clustering and data visualization”, in Learning in Graphical
Models,1998.
“A projection pursuit algorithm for exploratory data analysis”, IEEE
Trans. Comput., 1974.
“An algorithm for Euclidean sum-of-squares classification”,
Biometrics, 1988
“Maximum certainty data partitioning”, Pattern Recognition, 2000.
“An expectation maximization approach to nonlinear component
analysis”, Neural Comput., 2001
Questions?
Thank you!