Transcript Document

CLUSTERING
 Basic Concepts
In clustering, or unsupervised learning, no training data
with class labels are available. The goal is to group the
data into a number of sensible clusters (groups), thereby
revealing similarities and differences among the available
data.
 Applications:
• Engineering
• Bioinformatics
• Social Sciences
• Medicine
• Data and Web Mining
 To perform clustering of a data set, a clustering
criterion must first be adopted. Different clustering
criteria lead, in general, to different clusters.
A simple example
Clustering criterion 1: how the animals bear their progeny (two clusters):
• {blue shark, sheep, cat, dog}
• {lizard, sparrow, viper, seagull, gold fish, frog, red mullet}
Clustering criterion 2: the existence of lungs (two clusters):
• {gold fish, red mullet, blue shark}
• {sheep, sparrow, dog, cat, seagull, lizard, frog, viper}
Clustering task stages
• Feature selection: choose information-rich features, with parsimony.
• Proximity measure: quantifies the terms "similar" and "dissimilar".
• Clustering criterion: a cost function or some other type of rule.
• Clustering algorithm: the set of steps followed to reveal the structure, based on the similarity measure and the adopted criterion.
• Validation of the results.
• Interpretation of the results.
Depending on the similarity measure, the clustering
criterion, and the clustering algorithm, different clusters
may result. Subjectivity is a reality we have to live with from
now on.
A simple example: how many clusters? 2 or 4?
[Figure: a point set that can plausibly be read as either two or four clusters.]
 Basic application areas for clustering
 Data reduction. All data vectors within a cluster are
substituted (represented) by the corresponding cluster
representative.
 Hypothesis generation.
 Hypothesis testing.
 Prediction based on groups.
 Clustering Definitions
Hard clustering: each point belongs to a single cluster.
• Let $X = \{x_1, x_2, \dots, x_N\}$.
• An m-clustering R of X is defined as the partition of X into m sets (clusters) $C_1, C_2, \dots, C_m$, so that
– $C_i \neq \emptyset, \ i = 1, 2, \dots, m$
– $\bigcup_{i=1}^{m} C_i = X$
– $C_i \cap C_j = \emptyset, \ i \neq j, \ i, j = 1, 2, \dots, m$
In addition, data in $C_i$ are more similar to each
other and less similar to the data in the rest of the
clusters. Quantifying the terms similar and dissimilar
depends on the types of clusters that are expected
to underlie the structure of X.
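As a quick illustration (my own sketch, not part of the slides), the three partition conditions can be checked mechanically for a candidate hard clustering:

```python
def is_hard_clustering(X, clusters):
    """Check the three partition conditions for a candidate m-clustering.

    X: an iterable of hashable data points; clusters: a list of sets C_1..C_m.
    """
    if any(len(C) == 0 for C in clusters):   # C_i must be non-empty
        return False
    if set().union(*clusters) != set(X):     # the union of all C_i must equal X
        return False
    # pairwise disjoint: total cluster size equals the size of the union
    return sum(len(C) for C in clusters) == len(set(X))

print(is_hard_clustering({1, 2, 3, 4, 5}, [{1, 2}, {3, 4, 5}]))    # True
print(is_hard_clustering({1, 2, 3, 4, 5}, [{1, 2, 3}, {3, 4, 5}]))  # False: overlap
```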
Fuzzy clustering: each point belongs to all clusters up
to some degree.
A fuzzy clustering of X into m clusters is characterized
by m functions:
• $u_j : X \rightarrow [0, 1], \ j = 1, 2, \dots, m$
• $\sum_{j=1}^{m} u_j(x_i) = 1, \ i = 1, 2, \dots, N$
• $0 < \sum_{i=1}^{N} u_j(x_i) < N, \ j = 1, 2, \dots, m$
These are known as membership functions. Thus, each $x_i$
belongs to any cluster "up to some degree", depending on
the values of $u_j(x_i), \ j = 1, 2, \dots, m$:
• $u_j(x_i)$ close to 1 ⇒ high grade of membership of $x_i$ in cluster j.
• $u_j(x_i)$ close to 0 ⇒ low grade of membership.
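A minimal sketch (assumed illustration, not from the slides) of a membership matrix satisfying the three conditions above; `U[i, j]` plays the role of $u_j(x_i)$:

```python
import numpy as np

# Membership matrix for N = 4 points and m = 2 clusters: U[i, j] = u_j(x_i)
U = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.3, 0.7],
              [0.1, 0.9]])

N, m = U.shape
assert np.allclose(U.sum(axis=1), 1.0)                    # sum_j u_j(x_i) = 1 per point
assert np.all((U.sum(axis=0) > 0) & (U.sum(axis=0) < N))  # 0 < sum_i u_j(x_i) < N
print(U.sum(axis=0))  # total membership mass carried by each cluster
```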
TYPES OF FEATURES
With respect to their domain:
• Continuous (the domain is a continuous subset of $\mathbb{R}$).
• Discrete (the domain is a finite discrete set).
– Binary or dichotomous (the domain consists of two possible values).
With respect to the relative significance of the values they take:
• Nominal (the values code states, e.g., the sex of an individual).
• Ordinal (the values are meaningfully ordered, e.g., the rating of a hotel's services: poor, good, very good, excellent).
• Interval-scaled (the difference of two values is meaningful but their ratio is meaningless, e.g., temperature).
• Ratio-scaled (the ratio of two values is meaningful, e.g., weight).
PROXIMITY MEASURES
 Between vectors
A dissimilarity measure (between vectors of X) is a function
$d: X \times X \rightarrow \mathbb{R}$
with the following properties:
• $\exists d_0 \in \mathbb{R}: \ -\infty < d_0 \le d(x, y) < +\infty, \ \forall x, y \in X$
• $d(x, x) = d_0, \ \forall x \in X$
• $d(x, y) = d(y, x), \ \forall x, y \in X$
If in addition
• $d(x, y) = d_0$ if and only if $x = y$
• $d(x, z) \le d(x, y) + d(y, z), \ \forall x, y, z \in X$ (triangular inequality)
then d is called a metric dissimilarity measure.
A similarity measure (between vectors of X) is a function
$s: X \times X \rightarrow \mathbb{R}$
with the following properties:
• $\exists s_0 \in \mathbb{R}: \ -\infty < s(x, y) \le s_0 < +\infty, \ \forall x, y \in X$
• $s(x, x) = s_0, \ \forall x \in X$
• $s(x, y) = s(y, x), \ \forall x, y \in X$
If in addition
• $s(x, y) = s_0$ if and only if $x = y$
• $s(x, y)\, s(y, z) \le [s(x, y) + s(y, z)]\, s(x, z), \ \forall x, y, z \in X$
then s is called a metric similarity measure.
Between sets
Let $D_i \subset X$, $i = 1, \dots, k$, and $U = \{D_1, \dots, D_k\}$.
A proximity measure $\wp$ on U is a function
$\wp: U \times U \rightarrow \mathbb{R}$
A dissimilarity measure has to satisfy the relations of a dissimilarity
measure between vectors, with the $D_i$'s used in place of x, y
(similarly for similarity measures).
PROXIMITY MEASURES BETWEEN VECTORS
Real-valued vectors
Dissimilarity measures (DMs)
• Weighted $l_p$ metric DMs:
$d_p(x, y) = \left( \sum_{i=1}^{l} w_i |x_i - y_i|^p \right)^{1/p}$
Interesting instances are obtained for:
– p = 1 (weighted Manhattan norm)
– p = 2 (weighted Euclidean norm)
– p = ∞: $d_\infty(x, y) = \max_{1 \le i \le l} w_i |x_i - y_i|$
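A minimal NumPy sketch (my illustration, not from the slides) of the weighted $l_p$ dissimilarity, including the $p = \infty$ case:

```python
import numpy as np

def weighted_lp(x, y, w, p):
    """Weighted l_p dissimilarity: d_p(x, y) = (sum_i w_i |x_i - y_i|^p)^(1/p)."""
    diff = np.abs(np.asarray(x, float) - np.asarray(y, float))
    if p == np.inf:                       # d_inf(x, y) = max_i w_i |x_i - y_i|
        return np.max(w * diff)
    return np.sum(w * diff ** p) ** (1.0 / p)

x, y, w = [1.0, 2.0, 3.0], [2.0, 0.0, 3.5], np.ones(3)
print(weighted_lp(x, y, w, 1))        # weighted Manhattan norm
print(weighted_lp(x, y, w, 2))        # weighted Euclidean norm
print(weighted_lp(x, y, w, np.inf))   # weighted max norm
```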
• Other measures:
– $d_G(x, y) = -\log_{10} \left( 1 - \frac{1}{l} \sum_{j=1}^{l} \frac{|x_j - y_j|}{b_j - a_j} \right)$
where $b_j$ and $a_j$ are the maximum and the minimum
values of the j-th feature, among the vectors of X
(dependence on the current data set).
– $d_Q(x, y) = \sqrt{ \frac{1}{l} \sum_{j=1}^{l} \left( \frac{x_j - y_j}{x_j + y_j} \right)^2 }$
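A sketch of these two data-dependent measures; note that $a_j$ and $b_j$ are recomputed from whatever data set X is at hand (the toy array below is an assumption):

```python
import numpy as np

def d_G(x, y, a, b):
    """d_G(x, y) = -log10(1 - (1/l) * sum_j |x_j - y_j| / (b_j - a_j)),
    where a_j, b_j are per-feature minima and maxima over the data set X."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return -np.log10(1.0 - np.sum(np.abs(x - y) / (b - a)) / x.size)

def d_Q(x, y):
    """d_Q(x, y) = sqrt((1/l) * sum_j ((x_j - y_j) / (x_j + y_j))^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.mean(((x - y) / (x + y)) ** 2))

X = np.array([[1.0, 5.0], [2.0, 7.0], [3.0, 6.0]])  # toy data set
a, b = X.min(axis=0), X.max(axis=0)                  # per-feature min/max
print(d_G(X[0], X[1], a, b), d_Q(X[0], X[1]))
```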
Similarity measures
• Inner product:
$s_{\mathrm{inner}}(x, y) = x^T y = \sum_{i=1}^{l} x_i y_i$
• Tanimoto measure:
$s_T(x, y) = \dfrac{x^T y}{\|x\|^2 + \|y\|^2 - x^T y}$
• Also:
$s_T(x, y) = 1 - \dfrac{d_2(x, y)}{\|x\| + \|y\|}$
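A small sketch (illustrative, not from the slides) of the inner-product and Tanimoto similarities for real-valued vectors:

```python
import numpy as np

def s_inner(x, y):
    """Inner-product similarity: s_inner(x, y) = x^T y."""
    return float(np.dot(x, y))

def s_tanimoto(x, y):
    """Tanimoto measure: s_T(x, y) = x^T y / (||x||^2 + ||y||^2 - x^T y)."""
    xy = np.dot(x, y)
    return float(xy / (np.dot(x, x) + np.dot(y, y) - xy))

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 2.0])
print(s_inner(x, y))     # 12.0
print(s_tanimoto(x, y))  # 12 / (14 + 12 - 12) ~= 0.857
```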
Discrete-valued vectors
• Let $F = \{0, 1, \dots, k-1\}$ be a set of symbols and $X = \{x_1, \dots, x_N\} \subset F^l$.
• Let $A(x, y) = [a_{ij}]$, $i, j = 0, 1, \dots, k-1$, where $a_{ij}$ is the number of places where
x has the i-th symbol and y has the j-th symbol.
NOTE: $\sum_{i=0}^{k-1} \sum_{j=0}^{k-1} a_{ij} = l$
Several proximity measures can be expressed as combinations of the
elements of A(x, y).
• Dissimilarity measures:
– The Hamming distance (number of places where x and y differ):
$d_H(x, y) = \sum_{i=0}^{k-1} \sum_{j=0,\, j \ne i}^{k-1} a_{ij}$
– The $l_1$ distance:
$d_1(x, y) = \sum_{i=1}^{l} |x_i - y_i|$
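A sketch (my illustration) of the contingency matrix $A(x, y)$ and the Hamming distance read off its off-diagonal entries:

```python
import numpy as np

def contingency(x, y, k):
    """A(x, y)[i, j] = number of places where x has symbol i and y has symbol j."""
    A = np.zeros((k, k), dtype=int)
    for xi, yi in zip(x, y):
        A[xi, yi] += 1
    return A

def hamming(x, y, k):
    """d_H(x, y) = sum of the off-diagonal entries of A(x, y) = number of disagreements."""
    A = contingency(x, y, k)
    return int(A.sum() - np.trace(A))

x, y = [0, 1, 2, 1, 0], [0, 2, 2, 1, 1]
print(contingency(x, y, k=3))  # entries sum to l = 5
print(hamming(x, y, k=3))      # 2 (positions 1 and 4 differ)
```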
Similarity measures:
• Tanimoto measure:
$s_T(x, y) = \dfrac{\sum_{i=1}^{k-1} a_{ii}}{n_x + n_y - \sum_{i=1}^{k-1} \sum_{j=1}^{k-1} a_{ij}}$
where $n_x = \sum_{i=1}^{k-1} \sum_{j=0}^{k-1} a_{ij}$ and $n_y = \sum_{i=0}^{k-1} \sum_{j=1}^{k-1} a_{ij}$
• Measures that exclude $a_{00}$: $\ \sum_{i=1}^{k-1} a_{ii} / l\ $ and $\ \sum_{i=1}^{k-1} a_{ii} / (l - a_{00})$
• Measures that include $a_{00}$: $\ \sum_{i=0}^{k-1} a_{ii} / l$
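Continuing in the same style (the `contingency` helper is repeated so the sketch stands alone), the discrete Tanimoto measure and the matching coefficients that exclude or include $a_{00}$:

```python
import numpy as np

def contingency(x, y, k):
    """A(x, y)[i, j] = number of places where x has symbol i and y has symbol j."""
    A = np.zeros((k, k), dtype=int)
    for xi, yi in zip(x, y):
        A[xi, yi] += 1
    return A

def tanimoto_discrete(x, y, k):
    """s_T = sum_{i>=1} a_ii / (n_x + n_y - sum_{i,j>=1} a_ij)."""
    A = contingency(x, y, k)
    n_x = A[1:, :].sum()   # places where x carries a nonzero symbol
    n_y = A[:, 1:].sum()   # places where y carries a nonzero symbol
    return np.trace(A[1:, 1:]) / (n_x + n_y - A[1:, 1:].sum())

def matching_coefficients(x, y, k):
    """Return (sum_{i>=1} a_ii / l,  sum_{i>=1} a_ii / (l - a_00),  sum_{i>=0} a_ii / l)."""
    A = contingency(x, y, k)
    l = A.sum()
    agree_nonzero = np.trace(A[1:, 1:])
    return agree_nonzero / l, agree_nonzero / (l - A[0, 0]), np.trace(A) / l

x, y = [0, 1, 2, 1, 0], [0, 2, 2, 1, 1]
print(tanimoto_discrete(x, y, k=3))      # 2 / (3 + 4 - 3) = 0.5
print(matching_coefficients(x, y, k=3))  # (0.4, 0.5, 0.6)
```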
Mixed-valued vectors
Some of the coordinates of the vectors x are real-valued and the rest are
discrete. Two methods for measuring the proximity between two such
vectors $x_i$ and $x_j$:
• Adopt a proximity measure (PM) suitable for real-valued vectors.
• Convert the real-valued features to discrete ones and employ a
discrete PM.
In the more general case of mixed-valued vectors, nominal, ordinal,
interval-scaled, and ratio-scaled features are treated separately.
The similarity function between $x_i$ and $x_j$ is:
$s(x_i, x_j) = \dfrac{\sum_{q=1}^{l} s_q(x_i, x_j)}{\sum_{q=1}^{l} w_q}$
In the above definition:
• $w_q = 0$, if at least one of the q-th coordinates of $x_i$ and $x_j$ is
undefined or both q-th coordinates are equal to 0.
Otherwise $w_q = 1$.
• If the q-th coordinates are binary, $s_q(x_i, x_j) = 1$ if $x_{iq} = x_{jq} = 1$ and 0
otherwise.
• If the q-th coordinates are nominal or ordinal, $s_q(x_i, x_j) = 1$ if $x_{iq} = x_{jq}$
and 0 otherwise.
• If the q-th coordinates are interval- or ratio-scaled,
$s_q(x_i, x_j) = 1 - |x_{iq} - x_{jq}| / r_q$
where $r_q$ is the length of the interval where the q-th coordinates of the
vectors of the data set X lie.
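A sketch of this composite (Gower-style) similarity; the feature-type encoding and the use of `None` for undefined coordinates are my own assumptions:

```python
def mixed_similarity(xi, xj, types, r):
    """s(x_i, x_j) = sum_q s_q(x_i, x_j) / sum_q w_q.

    types[q] in {"binary", "nominal", "ordinal", "interval", "ratio"};
    r[q] is the range of the q-th feature over the data set X
    (used for interval/ratio features); None encodes an undefined coordinate.
    """
    s_sum, w_sum = 0.0, 0.0
    for q, t in enumerate(types):
        a, b = xi[q], xj[q]
        # w_q = 0 if a coordinate is undefined or both are 0; skipping a term
        # is equivalent to setting s_q = 0 for it
        if a is None or b is None or (a == 0 and b == 0):
            continue
        w_sum += 1.0
        if t == "binary":
            s_sum += 1.0 if a == b == 1 else 0.0
        elif t in ("nominal", "ordinal"):
            s_sum += 1.0 if a == b else 0.0
        else:  # interval- or ratio-scaled
            s_sum += 1.0 - abs(a - b) / r[q]
    return s_sum / w_sum

xi = [1, "red", 20.0]
xj = [1, "blue", 25.0]
print(mixed_similarity(xi, xj, ["binary", "nominal", "ratio"], r=[None, None, 50.0]))
```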
Fuzzy measures
Let $x, y \in [0, 1]^l$. Here the value of the i-th coordinate, $x_i$, of x is
not the outcome of a measuring device:
• The closer $x_i$ is to 1 (resp. 0), the more likely it is that x possesses
(resp. does not possess) the i-th characteristic.
• As $x_i$ approaches 0.5, the certainty about the possession or
not of the i-th feature by x decreases.
A possible similarity measure that can quantify the above is:
$s(x_i, y_i) = \max(\min(1 - x_i, 1 - y_i), \min(x_i, y_i))$
Then:
$s_F^q(x, y) = \left( \sum_{i=1}^{l} s(x_i, y_i)^q \right)^{1/q}$
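A short illustrative sketch (not from the slides) of this fuzzy similarity:

```python
def s_coord(xi, yi):
    """Per-coordinate fuzzy similarity: max(min(1 - x_i, 1 - y_i), min(x_i, y_i))."""
    return max(min(1 - xi, 1 - yi), min(xi, yi))

def s_F(x, y, q=2):
    """s_F^q(x, y) = (sum_i s(x_i, y_i)^q)^(1/q)."""
    return sum(s_coord(xi, yi) ** q for xi, yi in zip(x, y)) ** (1.0 / q)

x, y = [0.9, 0.1, 0.5], [0.8, 0.2, 0.5]
print(s_F(x, y, q=2))  # coordinates near 0.5 contribute the least certainty
```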
Missing data
For some vectors of the data set X, some feature values may be unknown.
Ways to face the problem:
• Discard all vectors with missing values (not recommended for small
data sets).
• Find the mean value $m_i$ of the available i-th feature values over the
data set and substitute the missing i-th feature values with $m_i$.
• Define $b_i = 0$, if both the i-th features $x_i, y_i$ are available, and 1
otherwise. Then:
$\wp(x, y) = \dfrac{l}{l - \sum_{i=1}^{l} b_i} \sum_{\text{all } i:\, b_i = 0} \phi(x_i, y_i)$
where $\phi(x_i, y_i)$ denotes the PM between the two scalars $x_i, y_i$.
• Find the average proximities $\phi_{\mathrm{avg}}(i)$ between all feature vectors in X
along each component. Then:
$\wp(x, y) = \sum_{i=1}^{l} \psi(x_i, y_i)$
where $\psi(x_i, y_i) = \phi(x_i, y_i)$, if both $x_i$ and $y_i$ are available, and $\phi_{\mathrm{avg}}(i)$
otherwise.
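A sketch of the third strategy (rescaling the proximity over the available coordinates); `None` encodes a missing value, and taking $\phi$ to be the absolute difference is an assumption:

```python
def proximity_with_missing(x, y, phi=lambda a, b: abs(a - b)):
    """p(x, y) = l / (l - sum_i b_i) * sum over coordinates where both exist
    of phi(x_i, y_i), i.e. rescale the partial sum to the full dimension l."""
    l = len(x)
    available = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n_missing = l - len(available)
    return (l / (l - n_missing)) * sum(phi(a, b) for a, b in available)

x = [1.0, None, 3.0, 4.0]
y = [2.0, 5.0, None, 6.0]
print(proximity_with_missing(x, y))  # 4/2 * (|1-2| + |4-6|) = 6.0
```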
PROXIMITY FUNCTIONS BETWEEN A VECTOR AND A SET
• Let $X = \{x_1, x_2, \dots, x_N\}$, $C \subset X$ and $x \in X$.
• All points of C contribute to the definition of $\wp(x, C)$:
– Max proximity function:
$\wp_{\max}^{ps}(x, C) = \max_{y \in C} \wp(x, y)$
– Min proximity function:
$\wp_{\min}^{ps}(x, C) = \min_{y \in C} \wp(x, y)$
– Average proximity function:
$\wp_{\mathrm{avg}}^{ps}(x, C) = \dfrac{1}{n_C} \sum_{y \in C} \wp(x, y)$
($n_C$ is the cardinality of C)
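A sketch of the three point-to-set functions, with the Euclidean distance standing in for the proximity measure $\wp$ (an assumption):

```python
import numpy as np

def dist(x, y):
    """Euclidean distance, used here as the proximity measure."""
    return np.linalg.norm(np.asarray(x) - np.asarray(y))

def ps_max(x, C):   # max proximity function
    return max(dist(x, y) for y in C)

def ps_min(x, C):   # min proximity function
    return min(dist(x, y) for y in C)

def ps_avg(x, C):   # average proximity function (n_C = |C|)
    return sum(dist(x, y) for y in C) / len(C)

C = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
x = [1.0, 1.0]
print(ps_max(x, C), ps_min(x, C), ps_avg(x, C))
```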
• A representative of C, $r_C$, contributes to the definition of $\wp(x, C)$.
In this case: $\wp(x, C) = \wp(x, r_C)$.
Typical representatives are:
– The mean vector:
$m_p = \dfrac{1}{n_C} \sum_{y \in C} y$
where $n_C$ is the cardinality of C.
– The mean center $m_C \in C$:
$\sum_{y \in C} d(m_C, y) \le \sum_{y \in C} d(z, y), \ \forall z \in C$
where d is a dissimilarity measure.
– The median center $m_{\mathrm{med}} \in C$:
$\mathrm{med}(d(m_{\mathrm{med}}, y) \mid y \in C) \le \mathrm{med}(d(z, y) \mid y \in C), \ \forall z \in C$
NOTE: Other representatives (e.g., hyperplanes, hyperspheres) are
useful in certain applications (e.g., object identification using
clustering techniques).
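A sketch of the three representatives, with the Euclidean distance assumed as the dissimilarity d; note how the mean vector is pulled by an outlier while the center-based representatives, which must belong to C, are not:

```python
import numpy as np

def mean_vector(C):
    """m_p = (1/n_C) * sum_{y in C} y; need not belong to C."""
    return np.mean(C, axis=0)

def mean_center(C, d):
    """The point of C minimizing the summed dissimilarity to all of C."""
    return min(C, key=lambda z: sum(d(z, y) for y in C))

def median_center(C, d):
    """The point of C minimizing the median dissimilarity to all of C."""
    return min(C, key=lambda z: np.median([d(z, y) for y in C]))

d = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))
C = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([10.0, 0.0])]
print(mean_vector(C))      # [3.667, 0.0] -- dragged toward the outlier
print(mean_center(C, d))   # [1.0, 0.0]
print(median_center(C, d)) # [0.0, 0.0]
```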
PROXIMITY FUNCTIONS BETWEEN SETS
• Let $X = \{x_1, \dots, x_N\}$, $D_i, D_j \subset X$ and $n_i = |D_i|$, $n_j = |D_j|$.
• All points of each set contribute to $\wp(D_i, D_j)$:
– Max proximity function (a measure, but not a metric, only if $\wp$ is a
similarity measure):
$\wp_{\max}^{ss}(D_i, D_j) = \max_{x \in D_i,\, y \in D_j} \wp(x, y)$
– Min proximity function (a measure, but not a metric, only if $\wp$ is a
dissimilarity measure):
$\wp_{\min}^{ss}(D_i, D_j) = \min_{x \in D_i,\, y \in D_j} \wp(x, y)$
– Average proximity function (not a measure, even if $\wp$ is a
measure):
$\wp_{\mathrm{avg}}^{ss}(D_i, D_j) = \dfrac{1}{n_i n_j} \sum_{x \in D_i} \sum_{y \in D_j} \wp(x, y)$
• Each set $D_i$ is represented by its representative vector $m_i$.
– Mean proximity function (it is a measure provided that $\wp$ is a
measure):
$\wp_{\mathrm{mean}}^{ss}(D_i, D_j) = \wp(m_i, m_j)$
– Also:
$\wp_{e}^{ss}(D_i, D_j) = \sqrt{\dfrac{n_i n_j}{n_i + n_j}}\ \wp(m_i, m_j)$
NOTE: Proximity functions between a vector x and a set C may be
derived from the above functions if we set $D_i = \{x\}$.
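A sketch of the set-to-set functions, again assuming the Euclidean distance for $\wp$ and the arithmetic mean as each set's representative:

```python
import numpy as np

def dist(x, y):
    """Euclidean distance, used here as the proximity measure."""
    return np.linalg.norm(np.asarray(x) - np.asarray(y))

def ss_max(Di, Dj):    # max proximity function
    return max(dist(x, y) for x in Di for y in Dj)

def ss_min(Di, Dj):    # min proximity function
    return min(dist(x, y) for x in Di for y in Dj)

def ss_avg(Di, Dj):    # average proximity function
    return sum(dist(x, y) for x in Di for y in Dj) / (len(Di) * len(Dj))

def ss_mean(Di, Dj):   # mean proximity function: distance between representatives
    return dist(np.mean(Di, axis=0), np.mean(Dj, axis=0))

def ss_e(Di, Dj):      # sqrt(n_i n_j / (n_i + n_j)) * proximity of representatives
    ni, nj = len(Di), len(Dj)
    return np.sqrt(ni * nj / (ni + nj)) * ss_mean(Di, Dj)

Di = [[0.0, 0.0], [1.0, 0.0]]
Dj = [[4.0, 0.0], [5.0, 0.0], [6.0, 0.0]]
print(ss_max(Di, Dj), ss_min(Di, Dj), ss_avg(Di, Dj), ss_mean(Di, Dj), ss_e(Di, Dj))
```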
Remarks:
• Different choices of proximity functions between sets may
lead to totally different clustering results.
• Different proximity measures between vectors, used within the same
proximity function between sets, may also lead to totally different
clustering results.
• The only way to achieve a proper clustering is
– by trial and error, and
– by taking into account the opinion of an expert in the field of
application.