CLUSTERING
Basic Concepts
In clustering, or unsupervised learning, no training data
with class labels are available. The goal becomes:
Group the data into a number of sensible clusters
(groups). This reveals similarities and differences among
the available data.
Applications:
• Engineering
• Bioinformatics
• Social Sciences
• Medicine
• Data and Web Mining
To perform clustering of a data set, a clustering
criterion must first be adopted. Different clustering
criteria lead, in general, to different clusters.
A simple example
Clustering 1 (two clusters; criterion: the way the animals bear their progeny):
• {blue shark, sheep, cat, dog}
• {lizard, sparrow, viper, seagull, gold fish, frog, red mullet}
Clustering 2 (two clusters; criterion: existence of lungs):
• {gold fish, red mullet, blue shark}
• {sheep, sparrow, dog, cat, seagull, lizard, frog, viper}
Clustering task stages
Feature Selection: select information-rich features (parsimony).
Proximity Measure: this quantifies the terms "similar" and
"dissimilar".
Clustering Criterion: This consists of a cost function or
some type of rules.
Clustering Algorithm: This consists of the set of
steps followed to reveal the structure, based on the
similarity measure and the adopted criterion.
Validation of the results.
Interpretation of the results.
Depending on the similarity measure, the clustering
criterion and the clustering algorithm different clusters
may result. Subjectivity is a reality to live with from
now on.
A simple example: how many clusters? 2 or 4?
Basic application areas for clustering
Data reduction. All data vectors within a cluster are
substituted (represented) by the corresponding cluster
representative.
Hypothesis generation.
Hypothesis testing.
Prediction based on groups.
Clustering Definitions
Hard Clustering: Each point belongs to a single cluster
• Let $X = \{x_1, x_2, \ldots, x_N\}$
• An m-clustering R of X is defined as the
partition of X into m sets (clusters) $C_1,
C_2, \ldots, C_m$, so that
– $C_i \neq \emptyset,\ i = 1, 2, \ldots, m$
– $\bigcup_{i=1}^{m} C_i = X$
– $C_i \cap C_j = \emptyset,\ i \neq j,\ i, j = 1, 2, \ldots, m$
In addition, data in Ci are more similar to each
other and less similar to the data in the rest of the
clusters. Quantifying the terms similar-dissimilar
depends on the types of clusters that are expected
to underlie the structure of X.
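As an illustration (not from the slides), the three partition conditions can be checked in Python; the function name and the example data are ours:

```python
def is_hard_clustering(X, clusters):
    """Check the three hard m-clustering conditions on a candidate partition."""
    # C_i must be non-empty for every i
    if any(len(C) == 0 for C in clusters):
        return False
    # The union of all C_i equals X and the C_i are pairwise disjoint:
    # both hold iff every point of X appears in exactly one cluster.
    counts = {x: 0 for x in X}
    for C in clusters:
        for x in C:
            if x not in counts:
                return False  # cluster contains a point outside X
            counts[x] += 1
    return all(c == 1 for c in counts.values())

X = {"sheep", "dog", "cat", "blue shark"}
print(is_hard_clustering(X, [{"sheep", "dog", "cat"}, {"blue shark"}]))       # True
print(is_hard_clustering(X, [{"sheep", "dog"}, {"dog", "cat", "blue shark"}]))  # False: "dog" overlaps
```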
Fuzzy clustering: Each point belongs to all clusters up
to some degree.
A fuzzy clustering of X into m clusters is characterized
by m functions
• $u_j : X \to [0, 1],\quad j = 1, 2, \ldots, m$
• $\sum_{j=1}^{m} u_j(x_i) = 1,\quad i = 1, 2, \ldots, N$
• $0 < \sum_{i=1}^{N} u_j(x_i) < N,\quad j = 1, 2, \ldots, m$
These are known as membership functions.
Thus, each $x_i$ belongs to any cluster "up to
some degree", depending on the value of
$u_j(x_i),\ j = 1, 2, \ldots, m$:
$u_j(x_i)$ close to 1 $\Rightarrow$ high grade of
membership of $x_i$ in cluster j.
$u_j(x_i)$ close to 0 $\Rightarrow$ low grade of membership.
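The three fuzzy-clustering conditions can be sketched as a check on a membership matrix U with U[i, j] = u_j(x_i); the function name and tolerance are illustrative, not from the slides:

```python
import numpy as np

def is_fuzzy_clustering(U):
    """U: N x m membership matrix. Check the three fuzzy-clustering conditions."""
    N, m = U.shape
    in_range = np.all((U >= 0) & (U <= 1))           # u_j : X -> [0, 1]
    rows_sum_to_one = np.allclose(U.sum(axis=1), 1)  # sum_j u_j(x_i) = 1 for each i
    col_sums = U.sum(axis=0)
    no_trivial = np.all((col_sums > 0) & (col_sums < N))  # 0 < sum_i u_j(x_i) < N
    return bool(in_range and rows_sum_to_one and no_trivial)

U = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
print(is_fuzzy_clustering(U))  # True
```

The third condition rules out clusters that contain everything or nothing.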
TYPES OF FEATURES
With respect to their domain
Continuous (the domain is a continuous subset of $\mathbb{R}$).
Discrete (the domain is a finite discrete set).
• Binary or dichotomous (the domain consists of two possible values).
With respect to the relative significance of the values they
take
Nominal (the values code states, e.g., the sex of an individual).
Ordinal (the values are meaningfully ordered, e.g., the rating of the
services of a hotel (poor, good, very good, excellent)).
Interval-scaled (the difference of two values is meaningful but their
ratio is meaningless, e.g., temperature).
Ratio-scaled (the ratio of two values is meaningful, e.g., weight).
PROXIMITY MEASURES
Between vectors
Dissimilarity measure (between vectors of X) is a
function
$d : X \times X \to \mathbb{R}$
with the following properties
• $\exists d_0 \in \mathbb{R} :\ -\infty < d_0 \le d(x, y) < +\infty,\ \forall x, y \in X$
• $d(x, x) = d_0,\ \forall x \in X$
• $d(x, y) = d(y, x),\ \forall x, y \in X$
If in addition
• $d(x, y) = d_0$ if and only if $x = y$
• $d(x, z) \le d(x, y) + d(y, z),\ \forall x, y, z \in X$
(triangle inequality)
then d is called a metric dissimilarity measure.
Similarity measure (between vectors of X) is a
function
$s : X \times X \to \mathbb{R}$
with the following properties
• $\exists s_0 \in \mathbb{R} :\ -\infty < s(x, y) \le s_0 < +\infty,\ \forall x, y \in X$
• $s(x, x) = s_0,\ \forall x \in X$
• $s(x, y) = s(y, x),\ \forall x, y \in X$
If in addition
• $s(x, y) = s_0$ if and only if $x = y$
• $s(x, y)\, s(y, z) \le [s(x, y) + s(y, z)]\, s(x, z),\ \forall x, y, z \in X$
then s is called a metric similarity measure.
Between sets
Let $D_i \subset X,\ i = 1, \ldots, k$, and $U = \{D_1, \ldots, D_k\}$.
A proximity measure $\wp$ on U is a function
$\wp : U \times U \to \mathbb{R}$
A dissimilarity measure has to satisfy the relations of a
dissimilarity measure between vectors, with the $D_i$'s used
in place of x, y (similarly for similarity measures).
PROXIMITY MEASURES BETWEEN VECTORS
Real-valued vectors
Dissimilarity measures (DMs)
• Weighted $l_p$ metric DMs
$d_p(x, y) = \left( \sum_{i=1}^{l} w_i |x_i - y_i|^p \right)^{1/p}$
Interesting instances are obtained for
– p = 1 (weighted Manhattan norm)
– p = 2 (weighted Euclidean norm)
– p = ∞ ($d_\infty(x, y) = \max_{1 \le i \le l} w_i |x_i - y_i|$)
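A minimal Python sketch of the weighted $l_p$ metric, covering the three instances above; the function name is illustrative:

```python
import numpy as np

def weighted_lp(x, y, w, p):
    """Weighted l_p dissimilarity d_p(x, y); p may be a positive number or np.inf."""
    x, y, w = np.asarray(x, float), np.asarray(y, float), np.asarray(w, float)
    if np.isinf(p):
        # d_inf(x, y) = max_i w_i |x_i - y_i|
        return float(np.max(w * np.abs(x - y)))
    # (sum_i w_i |x_i - y_i|^p)^(1/p)
    return float(np.sum(w * np.abs(x - y) ** p) ** (1.0 / p))

x, y, w = [0.0, 0.0], [3.0, 4.0], [1.0, 1.0]
print(weighted_lp(x, y, w, 1))       # 7.0 (Manhattan)
print(weighted_lp(x, y, w, 2))       # 5.0 (Euclidean)
print(weighted_lp(x, y, w, np.inf))  # 4.0
```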
• Other measures
– $d_G(x, y) = -\log_{10} \left( 1 - \frac{1}{l} \sum_{j=1}^{l} \frac{|x_j - y_j|}{b_j - a_j} \right)$
where $b_j$ and $a_j$ are the maximum and the minimum
values of the j-th feature, among the vectors of X
(dependence on the current data set)
– $d_Q(x, y) = \sqrt{ \frac{1}{l} \sum_{j=1}^{l} \left( \frac{x_j - y_j}{x_j + y_j} \right)^2 }$
Similarity measures
• Inner product
$s_{inner}(x, y) = x^T y = \sum_{i=1}^{l} x_i y_i$
• Tanimoto measure
$s_T(x, y) = \dfrac{x^T y}{\|x\|^2 + \|y\|^2 - x^T y}$
• $s_T(x, y) = 1 - \dfrac{d_2(x, y)}{\|x\| + \|y\|}$
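The Tanimoto measure for real-valued vectors can be sketched directly from the formula above (function name is ours):

```python
import numpy as np

def tanimoto(x, y):
    """s_T(x, y) = x^T y / (||x||^2 + ||y||^2 - x^T y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xy = float(np.dot(x, y))
    return xy / (np.dot(x, x) + np.dot(y, y) - xy)

print(tanimoto([1.0, 2.0], [1.0, 2.0]))  # 1.0 for identical vectors
print(tanimoto([1.0, 0.0], [0.0, 1.0]))  # 0.0 for orthogonal vectors
```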
Discrete-valued vectors
Let $F = \{0, 1, \ldots, k-1\}$ be a set of symbols and $X = \{x_1, \ldots, x_N\} \subset F^l$
Let $A(x, y) = [a_{ij}],\ i, j = 0, 1, \ldots, k-1$, where $a_{ij}$ is the number of places where
x has the i-th symbol and y has the j-th symbol.
NOTE: $\sum_{i=0}^{k-1} \sum_{j=0}^{k-1} a_{ij} = l$
Several proximity measures can be expressed as combinations of the
elements of A(x, y).
Dissimilarity measures:
• The Hamming distance (number of places where x and y differ)
$d_H(x, y) = \sum_{i=0}^{k-1} \sum_{j=0,\, j \ne i}^{k-1} a_{ij}$
• The $l_1$ distance
$d_1(x, y) = \sum_{i=1}^{l} |x_i - y_i|$
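A possible implementation of the contingency matrix A(x, y) and the Hamming distance derived from it (the function names and example are ours):

```python
import numpy as np

def contingency(x, y, k):
    """A(x, y): a_ij = number of places where x has symbol i and y has symbol j."""
    A = np.zeros((k, k), dtype=int)
    for xi, yi in zip(x, y):
        A[xi, yi] += 1
    return A

def hamming(x, y, k):
    """d_H(x, y) = sum of the off-diagonal entries of A(x, y)."""
    A = contingency(x, y, k)
    return int(A.sum() - np.trace(A))

x = [0, 1, 2, 1, 0]
y = [0, 2, 2, 1, 1]
print(contingency(x, y, 3).sum())  # 5 == l, as the NOTE above states
print(hamming(x, y, 3))            # 2 places differ
```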
Similarity measures:
• Tanimoto measure:
$s_T(x, y) = \dfrac{\sum_{i=1}^{k-1} a_{ii}}{n_x + n_y - \sum_{i=1}^{k-1} \sum_{j=1}^{k-1} a_{ij}}$
where $n_x = \sum_{i=1}^{k-1} \sum_{j=0}^{k-1} a_{ij}$ and $n_y = \sum_{i=0}^{k-1} \sum_{j=1}^{k-1} a_{ij}$
• Measures that exclude $a_{00}$: $\sum_{i=1}^{k-1} a_{ii} / l$ and $\sum_{i=1}^{k-1} a_{ii} / (l - a_{00})$
• Measures that include $a_{00}$: $\sum_{i=0}^{k-1} a_{ii} / l$
Mixed-valued vectors
Some of the coordinates of the vectors x are real and the rest are
discrete.
Methods for measuring the proximity between two such xi and xj:
Adopt a proximity measure (PM) suitable for real-valued vectors.
Convert the real-valued features to discrete ones and employ a
discrete PM.
The more general case of mixed-valued vectors:
Here nominal, ordinal, interval-scaled, ratio-scaled features are
treated separately.
The similarity function between $x_i$ and $x_j$ is:
$s(x_i, x_j) = \dfrac{\sum_{q=1}^{l} s_q(x_i, x_j)}{\sum_{q=1}^{l} w_q}$
In the above definition:
• $w_q = 0$ if at least one of the q-th coordinates of $x_i$ and $x_j$ is
undefined, or if both q-th coordinates are equal to 0;
otherwise $w_q = 1$.
• If the q-th coordinates are binary, sq(xi,xj)=1 if xiq=xjq=1 and 0
otherwise.
• If the q-th coordinates are nominal or ordinal, sq(xi,xj)=1 if xiq=xjq
and 0 otherwise.
• If the q-th coordinates are interval or ratio scaled-valued
$s_q(x_i, x_j) = 1 - |x_{iq} - x_{jq}| / r_q$,
where $r_q$ is the length of the interval in which the q-th coordinates
of the vectors of the data set X lie.
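The per-coordinate rules above can be sketched as follows; the type labels, function name, and example data are illustrative assumptions, and the $w_q = 0$ rule is applied exactly as stated above:

```python
def mixed_similarity(xi, xj, types, ranges=None):
    """s(x_i, x_j) = sum_q s_q(x_i, x_j) / sum_q w_q for mixed-valued vectors.

    types[q] in {"binary", "nominal", "interval"}; ranges[q] = r_q for
    interval/ratio scaled coordinates. None marks an undefined coordinate.
    """
    num = den = 0.0
    for q, t in enumerate(types):
        a, b = xi[q], xj[q]
        if a is None or b is None or (a == 0 and b == 0):
            continue  # w_q = 0: undefined coordinate, or both coordinates are 0
        den += 1.0    # w_q = 1
        if t == "binary":
            num += 1.0 if a == b == 1 else 0.0
        elif t == "nominal":          # also covers ordinal coordinates
            num += 1.0 if a == b else 0.0
        else:                          # interval or ratio scaled
            num += 1.0 - abs(a - b) / ranges[q]
    return num / den

xi = [1, "red", 20.0]
xj = [1, "blue", 30.0]
s = mixed_similarity(xi, xj, ["binary", "nominal", "interval"], ranges={2: 50.0})
print(round(s, 3))  # (1 + 0 + 0.8) / 3 = 0.6
```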
Fuzzy measures
Let $x, y \in [0, 1]^l$. Here the value of the i-th coordinate, $x_i$, of x is
not the outcome of a measuring device.
The closer the coordinate xi is to 1 (0), the more likely the
vector x possesses (does not possess) the i-th characteristic.
As xi approaches 0.5, the certainty about the possession or
not of the i-th feature from x decreases.
A possible similarity measure that can quantify the above is:
$s(x_i, y_i) = \max(\min(1 - x_i, 1 - y_i), \min(x_i, y_i))$
Then
$s_F^q(x, y) = \left( \sum_{i=1}^{l} s(x_i, y_i)^q \right)^{1/q}$
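A short sketch of $s$ and $s_F^q$ in Python (the function name is ours):

```python
import numpy as np

def fuzzy_similarity(x, y, q=1):
    """s_F^q(x, y) = (sum_i s(x_i, y_i)^q)^(1/q), with
    s(x_i, y_i) = max(min(1 - x_i, 1 - y_i), min(x_i, y_i))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    s = np.maximum(np.minimum(1 - x, 1 - y), np.minimum(x, y))
    return float(np.sum(s ** q) ** (1.0 / q))

# Per-coordinate similarity is highest when both vectors agree strongly
# (both near 0 or both near 1), lowest around maximum uncertainty (0.5):
print(fuzzy_similarity([1.0, 0.0], [1.0, 0.0]))  # 2.0 (= l, full agreement)
print(fuzzy_similarity([0.5, 0.5], [0.5, 0.5]))  # 1.0 (each coordinate contributes 0.5)
```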
Missing data
For some vectors of the data set X, some feature values are unknown.
Ways to face the problem:
Discard all vectors with missing values (not recommended for small
data sets)
Find the mean value mi of the available i-th feature values over that
data set and substitute the missing i-th feature values with mi.
Define $b_i = 0$ if both the i-th features $x_i$, $y_i$ are available, and $b_i = 1$
otherwise. Then
$\wp(x, y) = \dfrac{l}{l - \sum_{i=1}^{l} b_i} \sum_{\text{all } i:\ b_i = 0} \wp(x_i, y_i)$
where $\wp(x_i, y_i)$ denotes the PM between the two scalars $x_i$, $y_i$.
Find the average proximities $\wp_{avg}(i)$ between all feature vectors in X
along all components. Then
$\wp(x, y) = \sum_{i=1}^{l} \psi(x_i, y_i)$
where $\psi(x_i, y_i) = \wp(x_i, y_i)$ if both $x_i$ and $y_i$ are available, and
$\psi(x_i, y_i) = \wp_{avg}(i)$ otherwise.
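The rescaling approach with the $b_i$ can be sketched as follows; NaN marks a missing value, and the scalar proximity (absolute difference) is an illustrative choice, not one fixed by the slides:

```python
import numpy as np

def proximity_missing(x, y, pm=lambda a, b: abs(a - b)):
    """Proximity with missing values (NaN), rescaled by l / (l - sum_i b_i).

    pm is the scalar proximity between two coordinates.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    l = x.size
    available = ~(np.isnan(x) | np.isnan(y))   # b_i = 0 where both are present
    n_missing = int(l - available.sum())       # sum_i b_i
    total = sum(pm(a, b) for a, b in zip(x[available], y[available]))
    return l / (l - n_missing) * total

x = [1.0, np.nan, 3.0, 4.0]
y = [2.0, 5.0, np.nan, 4.0]
print(proximity_missing(x, y))  # 4/2 * (|1-2| + |4-4|) = 2.0
```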
PROXIMITY FUNCTIONS BETWEEN A
VECTOR AND A SET
Let $X = \{x_1, x_2, \ldots, x_N\}$ and $C \subset X$, $x \in X$
All points of C contribute to the definition of $\wp(x, C)$
Max proximity function
$\wp^{ps}_{max}(x, C) = \max_{y \in C} \wp(x, y)$
Min proximity function
$\wp^{ps}_{min}(x, C) = \min_{y \in C} \wp(x, y)$
Average proximity function
$\wp^{ps}_{avg}(x, C) = \dfrac{1}{n_C} \sum_{y \in C} \wp(x, y)$
($n_C$ is the cardinality of C)
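The three vector-to-set proximity functions, sketched with an illustrative scalar dissimilarity (names and data are ours):

```python
def psi_max(x, C, pm):
    return max(pm(x, y) for y in C)

def psi_min(x, C, pm):
    return min(pm(x, y) for y in C)

def psi_avg(x, C, pm):
    return sum(pm(x, y) for y in C) / len(C)   # n_C = |C|

d = lambda a, b: abs(a - b)   # illustrative dissimilarity on scalars
C = [1.0, 2.0, 6.0]
print(psi_max(3.0, C, d))  # 3.0
print(psi_min(3.0, C, d))  # 1.0
print(psi_avg(3.0, C, d))  # (2 + 1 + 3) / 3 = 2.0
```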
A representative of C, $r_C$, contributes to the definition of
$\wp(x, C)$
In this case: $\wp(x, C) = \wp(x, r_C)$
Typical representatives are:
The mean vector:
$m_p = \dfrac{1}{n_C} \sum_{y \in C} y$
where $n_C$ is the cardinality of C
The mean center $m_C \in C$:
$\sum_{y \in C} d(m_C, y) \le \sum_{y \in C} d(z, y),\ \forall z \in C$
where d is a dissimilarity measure
The median center $m_{med} \in C$:
$\mathrm{med}(d(m_{med}, y) \mid y \in C) \le \mathrm{med}(d(z, y) \mid y \in C),\ \forall z \in C$
NOTE: Other representatives (e.g., hyperplanes, hyperspheres) are
useful in certain applications (e.g., object identification using
clustering techniques).
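The three representatives can be sketched as follows (the dissimilarity d and the data are illustrative choices):

```python
import numpy as np

def mean_vector(C):
    """m_p = (1 / n_C) * sum of the points of C; need not belong to C."""
    return np.mean(C, axis=0)

def mean_center(C, d):
    """Point of C minimizing the sum of dissimilarities to all points of C."""
    return min(C, key=lambda z: sum(d(z, y) for y in C))

def median_center(C, d):
    """Point of C minimizing the median of dissimilarities to all points of C."""
    return min(C, key=lambda z: np.median([d(z, y) for y in C]))

d = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
C = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([10.0, 0.0])]
print(mean_vector(C))     # the mean vector; it need not be a point of C
print(mean_center(C, d))  # [1. 0.] -- always a point of C
```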
PROXIMITY FUNCTIONS BETWEEN SETS
Let $X = \{x_1, \ldots, x_N\}$, $D_i, D_j \subset X$, and $n_i = |D_i|$, $n_j = |D_j|$
All points of each set contribute to $\wp(D_i, D_j)$
Max proximity function (a measure but not a metric; only if $\wp$ is a
similarity measure)
$\wp^{ss}_{max}(D_i, D_j) = \max_{x \in D_i,\, y \in D_j} \wp(x, y)$
Min proximity function (a measure but not a metric; only if $\wp$ is a
dissimilarity measure)
$\wp^{ss}_{min}(D_i, D_j) = \min_{x \in D_i,\, y \in D_j} \wp(x, y)$
Average proximity function (not a measure, even if $\wp$ is a
measure)
$\wp^{ss}_{avg}(D_i, D_j) = \dfrac{1}{n_i n_j} \sum_{x \in D_i} \sum_{y \in D_j} \wp(x, y)$
Each set $D_i$ is represented by its representative vector $m_i$
Mean proximity function (it is a measure provided that $\wp$ is a
measure):
$\wp^{ss}_{mean}(D_i, D_j) = \wp(m_i, m_j)$
$\wp^{ss}_{e}(D_i, D_j) = \sqrt{\dfrac{n_i n_j}{n_i + n_j}}\ \wp(m_i, m_j)$
NOTE: Proximity functions between a vector x and a set C may be
derived from the above functions if we set Di={x}.
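A sketch of the set-to-set proximity functions above, with an illustrative Euclidean dissimilarity (function names are ours):

```python
import numpy as np

def pss_max(Di, Dj, pm):
    return max(pm(x, y) for x in Di for y in Dj)

def pss_min(Di, Dj, pm):
    return min(pm(x, y) for x in Di for y in Dj)

def pss_avg(Di, Dj, pm):
    return sum(pm(x, y) for x in Di for y in Dj) / (len(Di) * len(Dj))

def pss_mean(Di, Dj, pm):
    """Representative-based proximity: pm between the two mean vectors."""
    return pm(np.mean(Di, axis=0), np.mean(Dj, axis=0))

d = lambda a, b: float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))
Di = [[0.0], [2.0]]
Dj = [[5.0], [7.0]]
print(pss_min(Di, Dj, d))   # 3.0
print(pss_max(Di, Dj, d))   # 7.0
print(pss_avg(Di, Dj, d))   # (5 + 7 + 3 + 5) / 4 = 5.0
print(pss_mean(Di, Dj, d))  # d(1.0, 6.0) = 5.0
```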
Remarks:
• Different choices of proximity functions between sets may
lead to totally different clustering results.
• Different proximity measures between vectors in the same
proximity function between sets may lead to totally different
clustering results.
• The only way to achieve a proper clustering is
by trial and error, taking into account the opinion
of an expert in the field of application.