Transcript Part 4

Clustering methods: Part 4
Clustering Cost function
Pasi Fränti
29.4.2014
Speech and Image Processing Unit
School of Computing
University of Eastern Finland
Data types
• Numeric
• Binary
• Categorical
• Text
• Time series
Part I:
Numeric data
Distance measures

Type      Possible operations   Example variable   Example values
Nominal   ==                    Major subject      Computer science, Mathematics, Physics
Ordinal   ==, <, >              Degree             Bachelor, Master, Licentiate, Doctor
Interval  ==, <, >, -           Temperature        10 °C, 20 °C, 10 °F
Ratio     ==, <, >, -, /        Weight             0 kg, 10 kg, 20 kg
Definition of distance metric
A distance function is metric if the following conditions
are met for all data points x, y, z:
• All distances are non-negative:
d(x, y) ≥ 0
• Distance to point itself is zero:
d(x, x) = 0
• All distances are symmetric:
d(x, y) = d(y, x)
• Triangular inequality:
d(x, y) ≤ d(x, z) + d(z, y)
Common distance metrics

Two vectors Xi = (xi1, xi2, …, xip) and Xj = (xj1, xj2, …, xjp), distance dij = ?

• Minkowski distance:
  $d(i,j) = \left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \dots + |x_{ip}-x_{jp}|^q \right)^{1/q}$

• Euclidean distance (q = 2):
  $d(i,j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \dots + (x_{ip}-x_{jp})^2}$

• Manhattan distance (q = 1):
  $d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \dots + |x_{ip}-x_{jp}|$
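To make the formulas above concrete, here is a minimal Python sketch of the Minkowski family (function names are my own, not from the slides); the Chebyshev distance introduced a couple of slides below is included as the q → ∞ limit.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def euclidean(x, y):
    return minkowski(x, y, 2)            # q = 2

def manhattan(x, y):
    return minkowski(x, y, 1)            # q = 1

def chebyshev(x, y):
    """Limit q -> infinity: the largest single attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x1, x2 = (2, 8), (6, 3)                  # the 2-D example used on the next slide
print(euclidean(x1, x2))                 # sqrt(41) ≈ 6.40
print(manhattan(x1, x2))                 # 9
print(chebyshev(x1, x2))                 # 5
```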
Distance metrics example

2-D example: X1 = (2, 8), X2 = (6, 3)

Euclidean distance:
$d(1,2) = \sqrt{(2-6)^2 + (8-3)^2} = \sqrt{41}$

Manhattan distance:
$d(1,2) = |2-6| + |8-3| = 9$
Chebyshev distance

In the case q → ∞, the distance equals the maximum difference of the
attributes. Useful if the worst case must be avoided:

$d_\infty(X,Y) = \lim_{q \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q} = \max\{ |x_1 - y_1|, |x_2 - y_2|, \dots, |x_n - y_n| \}$

Example:
$d_\infty((2,8),(6,3)) = \max\{ |2-6|, |8-3| \} = \max\{4, 5\} = 5$
Hierarchical clustering
Cost functions

Single link: the smallest distance between vectors in clusters i and j:
$d(C_i, C_j) = \min_{x_i \in C_i,\, x_j \in C_j} d(x_i, x_j)$

Complete link: the largest distance between vectors in clusters i and j:
$d(C_i, C_j) = \max_{x_i \in C_i,\, x_j \in C_j} d(x_i, x_j)$

Average link: the average distance between vectors in clusters i and j:
$d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x_i \in C_i} \sum_{x_j \in C_j} d(x_i, x_j)$

[Figure: illustrations of single link, complete link and average link.]
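A minimal sketch of the three linkage costs (my own function names), assuming clusters are given as lists of points and Euclidean point distance:

```python
import math
from itertools import product

def d(p, q):
    return math.dist(p, q)               # Euclidean distance between two points

def single_link(Ci, Cj):
    return min(d(x, y) for x, y in product(Ci, Cj))

def complete_link(Ci, Cj):
    return max(d(x, y) for x, y in product(Ci, Cj))

def average_link(Ci, Cj):
    return sum(d(x, y) for x, y in product(Ci, Cj)) / (len(Ci) * len(Cj))

Ci, Cj = [(0, 0), (0, 1)], [(3, 0), (5, 0)]
print(single_link(Ci, Cj))               # 3.0
print(complete_link(Ci, Cj))             # ≈ 5.10
print(average_link(Ci, Cj))              # ≈ 4.07
```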
Cost function example
[Theodoridis, Koutroumbas, 2006]

[Figure: data set x1…x7 (values 1, 1.1, 1.2, 1.3, 1.4, 1.5 shown) and the resulting single-link and complete-link dendrograms.]
Part II:
Binary data
Hamming distance
(Binary and categorical data)

• Number of differing attribute values.
• Distance of (1011101) and (1001001) is 2.
• Distance of (2143896) and (2233796) is 3.
• Distance between (toned) and (roses) is 3.

3-bit binary cube:
100 → 011 has distance 3 (red path)
010 → 111 has distance 2 (blue path)
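A minimal sketch of the Hamming distance for equal-length sequences, reproducing the examples above:

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(ai != bi for ai, bi in zip(a, b))

print(hamming("1011101", "1001001"))     # 2
print(hamming("2143896", "2233796"))     # 3
print(hamming("toned", "roses"))         # 3
print(hamming("100", "011"))             # 3 (red path in the cube)
print(hamming("010", "111"))             # 2 (blue path in the cube)
```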
Hard thresholding of centroid

Soft centroid: (0.40, 0.60, 0.75, 0.20, 0.45, 0.25)
Thresholding at 0.5 gives the hard (binary) centroid: (0, 1, 1, 0, 0, 0)
Hard and soft centroids
Bridge (binary version)

[Figure: distortion (1.25–1.60) as a function of iterations (0–10), showing the convergence of the binary (hard) centroids and of the soft centroids.]
Distance and distortion

General distance function:
$L_d(x_i, c_j) = \left( \sum_{k=1}^{K} |x_{ik} - c_{jk}|^d \right)^{1/d}$

Distortion function:
$E(X, C) = \frac{1}{N} \sum_{i=1}^{N} L_d(x_i, c_{p_i})$
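A small sketch of the general L_d distance and the distortion E(X, C) just defined, assuming the partition is given as a list p of centroid indices (variable names are mine):

```python
def L_d(x, c, d=2):
    """General distance: (sum_k |x_k - c_k|^d)^(1/d)."""
    return sum(abs(xk - ck) ** d for xk, ck in zip(x, c)) ** (1.0 / d)

def distortion(X, C, p, d=2):
    """E(X, C): average L_d distance of each vector to its assigned centroid."""
    return sum(L_d(X[i], C[p[i]], d) for i in range(len(X))) / len(X)

X = [(0, 0), (1, 0), (9, 9), (10, 10)]   # toy data
C = [(0.5, 0.0), (9.5, 9.5)]             # two centroids
p = [0, 0, 1, 1]                         # cluster index of each vector
print(distortion(X, C, p, d=2))          # ≈ 0.60
```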
Distortion for binary data

Cost of a single attribute:
$D_{jk} = q_{jk} \, c_{jk}^{\,d} + r_{jk} \, (1 - c_{jk})^{d}$

The number of zeroes is q_jk, the number of ones is r_jk, and c_jk is the
current centroid value for variable k of group j.
Optimal centroid position

Optimal centroid position depends on the metric.

Given parameter:
$\alpha = 1/(d-1)$

The optimal position is:
$c_{jk} = \frac{r_{jk}^{\,\alpha}}{q_{jk}^{\,\alpha} + r_{jk}^{\,\alpha}}$
Example of centroid location

[Figure: optimal centroid value c (0.50–1.00) as a function of d (1.0–6.0) for q=10/r=90, q=20/r=80, q=30/r=70 and q=40/r=60.]
Centroid location (q = 75, r = 25)

d = ∞ (α = 0):   c = 0.5
d = 3 (α = 0.5): c = 0.37
d = 2 (α = 1):   c = 0.25
d = 1:           c = 0
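A quick sketch (my own code) that reproduces the centroid locations above for q = 75, r = 25 from the formula c = r^α / (q^α + r^α) with α = 1/(d−1):

```python
def optimal_centroid(q, r, d):
    """Optimal soft centroid value for a binary attribute with q zeroes and r ones."""
    if d == float("inf"):
        return 0.5                       # alpha = 0
    if d == 1:                           # alpha -> infinity: winner takes all
        return 0.5 if q == r else (1.0 if r > q else 0.0)
    a = 1.0 / (d - 1)
    return r ** a / (q ** a + r ** a)

for d in (float("inf"), 3, 2, 1):
    print(d, round(optimal_centroid(75, 25, d), 2))
# inf -> 0.5,  3 -> 0.37,  2 -> 0.25,  1 -> 0.0
```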
Part III:
Categorical data
Categorical clustering
Three attributes

                    director    actor     genre
t1 (Godfather II)   Coppola     De Niro   Crime
t2 (Good Fellas)    Scorsese    De Niro   Crime
t3 (Vertigo)        Hitchcock   Stewart   Thriller
t4 (N by NW)        Hitchcock   Grant     Thriller
Categorical clustering
Sample 2-d data: color and shape
[Figure: three candidate clusterings of the data, Model A, Model B and Model C.]
Categorical clustering
Methods:
• k-modes
• k-medoids
• k-distributions
• k-histograms
• k-populations
• k-representatives
Histogram-based methods:
Entropy-based cost functions
Category utility:
Entropy of data set:
Entropies of the clusters relative to the data:
Iterative algorithms
K-modes clustering
Distance function
K-modes clustering
Prototype of cluster
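The distance and prototype definitions on the k-modes slides were figures that did not survive the transcript. As a reminder (k-modes, Huang 1998; see the literature list), the distance is the number of mismatching attribute values and the prototype is the per-attribute mode; a minimal sketch with my own function names:

```python
from collections import Counter

def mismatch_distance(x, y):
    """k-modes distance: number of attributes whose categories differ."""
    return sum(xa != ya for xa, ya in zip(x, y))

def mode_prototype(cluster):
    """k-modes prototype: the most frequent category of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))

cluster = [("Coppola", "De Niro", "Crime"),
           ("Scorsese", "De Niro", "Crime")]
print(mode_prototype(cluster))           # ties broken by first occurrence
print(mismatch_distance(("Hitchcock", "Stewart", "Thriller"),
                        ("Hitchcock", "Grant", "Thriller")))   # 1
```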
K-medoids clustering
Prototype of cluster

The prototype is the vector with minimal total distance to every other vector in the cluster.

Example with three vectors:
  A C E
  B C F
  B D G
Pairwise distances: d(ACE, BCF) = 2, d(ACE, BDG) = 3, d(BCF, BDG) = 2
Total distances: 2+3=5, 2+2=4, 2+3=5
Medoid: B C F
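A minimal sketch of medoid selection that reproduces the example above, using the number of differing letters as the distance:

```python
def distance(a, b):
    return sum(x != y for x, y in zip(a, b))   # differing positions

def medoid(cluster):
    """Vector with the minimal total distance to all other vectors."""
    return min(cluster, key=lambda v: sum(distance(v, w) for w in cluster))

cluster = ["ACE", "BCF", "BDG"]
print(medoid(cluster))                   # BCF (total distance 2 + 2 = 4)
```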
K-medoids
Example
K-medoids
Calculation
K-histograms
The prototype stores the relative frequency of each category per attribute (e.g. D: 2/3, F: 1/3).
K-distributions
Cost function with ε addition
Example of cluster allocation
Change of entropy
Problem of non-convergence
Non-convergence
Results with Census dataset
Part IV:
Text data
Applications of text clustering
• Query relaxation
• Spell-checking
• Automatic categorization
• Document similarity
Query relaxation
Current solution: matching suffixes from the database
Alternate solution: from semantic clustering
Spell-checking
The word kahvila (café) shown with one correct and two incorrect spellings.
Automatic categorization
Category by clustering
String clustering
• The similarity between every string pair is calculated as a basis for determining the clusters.
• A similarity measure is required to calculate the similarity between two strings:
  approximate string matching, or semantic similarity.
Document clustering
Motivation:
– Group related documents based on their content
– No predefined training set (taxonomy)
– Generate a taxonomy at runtime
Clustering Process:
– Data preprocessing: stop-word removal, stemming,
feature extraction and lexical analysis
– Define cost function
– Perform clustering
Exact string matching
• Given a text string T of length n and a pattern
string P of length m, the exact string matching
problem is to find all occurrences of P in T.
• Example: T = “AGCTTGA”, P = “GCT”
• Applications:
– Searching keywords in a file
– Search engines (like Google)
– Database searching
Approximate string matching
Determine whether a text string T of length n and a pattern string P of length m partially match.
– Consider the string “approximate”. Which of these are partial matches?
  aproximate approximately appropriate proximate approx approximat apropos approxximate
– A partial match can be thought of as one that has k differences from the
  string, where k is some small integer (for instance 1 or 2).
– A difference occurs if string1.charAt(j) != string2.charAt(j), or if
  string1.charAt(j) does not appear in string2 (or vice versa).
  • The former case is known as a revise difference, the latter is a delete or insert difference.
  • What about two characters that appear out of position? For instance, approximate vs. apporximate?
Approximate string matching

Examples: Keanu Reeves, Samuel Jackson, Arnold Schwarzenegger,
H. Norman Schwarfkopf, Schwarrzenger, Bernard Schwartz, …

Query errors:
• Limited knowledge about data
• Typos
• Limited input device (cell phone) input

Data errors:
• Typos
• Web data
• OCR

Similarity functions:
• Edit distance
• Q-gram
• Cosine
Edit distance
Levenshtein distance
• Given two strings T and P, the edit distance is the minimum number of
substitutions, insertions and deletions that transform the string T into P.
• Time complexity by dynamic programming: O(nm)
Edit distance (1974)

Dynamic programming table for T = “temp”, P = “tmp”:

         ""   t   m   p
    ""    0   1   2   3
    t     1   0   1   2
    e     2   1   1   2
    m     3   2   1   2
    p     4   3   2   1

Dynamic programming:
m[i][j] = min{ m[i-1][j]+1, m[i][j-1]+1, m[i-1][j-1]+d(i,j) }
d(i,j) = 0 if T[i] = P[j], d(i,j) = 1 otherwise
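The table above can be reproduced with the standard O(nm) dynamic program; a minimal sketch:

```python
def edit_distance(T, P):
    n, m = len(T), len(P)
    # dp[i][j] = edit distance between the prefixes T[:i] and P[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                     # delete all of T[:i]
    for j in range(m + 1):
        dp[0][j] = j                     # insert all of P[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if T[i - 1] == P[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[n][m]

print(edit_distance("temp", "tmp"))      # 1
```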
Q-grams

• Substrings of fixed length q, e.g. the 2-grams of “bingo”: #b, bi, in, ng, go, o#.
• If ed(T, P) <= k, then
  # of common grams >= # of T grams – k * q
Q-grams
T = “bingo”, P = “going”
gram1 = {#b, bi, in, ng, go, o#}
gram2 = {#g, go, oi, in, ng, g#}
Total(gram1, gram2) = {#b,bi,in,ng,go,o#,#g, go,oi,in,ng,g#}
|common terms difference|= sum{1,1,0,0,0,1,1,0,1,0,0,1}
gram1.length = (T.length + (q - 1) * 2 + 1) – q
gram2.length = (P.length + (q - 1) * 2 + 1) - q
L = gram1.length + gram2.length=12
Similarity = (L- |common terms difference| )/ L =0.5
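A sketch of the q-gram similarity used in this example, assuming the strings are padded with q−1 '#' characters on each side (function names are mine):

```python
from collections import Counter

def qgrams(s, q=2, pad="#"):
    s = pad * (q - 1) + s + pad * (q - 1)
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_similarity(T, P, q=2):
    g1, g2 = qgrams(T, q), qgrams(P, q)
    total = sum(g1.values()) + sum(g2.values())
    # grams that the two multisets do not share (symmetric difference)
    diff = sum(((g1 - g2) + (g2 - g1)).values())
    return (total - diff) / total

print(qgram_similarity("bingo", "going"))    # 0.5, as in the example above
```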
Cosine similarity
• For two vectors A and B, the angle θ between them is expressed using the
dot product and magnitudes: cos θ = (A · B) / (‖A‖ ‖B‖).
• Implementation:
Cosine similarity = (Common Terms) / (sqrt(Number of terms in String1) * sqrt(Number of terms in String2))
Cosine similarity
T = “bingo right”, P = “going right”
T1 = {bingo right}, P1 = {going right}
L1 = unique(T1).length;
L2 = unique(P1).length;
Unique(T1&P1) = {bingo right going}
L3 = Unique(T1&P1) .length;
Common terms = (L1+L2)-L3;
Similarity = common terms / (sqrt(L1)*sqrt(L2))
Dice coefficient
• Similar to cosine similarity.
• Dice’s coefficient = (2 * Common Terms) / (Number of terms in String1 + Number of terms in String2)
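A small sketch of the term-based cosine and Dice similarities described above, treating each string as a set of whitespace-separated terms:

```python
import math

def cosine_similarity(T, P):
    a, b = set(T.split()), set(P.split())
    return len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))

def dice_coefficient(T, P):
    a, b = set(T.split()), set(P.split())
    return 2 * len(a & b) / (len(a) + len(b))

print(cosine_similarity("bingo right", "going right"))   # 0.5
print(dice_coefficient("bingo right", "going right"))    # 0.5
```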
Similarities for sample data

Compared strings                            Edit distance  Q-gram Q=2  Q-gram Q=3  Q-gram Q=4  Cosine
Pizza Express Café /
  Pizza Express                             72%            79%         74%         70%         82%
Lounasravintola Pinja Ky – ravintoloita /
  Lounasravintola Pinja                     54%            68%         67%         65%         63%
Kioski Piirakkapaja /
  Kioski Marttakahvio (different)           47%            45%         33%         32%         50%
Kauppa Kulta Keidas /
  Kauppa Kulta Nalle (different)            68%            67%         63%         60%         67%
Ravintola Beer Stop Pub /
  Baari, Beer Stop R-kylä                   39%            42%         36%         31%         50%
Ravintola Foxie s /
  Bar Foxie Karsikko                        31%            25%         15%         12%         24%
Play baari /
  Ravintola Bar Play – Ravintoloita         21%            31%         17%         8%          32%
Thesaurus-based WordNet
WordNet
An extensive lexical network for the English language
• Contains over 138,838 words.
• Several graphs, one for each part-of-speech.
• Synsets (synonym sets), each defining a semantic sense.
• Relationship information (antonym, hyponym, meronym …)
• Downloadable for free (UNIX, Windows)
• Expanding to other languages (Global WordNet Association)
• Funded >$3 million, mainly government (translation interest)
• Founder George Miller, National Medal of Science, 1991.
[Figure: WordNet synonym and antonym relations among wet, moist, watery, damp and dry, parched, anhydrous, arid.]
Example of WordNet
[Figure: WordNet hypernym hierarchy from object and artifact through instrumentality, conveyance/transport and vehicle down to wheeled vehicle (car/auto, bike/bicycle, truck), and through ware, article and cutlery/eating utensil down to fork.]
Examples of probabilities
Entity (40%)
  Inanimate-object (17%)
    Natural-object (1.6%)
      Geological-formation (0.17%)
        Natural-elevation (0.011%)
          Hill (0.0019%)
        Shore (0.008%)
          Coast (0.002%)
Hierarchical clustering by WordNet
Performance of WordNet
Word pair              Human      Edge counting measures   Information content measures
                       judgment   Path      WUP            LIN       Jiang&Conrath
Car – Automobile       3.92       1         1              1         1
Gem – Jewel            3.84       1         1              1         1
Journey – Voyage       3.84       0.97      0.92           0.84      0.88
Boy – Lad              3.76       0.97      0.93           0.86      0.88
Coast – Shore          3.70       0.97      0.91           0.98      0.99
Asylum – Madhouse      3.61       0.97      0.94           0.97      0.97
Magician – Wizard      3.50       1         1              1         1
Midday – Noon          3.42       1         1              1         1
Furnace – Stove        3.11       0.81      0.46           0.23      0.39
Food – Fruit           3.08       0.81      0.22           0.13      0.63
Bird – Cock            3.05       0.97      0.94           0.6       0.73
Bird – Crane           2.97       0.92      0.84           0.6       0.73
Part V:
Time series
Clustering of time-series
Dynamic Time Warping (DTW)

Align two time-series by minimizing the distance of the aligned observations.
Solve by dynamic programming!
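A minimal sketch of DTW by dynamic programming, assuming two 1-D sequences and absolute difference as the local distance (my own code, not from the slides):

```python
def dtw(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed alignments
            D[i][j] = cost + min(D[i - 1][j],       # b[j] aligned again
                                 D[i][j - 1],       # a[i] aligned again
                                 D[i - 1][j - 1])   # both advance
    return D[n][m]

s1 = [0, 1, 2, 3, 2, 1, 0]
s2 = [0, 0, 1, 2, 3, 2, 1, 0]            # same shape, shifted in time
print(dtw(s1, s2))                       # 0.0: warping aligns the series exactly
```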
Example of DTW
Prototype of a cluster
• The sequence c that minimizes E(S_j, c) is called a Steiner sequence.
• A good approximation to the Steiner problem is to use the medoid of the cluster (discrete median).
• The medoid is the time-series in the cluster that minimizes E(S_j, c).
Calculating the prototype
• Can be solved by dynamic programming.
• Complexity is exponential in the number of time-series in a cluster.
Averaging heuristic
• Calculate the medoid sequence.
• Calculate warping paths from the medoid to all other time series in the cluster.
• The new prototype is the average sequence over the warping paths.
Local search heuristics
Example of the three methods
E(S) = 159, E(S) = 138, E(S) = 118
• LS provides a better fit in terms of the Steiner cost function.
• It cannot modify the sequence length during the iterations. In datasets with
varying lengths it might provide a better fit, but non-sensitive prototypes.
Experiments
Part VI:
Other clustering problems
Clustering of GPS trajectories
Density clusters
[Figure: density clusters on a map: swim hall, walking street, market place, science park, shop, and homes of users.]
Image segmentation
Objects of different colors
Literature
1. S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 2nd edition, 2006.
2. P. Fränti and T. Kaukoranta, "Binary vector quantizer design using soft centroids", Signal Processing: Image Communication, 14 (9), 677-681, 1999.
3. I. Kärkkäinen and P. Fränti, "Variable metric for binary vector quantization", IEEE Int. Conf. on Image Processing (ICIP'04), Singapore, vol. 3, 3499-3502, October 2004.
Literature
Modified k-modes + k-histograms:
M. Ng, M.J. Li, J.Z. Huang and Z. He, "On the impact of dissimilarity measure in k-modes clustering algorithm", IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), 503-507, March 2007.
ACE:
K. Chen and L. Liu, "The 'best k' for entropy-based categorical data clustering", Int. Conf. on Scientific and Statistical Database Management (SSDBM'2005), pp. 253-262, Berkeley, USA, 2005.
ROCK:
S. Guha, R. Rastogi and K. Shim, "ROCK: A robust clustering algorithm for categorical attributes", Information Systems, Vol. 25, No. 5, pp. 345-366, 2000.
K-medoids:
L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, 1990.
K-modes:
Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, Vol. 2, No. 3, pp. 283-304, 1998.
K-distributions:
Z. Cai, D. Wang and L. Jiang, "K-Distributions: A new algorithm for clustering categorical data", Int. Conf. on Intelligent Computing (ICIC 2007), pp. 436-443, Qingdao, China, 2007.
K-histograms:
Z. He, X. Xu, S. Deng and B. Dong, "K-Histograms: An efficient clustering algorithm for categorical dataset", CoRR, abs/cs/0509033, http://arxiv.org/abs/cs/0509033, 2005.