Clustering under Constraints with Genetic Algorithms

by
Albert Ali Salah
Stanislav Redman
Gabriella Kovacs
Outline
• Definition of the problem
• Background on genetic algorithms
• Case study: Workgroup assignment
• Results
Clustering under Constraints
• N multi-dimensional data items
• A bunch of soft constraints
• (A bunch of hard constraints)
• The problem: Clustering the data
points so that the hard constraints
are satisfied, and the soft constraints
are optimized.
Constrained Clustering
• Constrained clustering is an unsupervised
learning technique, where some data
items are known to be in the same
cluster, and some are known to be in
different clusters.
• Clustering under constraints is an
optimization problem (I saw Karp in the
elevator, and he said it’s probably NP-complete)
Genetic Algorithms
• A GA is essentially a heuristic
random search tool
• Has no rigorous mathematical
principle, no one knows why it
works
• Used frequently in soft constraint
optimization, rarely in clustering
Details You All Know
• Solutions are ‘coded’ into simple, DNA-like structures called chromosomes
• A fitness function is supplied to evaluate
the quality of solutions
• The algorithm works on a population of
individuals
• There is a Genetic Algorithm package
written for the object-oriented Dolphin
Smalltalk environment
Genetic Algorithm Flowchart
• Initial Population
• End Criteria Reached?
– Yes: Output Best Individual
– No: Selection, Cross-over, Mutation, New Population, then check the end criteria again
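The flowchart can be sketched as a short loop. This is a minimal illustration, not the Dolphin Smalltalk package mentioned earlier: the name `run_ga`, binary tournament selection, the fixed generation count as end criterion, and keeping the best individual seen so far are all assumptions. The default parameter values are the ones listed on the GA parameters slide.

```python
import random


def run_ga(fitness, n_bits, pop_size=100, generations=30,
           p_crossover=0.4, p_mutation=0.001, seed=0):
    """Hypothetical GA loop over bit-list chromosomes; returns the best one seen."""
    rng = random.Random(seed)

    # Initial population: random bit strings.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def select(population):
        # Binary tournament (an assumption; the slides do not name a scheme).
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):  # end criterion: fixed number of generations
        new_pop = []
        while len(new_pop) < pop_size:
            mum, dad = select(pop)[:], select(pop)[:]
            # Single-point crossover.
            if rng.random() < p_crossover:
                cut = rng.randrange(1, n_bits)
                mum, dad = mum[:cut] + dad[cut:], dad[:cut] + mum[cut:]
            # Bit-flip mutation, applied independently to every bit.
            for child in (mum, dad):
                for i in range(n_bits):
                    if rng.random() < p_mutation:
                        child[i] ^= 1
                new_pop.append(child)
        pop = new_pop[:pop_size]
        best = max(pop + [best], key=fitness)
    return best
```

For example, `run_ga(lambda c: sum(c) / len(c), n_bits=24)` searches for the all-ones string; any fitness function returning a value between 0 and 1 can be plugged in the same way.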
Case Study: Santa Fe
• Aim: Cluster people such that:
– Groups are balanced in number of
students
– Each group consists of people with
similar interests
– Each group has some people with basic
skills
– Each group possesses enough
knowledge in its areas of interest
Problem 1: Representation
• A good GA representation is:
– unambiguous
– short (k bits means a search space of size 2^k)
– smooth with respect to fitness
landscape
– robust to mutations
– free of preferential bias
– simple to decode
Representation
• 01101001010010101001010…
– Three bits code the group number
– The position indicates the student number (1, 2, 3, 4, …)
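Decoding this representation takes only a few lines. The sketch below assumes the chromosome is given as a string of '0' and '1' characters and that the three bits are read as an ordinary binary number, so groups are numbered 0 to 7; the function name `decode` and that numbering are illustrative choices, not taken from the slides.

```python
def decode(chromosome: str, bits_per_student: int = 3) -> list[int]:
    """Return the group number of each student, in student order."""
    if len(chromosome) % bits_per_student:
        raise ValueError("chromosome length must be a multiple of bits_per_student")
    return [
        int(chromosome[i:i + bits_per_student], 2)  # binary block -> group number
        for i in range(0, len(chromosome), bits_per_student)
    ]


# First twelve bits of the example string: students 1 to 4.
print(decode("011010010100"))  # → [3, 2, 2, 4]
```

With three bits per student every bit pattern is a valid group number, so cross-over and mutation never produce an undecodable chromosome.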
Problem 2: Fitness
• A good fitness function is:
– between 0 (awful) and 1 (optimal)
– a correct ordering of individuals with
respect to their closeness to the optimal
solution
– informative, and indicative of relative
fitness
– pragmatic about the boundary
conditions
– simple and fast to calculate
Composite Fitness
• Assume there are n different, possibly
independent fitness criteria. Let f_1, f_2, …, f_n be the
individual fitness functions that order the
solutions according to individual criteria. The total
fitness function is
$$F = \sum_{i=1}^{n} \alpha_i f_i$$
where α_i are coefficients to be determined
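A composite fitness of this kind is straightforward to compute. The sketch below assumes the total is a weighted sum of the individual terms and, when no coefficients are supplied, uses equal coefficients that sum to 1; the name `total_fitness` and that default are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence


def total_fitness(solution,
                  criteria: Sequence[Callable[[object], float]],
                  coefficients: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of the individual fitness terms for one solution."""
    if coefficients is None:
        # Assumed default: equal coefficients summing to 1.
        coefficients = [1.0 / len(criteria)] * len(criteria)
    if len(coefficients) != len(criteria):
        raise ValueError("need exactly one coefficient per criterion")
    return sum(a * f(solution) for a, f in zip(coefficients, criteria))
```

If every term lies between 0 and 1 and the coefficients are non-negative and sum to 1, the total also stays between 0 and 1.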
f1 : Interest Term
N : number of students
M : number of groups
S : number of interests
p_i : interest vector of student i
g_j : mean interest vector of group j
δ_ij : Kronecker delta (1 if student i is in group j, 0 otherwise)
$$f_1 = \frac{9SN - \sum_{i=1}^{N}\sum_{j=1}^{M} (p_i - g_j)^2\,\delta_{ij}}{9SN}$$
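As a rough sketch of the interest term: each student's squared distance to the mean interest vector of their own group is summed and normalized by 9SN. The code assumes interests are rated from 1 to 4 (so 9 is the largest squared difference in one dimension), that groups are numbered from 0, and that `(p_i - g_j)^2` means the squared Euclidean distance; the function name is hypothetical.

```python
def interest_fitness(interests, assignment, n_groups, max_sq_diff=9.0):
    """f1: 1 minus the normalized squared distance of students to group means."""
    n_students = len(interests)
    n_interests = len(interests[0])

    # Mean interest vector g_j of every group.
    sums = [[0.0] * n_interests for _ in range(n_groups)]
    counts = [0] * n_groups
    for p, j in zip(interests, assignment):
        counts[j] += 1
        for k, value in enumerate(p):
            sums[j][k] += value
    means = [[v / c for v in row] if c else row for row, c in zip(sums, counts)]

    # Squared distance of each student to the mean of their own group.
    total = sum(
        (value - means[j][k]) ** 2
        for p, j in zip(interests, assignment)
        for k, value in enumerate(p)
    )
    norm = max_sq_diff * n_interests * n_students  # 9SN
    return (norm - total) / norm


# Two students with opposite ratings forced into one group.
print(interest_fitness([[1, 1], [4, 4]], [0, 0], n_groups=1))  # → 0.75
```

Even this worst-case pairing scores 0.75, which illustrates how generous the 9SN normalization is.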
Problem with f1
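• 9SN is too large a normalization factor: all decent
individuals (with small distances from the mean)
will have f1 very close to 1.
• General solution: replace
$$\frac{\mathrm{maxdist} - \sum \mathrm{dist}}{\mathrm{maxdist}}$$
with
$$f_1 = 0.8^{\,z \cdot \mathrm{average\_dist}}, \qquad \mathrm{average\_dist} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} (p_i - g_j)^2\,\delta_{ij}}{SN}$$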
f2 : Balance Term
N : number of students
M : number of groups
n_j : number of students in group j
$$f_2 = \frac{N^2 - \sum_{j=1}^{M} \left(n_j - \frac{N}{M}\right)^2}{N^2}$$
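The balance term depends only on the group sizes, so it can be checked directly against the results reported later in the talk. The sketch below assumes f2 is N² minus the summed squared deviation of each group size from N/M, divided by N²; the function name is hypothetical, and the group sizes are counted from the nearest neighbor and GA group listings.

```python
def balance_fitness(group_sizes):
    """f2: penalizes squared deviation of each group size from N / M."""
    n = sum(group_sizes)   # N, number of students
    m = len(group_sizes)   # M, number of groups
    deviation = sum((n_j - n / m) ** 2 for n_j in group_sizes)
    return (n * n - deviation) / (n * n)


# Nearest neighbor result: one group of 45 and seven singletons.
print(round(balance_fitness([45, 1, 1, 1, 1, 1, 1, 1]), 8))   # → 0.37352071
# GA result: group sizes 5, 9, 10, 7, 6, 6, 4, 5.
print(round(balance_fitness([5, 9, 10, 7, 6, 6, 4, 5]), 9))   # → 0.988905325
```

Both values agree with the first number on the corresponding FITNESS TERMS lines.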
f3 : Basic Skills Term
M : number of groups
B : number of basic skills
b_ik : kth skill of student i
δ_ij : Kronecker delta
$$f_3 = \frac{9MB - \sum_{j=1}^{M}\sum_{k=1}^{B} \left(4 - \max_i (b_{ik}\,\delta_{ij})\right)^2}{9MB}$$
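A sketch of the basic skills term follows. For each group and each basic skill it takes the best skill level among the group's members and penalizes the squared gap to the top level. The code assumes skill levels run from 1 to 4 and groups are numbered from 0; the function name is hypothetical.

```python
def basic_skills_fitness(skills, assignment, n_groups, best_level=4):
    """f3: rewards groups whose best member is strong in every basic skill."""
    n_skills = len(skills[0])
    penalty = 0
    for j in range(n_groups):
        for k in range(n_skills):
            # Best level of skill k inside group j. An empty group yields 0,
            # which is what max_i(b_ik * delta_ij) gives when all deltas are 0.
            best = max(
                (b[k] for b, g in zip(skills, assignment) if g == j),
                default=0,
            )
            penalty += (best_level - best) ** 2
    norm = (best_level - 1) ** 2 * n_groups * n_skills  # 9MB for levels 1..4
    return (norm - penalty) / norm


# One expert alone in group 0, one beginner alone in group 1, a single skill.
print(basic_skills_fitness([[4], [1]], [0, 1], n_groups=2))  # → 0.5
```

Note that an empty group contributes a gap of 4 rather than 3, so the term can fall below 0 when groups are left empty.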
f4 : Knowledge Term
M : number of groups
S : number of interests
h_ik : kth knowledge term of student i
δ_ij : Kronecker delta
χ_jk : 1 if kth interest term is among the first
three interests of group j, 0 otherwise.
$$f_4 = \frac{27M - \sum_{j=1}^{M}\sum_{k=1}^{S} \left(4 - \max_i (h_{ik}\,\delta_{ij}\,\chi_{jk})\right)^2}{27M}$$
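The knowledge term can be sketched along the same lines. This version makes one interpretive assumption that should be checked against the original implementation: only the three highest-ranked interests of each group contribute a penalty, which is what the 27M = 9 · 3 · M normalization suggests. Levels from 1 to 4, groups numbered from 0, ranking by summed interest, and the function name are also assumptions.

```python
def knowledge_fitness(knowledge, interests, assignment, n_groups,
                      top=3, best_level=4):
    """f4: rewards groups with an expert in each of their top interests."""
    n_interests = len(interests[0])
    penalty = 0
    for j in range(n_groups):
        members = [i for i, g in enumerate(assignment) if g == j]
        # Rank interests by the group's summed rating (same order as the mean).
        totals = [sum(interests[i][k] for i in members) for k in range(n_interests)]
        top_interests = sorted(range(n_interests), key=lambda k: -totals[k])[:top]
        for k in top_interests:
            # Best knowledge level in the group for this interest (0 if empty).
            best = max((knowledge[i][k] for i in members), default=0)
            penalty += (best_level - best) ** 2
    norm = (best_level - 1) ** 2 * top * n_groups  # 27M for levels 1..4, top 3
    return (norm - penalty) / norm
```

For a single student whose three favourite topics are also the ones they know best (level 4), the function returns 1.0; if they know nothing beyond level 1 about those topics, it returns 0.0.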
GA parameters
• Population size: 100
• Generations: 30
• Crossover probability: 0.4 (single
point)
• Mutation probability: 0.001
• Equal coefficients
Some entertaining facts about the dataset

Basic skills   Average   Experts   Beginners
Mathematics    2.83      9         4
Programming    2.75      14        11
English        3.10      19        1
Statistics     2.87      8         1
Interests
[Bar chart: one bar per interest, value axis from 0.0 to 3.5. Interests in chart order: Quantum Consciousness, Anthropology, Philosophy, Neuroscience, Psychology, Social Networks, Physics, Cognitive Science, Optimization, Economics, Information Theory, Neural Nets & Simulation, Biology, Evolution, Multi-Agent Systems, Computer Science, Self-organization]
Knowledge
[Bar chart: one bar per knowledge area, value axis from 0.0 to 3.0. Areas in chart order: Quantum Consciousness, Anthropology, Neuroscience, Social Networks, Psychology, Cognitive Science, Philosophy, Economics, Information Theory, Self-organization, Biology, Neural Nets & Simulation, Multi-Agent Systems, Optimization, Physics, Evolution, Computer Science]
TOP 10 knowledge-seeking people
Irina
Anton
Mourad
Zoltan
Anukool
Angel
Lyudmila
Mianlai
Aaron
Arthur
TOP 10 knowledgeable people
Anton
Louise
Arndt
Angel
Suzanne
Mark
Nilanjana
Wojciech
Albert
Aaron
Some serious results
Clustering of interest vectors with
• Nearest neighbor
• Furthest neighbor
• Average linkage
• Ward linkage
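The four methods differ only in how the distance between two clusters r and s is defined. The sketch below implements those inter-cluster distances as they are defined on the following slides, assuming `dist` is the Euclidean distance and that the Ward formula uses cluster centroids; the function names are illustrative. It is not a full agglomerative clustering routine.

```python
import math


def nearest_neighbor(r, s):
    """Smallest pairwise distance between the two clusters."""
    return min(math.dist(x, y) for x in r for y in s)


def furthest_neighbor(r, s):
    """Largest pairwise distance between the two clusters."""
    return max(math.dist(x, y) for x in r for y in s)


def average_linkage(r, s):
    """Mean of all pairwise distances between the two clusters."""
    return sum(math.dist(x, y) for x in r for y in s) / (len(r) * len(s))


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]


def ward_linkage(r, s):
    """Size-weighted squared distance between the cluster centroids."""
    n_r, n_s = len(r), len(s)
    return n_r * n_s * math.dist(centroid(r), centroid(s)) ** 2 / (n_r + n_s)
```

At each step an agglomerative method merges the two clusters with the smallest d(r, s), so the choice of definition determines whether large chains (nearest neighbor) or compact groups (furthest neighbor, Ward) tend to form.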
Nearest neighbor
$$d(r,s) = \min\big(\mathrm{dist}(x_{ri}, x_{sj})\big), \quad i \in (1,\dots,n_r),\ j \in (1,\dots,n_s)$$
FITNESS TERMS: 0,37352071 0,847012823 0,722222222 0,916006652
GROUP 1: Natalia, Nilanjana, Angel, Arndt, Alexander, Wojciech, Frederic, Jason,
Gerard, Ferenc, Sergey, Milica, Zoltan, Bartlomiej, Aaron, Pau, Sergey,
Jasper, Matthew, Mark, Eva, Volodymyr, Victor, Oleksiy, Anukool, Hilary,
Lyudmila, Alex, Vaclav, Anton, Mourad, Nicholas, Arthur, Carolyn,
Stanislav, Denis, Suzanne, Albert, Lisa, Vadim, Pavel, Sergiy, Valentin,
Mianlai, Gordan
Interests: Self-organization (2,98) Evolution (2,8) Computer Science (2,78)
GROUP 2: Louise
Interests: Anthropology (4) Biology (4) Cognitive Science (4)
GROUP 3: Tatyana
Interests: Cognitive Science (4) Computer Science (4) Information Theory (4)
GROUP 4: Gabriella
Interests: Computer Science (4) Information Theory (4) Optimization (4)
GROUP 5: Ana-Maria
Interests: Social Networks (4) Cognitive Science (3) Multi-Agent Systems (3)
GROUP 6: Angelica
Interests: Cognitive Science (4) Computer Science (4) Multi-Agent Systems (4)
GROUP 7: Christophe
Interests: Cognitive Science (4) Neural Nets & Simulation (4) Psychology (4)
GROUP 8: Irina
Interests: Cognitive Science (4) Computer Science (4) Information Theory (4)
Furthest neighbor
$$d(r,s) = \max\big(\mathrm{dist}(x_{ri}, x_{sj})\big), \quad i \in (1,\dots,n_r),\ j \in (1,\dots,n_s)$$
FITNESS TERMS: 0,926035503 0,887127441 0,958333333 0,964728892
GROUP 1: Hilary, Angel, Mark, Mourad, Jason
Interests: Psychology (3,8) Evolution (3,6) Anthropology (3,2)
GROUP 2: Bartlomiej, Louise, Alexander, Matthew, Valentin, Angelica, Victor
Interests: Evolution (3,43) Multi-Agent Systems (3,29) Social Networks (3,29)
GROUP 3: Suzanne, Aaron, Alex, Arndt, Wojciech
Interests: Evolution (3,57) Biology (3,29) Self-organization (3,14)
GROUP 4: Lisa, Gerard
Interests: Social Networks (4) Cognitive Science (3) Multi-Agent Systems (3)
GROUP 5: Sergiy, Albert, Christophe
Interests: Information Theory (2,625) Physics (2,625) Self-organization (2,625)
GROUP 6: Natalia, Nilanjana, Lyudmila, Vaclav, Anton, Frederic, Arthur, Ferenc, Stanislav,
Milica, Denis, Sergey, Jasper, Pavel, Mianlai, Volodymyr, Gabriella, Oleksiy,
Anukool
Interests: Cognitive Science (4) Computer Science (4) Multi-Agent Systems (4)
GROUP 7: Pau, Vadim, Ana-Maria, Eva, Nicholas, Sergey, Gordan
Interests: Cognitive Science (3,33) Neural Nets & Simulation (3,33) Biology (3)
GROUP 8: Irina, Zoltan, Tatyana, Carolyn
Interests: Quantum Consciousness (3,75) Cognitive Science (3,5) Computer Science (3,5)
Average linkage
$$d(r,s) = \frac{1}{n_r n_s}\sum_{i=1}^{n_r}\sum_{j=1}^{n_s} \mathrm{dist}(x_{ri}, x_{sj})$$
FITNESS TERMS: 0,821745562 0,879219281 0,902777778 0,951247491
GROUP 1: Natalia, Nilanjana, Angel, Wojciech, Frederic, Jason, Ferenc, Milica, Aaron,
Sergey, Jasper, Mark, Volodymyr, Gabriella, Oleksiy, Hilary, Lyudmila, Vaclav,
Anton, Mourad, Arthur, Stanislav, Denis, Suzanne, Pavel, Mianlai
Interests: Self-organization (3,15) Multi-Agent Systems (3,04) Computer Science (3)
GROUP 2: Anukool
Interests: Computer Science (4) Neuroscience (4) Optimization (4)
GROUP 3: Bartlomiej, Lisa, Alexander, Matthew, Valentin, Gerard, Victor
Interests: Evolution (3,57) Biology (3,29) Self-organization (3,14)
GROUP 4: Ana-Maria
Interests: Social Networks (4) Cognitive Science (3) Multi-Agent Systems (3)
GROUP 5: Pau, Alex, Arndt, Vadim, Eva, Nicholas, Sergey, Gordan
Interests: Information Theory (2,625) Physics (2,625) Self-organization (2,625)
GROUP 6: Angelica, Louise
Interests: Cognitive Science (4) Computer Science (4) Multi-Agent Systems (4)
GROUP 7: Sergiy, Albert, Christophe
Interests: Cognitive Science (3,33) Neural Nets & Simulation (3,333) Biology (3)
GROUP 8: Irina, Zoltan, Tatyana, Carolyn
Interests: Quantum Consciousness (3,75) Cognitive Science (3,5) Computer Science (3,5)
Ward linkage
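$$d(r,s) = \frac{n_r\, n_s\, \mathrm{dist}(\bar{x}_r, \bar{x}_s)^2}{n_r + n_s}$$
where \bar{x}_r and \bar{x}_s are the centroids of clusters r and s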
FITNESS TERMS: 0,968195266 0,891630074 0,972222222 0,965034915
GROUP 1: Lisa, Alex, Arndt, Frederic, Gerard
Interests: Self-organization (3,6) Biology (3,4) Evolution (3,4)
GROUP 2: Pau, Vadim, Ana-Maria, Eva, Nicholas, Sergey, Gabriella, Gordan
Interests: Physics (2,625) Self-organization (2,625) Computer Science (2,5)
GROUP 3: Bartlomiej, Matthew, Valentin, Alexander
Interests: Economics (3,25) Evolution (3,25) Biology (3)
GROUP 4: Louise, Mianlai, Volodymyr, Victor, Angelica
Interests: Computer Science (4) Multi-Agent Systems (4) Self-organization (3,8)
GROUP 5: Sergiy, Albert, Christophe
Interests: Cognitive Science (3,33) Neural Nets & Simulation (3,33) Biology (3)
GROUP 6: Stanislav, Natalia, Denis, Sergey, Vaclav, Anton, Pavel, Ferenc, Milica, Oleksiy
Interests: Computer Science (3,4) Neural Nets & Simulation (3,4) Economics (3,3)
GROUP 7: Irina, Zoltan, Tatyana, Carolyn
Interests: Quantum Consciousness (3,75) Cognitive Science (3,5) Computer Science (3,5)
GROUP 8: Hilary, Lyudmila, Nilanjana, Angel, Wojciech, Mourad, Jason, Arthur, Suzanne,
Aaron, Jasper, Mark, Anukool
Interests: Biology (3,38) Evolution (3,38) Self-organization (3,23)
Genetic algorithm
FITNESS TERMS: 0,988905325 0,845403674 0,989583333 0,981469795
GROUP 1
Self-organization (4) Neural Nets & Simulation (3,6) Physics (3,4)
Arndt, Tatyana, Mianlai, Sergey, Zoltan
GROUP 2
Computer Science (2,56) Neural Nets & Simulation (2,56) Evolution (2,44)
Denis, Pau, Alex, Ana-Maria, Lisa, Vadim, Sergiy, Eva, Milica
GROUP 3
Computer Science (3,1) Multi-Agent Systems (3,1) Self-organization (2,9)
Stanislav, Natalia, Nilanjana, Gordan, Mourad, Gerard, Ferenc, Victor, Valentin, Oleksiy
GROUP 4
Self-organization (3,43) Evolution (3,14) Psychology (3)
Suzanne, Lyudmila, Angel, Wojciech, Mark, Anton, Nicholas
GROUP 5
Cognitive Science (3) Biology (2,83) Evolution (2,67)
Christophe, Aaron, Hilary, Albert, Alexander, Frederic
GROUP 6
Economics (3,33) Self-organization (3) Computer Science (2,67)
Bartlomiej, Sergey, Jasper, Vaclav, Pavel, Gabriella
GROUP 7
Biology (3,75) Evolution (3,5) Self-organization (3,5)
Matthew, Angelica, Louise, Arthur
GROUP 8
Computer Science (3,2) Information Theory (3,2) Philosophy (3,2)
Anukool, Irina, Jason, Volodymyr, Carolyn
Comparison of results

Method               Balance   Interests   Basic Skills   Knowledge
Nearest Neighbour    0,37      0,85        0,72           0,92
Furthest Neighbour   0,93      0,89        0,96           0,96
Average Linkage      0,82      0,88        0,90           0,95
Ward Linkage         0,97      0,89        0,97           0,97
GA                   0,99      0,85        0,99           0,98
GOOD BYE, CSSS 2002