Transcript HCS Clustering Algorithm - Computer Science @ UC Davis
HCS Clustering Algorithm
A Clustering Algorithm Based on Graph Connectivity
Presentation Outline
• The Problem
• HCS Algorithm Overview
  – Main Players
  – General Algorithm
  – Properties
  – Improvements
• Conclusion
ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle
The Problem
• Clustering:
  – Group elements into subsets based on similarity between pairs of elements
• Requirements:
  – Elements in the same cluster are highly similar to each other
  – Elements in different clusters have low similarity to each other
• Challenges:
  – Large sets of data
  – Inaccurate and noisy measurements
HCS Algorithm Overview
• Highly Connected Subgraphs Algorithm
  – Uses graph theoretic techniques
• Basic Idea
  – Uses similarity information to construct a similarity graph
  – Groups elements that are highly connected with each other
HCS: Main Players
• Similarity Graph
  – Nodes correspond to elements (genes)
  – Edges connect similar elements (those whose similarity value is above some threshold)
[Figure: similarity graph on gene 1, gene 2, and gene 3. Gene 1 is similar to gene 2, gene 1 is similar to gene 3, and gene 2 is similar to gene 3.]
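As a rough illustration of the similarity graph, the sketch below builds an adjacency structure from pairwise similarity scores. The scores, the threshold of 0.5, the extra "gene 4", and the name `build_similarity_graph` are all assumptions made for the example; the slides do not give concrete values.

```python
def build_similarity_graph(similarity, threshold):
    """Return an adjacency dict mapping each element to its similar elements."""
    elements = sorted({e for pair in similarity for e in pair})
    graph = {e: set() for e in elements}
    for (a, b), score in similarity.items():
        if score > threshold:  # an edge exists only above the threshold
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical pairwise similarity scores (illustrative values only).
similarity = {
    ("gene 1", "gene 2"): 0.9,
    ("gene 1", "gene 3"): 0.8,
    ("gene 2", "gene 3"): 0.7,
    ("gene 1", "gene 4"): 0.2,
    ("gene 2", "gene 4"): 0.1,
    ("gene 3", "gene 4"): 0.3,
}
graph = build_similarity_graph(similarity, threshold=0.5)
for gene, neighbours in graph.items():
    print(gene, sorted(neighbours))
```

With these made-up scores, genes 1 to 3 form a triangle as in the slide's figure, while gene 4 ends up with no edges.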
HCS: Main Players
• Edge Connectivity
  – Minimum number of edges whose removal results in a disconnected graph
[Figure: graph on gene 1, gene 2, gene 3, and gene 4]
  – Must remove 3 edges to disconnect the graph, thus it has an edge connectivity k(G) = 3
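The definition of edge connectivity can be checked directly on a tiny graph by trying every set of edges to remove. The sketch below does exactly that; it is an exhaustive search intended only for illustration, and the function names are assumptions. The test graph is four mutually connected nodes, the only simple graph on four nodes with edge connectivity 3.

```python
from itertools import combinations


def is_connected(nodes, edges):
    """Depth-first search over an undirected edge list."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacency = {v: set() for v in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for w in adjacency[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(nodes)


def edge_connectivity(nodes, edges):
    """Smallest number of edges whose removal disconnects the graph.

    Exponential in the number of edges, so only suitable for tiny examples.
    """
    edges = list(edges)
    for k in range(len(edges) + 1):
        for removed in combinations(edges, k):
            remaining = [e for e in edges if e not in removed]
            if not is_connected(nodes, remaining):
                return k
    return None  # a single node cannot be disconnected


genes = [1, 2, 3, 4]
all_pairs = list(combinations(genes, 2))  # every gene linked to every other
print(edge_connectivity(genes, all_pairs))  # 3
```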
HCS: Main Players
• Highly Connected Subgraphs
  – Subgraphs whose edge connectivity exceeds half the number of nodes
[Figure: graph on gene 1 through gene 8, with a 5-node subgraph highlighted]
  – Entire graph: nodes = 8, edge connectivity = 1. Not an HCS, since 1 does not exceed 8 ÷ 2.
  – Subgraph: nodes = 5, edge connectivity = 3. An HCS, since 3 exceeds 5 ÷ 2.
HCS: Main Players
• Cut
  – A set of edges whose removal disconnects the graph
[Figure: graph on gene 1 through gene 8 with a cut marked]
HCS: Main Players
• Minimum Cut
  – A cut with a minimum number of edges
[Figure: graph on gene 1 through gene 8 with minimum cuts marked]
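To make the difference between a cut and a minimum cut concrete, the sketch below enumerates every way of splitting the nodes into two non-empty sides and keeps the split with the fewest crossing edges. This is a brute-force illustration under assumed names and an assumed example graph (two triangles joined by one edge), not the method used in the slides; practical implementations rely on a polynomial-time minimum cut algorithm.

```python
from itertools import combinations


def minimum_cut_bruteforce(nodes, edges):
    """Return the smallest set of edges separating the nodes into two sides."""
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    # Keep `first` on one side so that each bipartition is tried exactly once;
    # k stops short of len(rest) so the other side is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            side = {first, *extra}
            cut = [(u, v) for u, v in edges if (u in side) != (v in side)]
            if best is None or len(cut) < len(best):
                best = cut
    return best


# Hypothetical graph: triangles {1, 2, 3} and {4, 5, 6} joined by edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
print(minimum_cut_bruteforce(range(1, 7), edges))  # [(3, 4)]
```

Removing edges (1, 2) and (1, 3) is also a cut, because it isolates node 1, but it uses two edges, so the single joining edge is the minimum cut here.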
HCS: Algorithm (by example)
[Figure sequence: the algorithm applied step by step to a similarity graph on nodes 1 through 12]
• Find and remove a minimum cut
• Are the resulting subgraphs highly connected? One of them is, and it becomes Cluster 1
• Repeat the process on non-highly connected subgraphs
• Find and remove a minimum cut
• Are the resulting subgraphs highly connected? Both of them are
• Resulting clusters: Cluster 1, Cluster 2, and Cluster 3
HCS: Algorithm
    HCS( G ) {
        MINCUT( G ) = { H_1, …, H_t }
        for each H_i, i = [ 1, t ] {
            if k( H_i ) > n ÷ 2
                return H_i
            else
                HCS( H_i )
        }
    }
• MINCUT( G ): Find a minimum cut in graph G. This returns a set of subgraphs { H_1, …, H_t } resulting from the removal of the cut set.
• for each H_i: For each subgraph…
• if k( H_i ) > n ÷ 2: If the subgraph is highly connected, then return that subgraph as a cluster. (Note: k( H_i ) denotes the edge connectivity of graph H_i, and n denotes the number of nodes in H_i.)
• else HCS( H_i ): Otherwise, repeat the algorithm on the subgraph (recursive function). This continues until there are no more subgraphs, and all clusters have been found.
• Running time is bounded by 2N × f( n, m ), where N is the number of clusters found and f( n, m ) is the time complexity of computing a minimum cut in a graph with n nodes and m edges.
• A deterministic minimum cut computation takes O(nm) steps, where n is the number of nodes and m is the number of edges.
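The recursion can be sketched in Python as follows. This is one possible reading rather than the presenter's implementation: it tests each graph before cutting it (so the input graph itself may be returned as a cluster), it reports single nodes separately as singletons, and it uses the Stoer-Wagner algorithm for the minimum cut, which is a choice made here and not named in the slides. The names `from_edges`, `minimum_cut`, `induced`, and `hcs`, and the example graph, are assumptions.

```python
from itertools import combinations


def from_edges(edges):
    """Build an adjacency dict (node -> set of neighbours) from an edge list."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def minimum_cut(graph):
    """Stoer-Wagner global minimum cut; returns (cut size, nodes on one side)."""
    weights = {u: {v: 1 for v in nbrs} for u, nbrs in graph.items()}
    groups = {u: {u} for u in graph}  # original nodes merged into each node
    best_size, best_side = float("inf"), set()
    while len(weights) > 1:
        # One phase: grow an ordering by always adding the node most
        # strongly attached to the nodes already added.
        start = next(iter(weights))
        added = [start]
        attach = {v: weights[start].get(v, 0) for v in weights if v != start}
        while attach:
            nxt = max(attach, key=attach.get)
            cut_of_phase = attach.pop(nxt)
            added.append(nxt)
            for v, w in weights[nxt].items():
                if v in attach:
                    attach[v] += w
        s, t = added[-2], added[-1]
        if cut_of_phase < best_size:
            best_size, best_side = cut_of_phase, set(groups[t])
        # Merge the last node t into the second-to-last node s.
        groups[s] |= groups.pop(t)
        for v, w in weights.pop(t).items():
            del weights[v][t]
            if v != s:
                weights[s][v] = weights[s].get(v, 0) + w
                weights[v][s] = weights[s][v]
    return best_size, best_side


def induced(graph, nodes):
    """Subgraph containing only `nodes` and the edges between them."""
    return {u: graph[u] & nodes for u in nodes}


def hcs(graph):
    """Return (clusters, singletons) for an undirected, unweighted graph."""
    if len(graph) <= 1:
        return [], list(graph)
    cut_size, side = minimum_cut(graph)
    if cut_size > len(graph) / 2:  # highly connected: report as a cluster
        return [set(graph)], []
    clusters, singletons = [], []
    for part in (side, set(graph) - side):  # otherwise recurse on both sides
        found, single = hcs(induced(graph, part))
        clusters += found
        singletons += single
    return clusters, singletons


# Hypothetical graph: two 4-cliques joined by edge (4, 5), plus node 9
# hanging off node 8.
edges = list(combinations([1, 2, 3, 4], 2)) + list(combinations([5, 6, 7, 8], 2))
edges += [(4, 5), (8, 9)]
clusters, singletons = hcs(from_edges(edges))
print(sorted(sorted(c) for c in clusters), singletons)
```

On this assumed graph the two 4-cliques come out as clusters and node 9 as a singleton, regardless of which of the two single-edge minimum cuts is removed first.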
HCS: Properties
• Homogeneity
  – Each cluster has a diameter of at most 2
    • Distance is the minimum length path between two nodes, determined by the number of EDGES traveled between nodes
    • Diameter is the longest distance in the graph
  – Each cluster is at least half as dense as a clique
    • Clique is a graph with maximum possible edge connectivity
[Figure: example graph on nodes a through f with Dist( a, d ) = 2, Dist( a, e ) = 3, and Diam( G ) = 4, shown next to a clique]
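Distance and diameter as defined above can be computed with a breadth-first search from every node. The sketch below is a minimal illustration; the two example graphs are assumptions (the slide's figure cannot be reconstructed from the transcript): a five-node path, and five nodes with every pair linked except a and e, so that every node has at least 3 neighbours.

```python
from collections import deque
from itertools import combinations


def from_edges(edges):
    """Build an adjacency dict (node -> set of neighbours) from an edge list."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def distances_from(graph, source):
    """Number of edges on a shortest path from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(graph):
    """Longest shortest-path distance; assumes the graph is connected."""
    return max(max(distances_from(graph, u).values()) for u in graph)


path = from_edges([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
dense = from_edges(
    [(u, v) for u, v in combinations("abcde", 2) if (u, v) != ("a", "e")]
)
print(diameter(path))   # 4
print(diameter(dense))  # 2
```

The dense graph behaves like a cluster in the sense of the homogeneity property: the one missing edge is bridged by a two-edge path, so its diameter is 2.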
HCS: Properties
• Separation
  – Any non-trivial split is unlikely to have diameter of two
  – Number of edges removed by each iteration is linear in the size of the underlying subgraph
    • Compared to quadratic number of edges within final clusters
    • Indicates separation unless sizes are small
    • Does not imply a bound on the number of edges removed overall
HCS: Improvements
[Figure sequence: choosing between cut sets]
HCS: Improvements
• Iterated HCS
  – Sometimes there are multiple minimum cuts to choose from
    • Some cuts may create “singletons”, or nodes that become disconnected from the rest of the graph
  – Performs several iterations of HCS until no new cluster is found (to find best final clusters)
    • Theoretically adds another O(n) factor to running time, but typically only needs 1 to 5 more iterations
HCS: Improvements
• Remove low degree nodes first
  – If a node has low degree, it will likely just be separated from the rest of the graph
  – Calculating separation for those nodes is expensive
  – Removal helps eliminate unnecessary iterations and significantly reduces running time
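One way to read the low-degree improvement is as a pre-processing pass that strips nodes below a degree threshold before any minimum cut is computed. The sketch below repeats the pass until no low-degree node is left, since removing one node can lower the degree of its neighbours. The threshold value, the repeated removal, the function name, and the example graph are assumptions; the slides do not specify them.

```python
def remove_low_degree_nodes(graph, min_degree):
    """Strip nodes with fewer than `min_degree` neighbours, repeatedly.

    `graph` maps each node to a set of neighbours. Returns the pruned graph
    and the removed nodes in removal order; the input is left unchanged.
    """
    graph = {u: set(nbrs) for u, nbrs in graph.items()}  # work on a copy
    removed = []
    low = [u for u in graph if len(graph[u]) < min_degree]
    while low:
        for u in low:
            for v in graph.pop(u):
                if v in graph:  # the neighbour may already have been removed
                    graph[v].discard(u)
            removed.append(u)
        low = [u for u in graph if len(graph[u]) < min_degree]
    return graph, removed


# Hypothetical graph: nodes 1 to 4 are mutually connected, with a chain
# 4 - 5 - 6 attached.
graph = {
    1: {2, 3, 4},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4, 6},
    6: {5},
}
pruned, removed = remove_low_degree_nodes(graph, min_degree=2)
print(sorted(pruned), removed)  # [1, 2, 3, 4] [6, 5]
```

The removed nodes can be reported as singletons without spending a minimum cut computation on each of them.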
Conclusion
• Performance
  – With improvements, can handle problems with up to thousands of elements in reasonable computing time
  – Generates clusters with high homogeneity and separation
  – More robust (responds better when noise is introduced) than other approaches based on connectivity
References
Erez Hartuv and Ron Shamir, “A Clustering Algorithm based on Graph Connectivity,” March 1999 (revised December 1999).
http://www.math.tau.ac.il/~rshamir/papers.html