HCS Clustering Algorithm - Computer Science @ UC Davis

HCS Clustering Algorithm

A Clustering Algorithm Based on Graph Connectivity

Presentation Outline

• The Problem
• HCS Algorithm Overview
  – Main Players
  – General Algorithm
  – Properties
  – Improvements
• Conclusion

ECS289A Modeling Gene Regulation • HCS Clustering Algorithm • Sophie Engle

The Problem

• Clustering:
  – Group elements into subsets based on similarity between pairs of elements
• Requirements:
  – Elements in the same cluster are highly similar to each other
  – Elements in different clusters have low similarity to each other
• Challenges:
  – Large sets of data
  – Inaccurate and noisy measurements

HCS Algorithm Overview

• Highly Connected Subgraphs (HCS) Algorithm
  – Uses graph theoretic techniques
• Basic Idea
  – Uses similarity information to construct a similarity graph
  – Groups elements that are highly connected with each other

HCS: Main Players

• Similarity Graph
  – Nodes correspond to elements (genes)
  – Edges connect similar elements (those whose similarity value is above some threshold)

[Figure: nodes gene 1, gene 2, gene 3. Gene 1 is similar to gene 2, gene 1 is similar to gene 3, and gene 2 is similar to gene 3, so each pair is joined by an edge.]
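A minimal sketch of how a similarity graph might be built from pairwise scores. The function name `similarity_graph`, the score matrix, and the threshold of 0.5 are all made up for illustration; the slides do not specify a similarity measure or a threshold value.

```python
def similarity_graph(names, sim, threshold):
    """Build an undirected graph as a dict: node -> set of neighbouring nodes.

    An edge joins two elements when their similarity is above `threshold`.
    """
    adj = {name: set() for name in names}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if sim[i][j] > threshold:
                adj[names[i]].add(names[j])
                adj[names[j]].add(names[i])
    return adj


genes = ["gene 1", "gene 2", "gene 3", "gene 4"]
# Hypothetical symmetric similarity scores (not taken from the slides).
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
graph = similarity_graph(genes, sim, threshold=0.5)
```

With these invented scores, genes 1 to 3 end up pairwise connected, as in the slide's figure, while gene 4 has no edges.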

HCS: Main Players

• Edge Connectivity
  – Minimum number of edges whose removal results in a disconnected graph

[Figure: graph on gene 1, gene 2, gene 3, gene 4.] Must remove 3 edges to disconnect the graph, thus it has an edge connectivity k(G) = 3
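The definition can be checked directly on small graphs by brute force: try removing 0 edges, then 1, then 2, and so on, until some removal disconnects the graph. The sketch below does exactly that. It assumes the four-gene figure is a complete graph (every pair joined), which is consistent with k(G) = 3 but is not stated on the slide; the function names are illustrative, and the approach is exponential, so it is only meant for tiny examples.

```python
from itertools import combinations


def is_connected(nodes, edges):
    """Depth-first search: True if every node is reachable from the first one."""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(adj[u] - seen)
    return seen == nodes


def edge_connectivity(nodes, edges):
    """Smallest number of edges whose removal disconnects the graph."""
    edges = list(edges)
    for k in range(len(edges) + 1):
        for removed in combinations(edges, k):
            kept = [e for e in edges if e not in removed]
            if not is_connected(nodes, kept):
                return k
    # Only reached for graphs that cannot be disconnected (a single node).
    return len(edges)


nodes = [1, 2, 3, 4]
edges = list(combinations(nodes, 2))  # assumed: all 6 possible edges
k = edge_connectivity(nodes, edges)
```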

HCS: Main Players

• Highly Connected Subgraphs
  – Subgraphs whose edge connectivity exceeds half the number of nodes

[Figure: graph on gene 1 through gene 8.]

Entire Graph: Nodes = 8, Edge connectivity = 1. Not HCS!

HCS: Main Players

• Highly Connected Subgraphs
  – Subgraphs whose edge connectivity exceeds half the number of nodes

[Figure: graph on gene 1 through gene 8.]

Sub Graph: Nodes = 5, Edge connectivity = 3. HCS!
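The two verdicts above follow from a single comparison. A small sketch of that arithmetic, using only the node counts and edge connectivities given on the slides (the function name is illustrative):

```python
def is_highly_connected(num_nodes, edge_connectivity):
    """HCS test: edge connectivity must exceed half the number of nodes."""
    return edge_connectivity > num_nodes / 2


# Entire graph: 8 nodes, edge connectivity 1, and 1 > 4 fails.
entire_graph = is_highly_connected(num_nodes=8, edge_connectivity=1)
# Subgraph: 5 nodes, edge connectivity 3, and 3 > 2.5 holds.
subgraph = is_highly_connected(num_nodes=5, edge_connectivity=3)
```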

HCS: Main Players

• Cut
  – A set of edges whose removal disconnects the graph

[Figure: graph on gene 1 through gene 8.]

HCS: Main Players

• Minimum Cut
  – A cut with a minimum number of edges

[Figure: graph on gene 1 through gene 8.]
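To make the difference between a cut and a minimum cut concrete, the sketch below lists the edges that cross a chosen split of the nodes. The graph (two triangles joined by one edge) and the function name `cut_edges` are invented for illustration and are not the eight-gene graph from the figure.

```python
def cut_edges(edges, side):
    """Edges with exactly one endpoint in `side`.

    Removing them separates `side` from the remaining nodes.
    """
    side = set(side)
    return [(u, v) for u, v in edges if (u in side) != (v in side)]


# Hypothetical graph: triangle 1-2-3 and triangle 4-5-6, joined by edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

cut_a = cut_edges(edges, {1})        # isolates node 1: a cut with 2 edges
cut_b = cut_edges(edges, {1, 2, 3})  # splits the triangles: a cut with 1 edge
```

Both are cuts, but only `cut_b` is a minimum cut here: the graph is connected, so no cut can have fewer than one edge.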

HCS: Algorithm (by example)

[Figure sequence: an example graph on nodes 1 to 12 is clustered step by step.]

• Find and remove a minimum cut
• Are the resulting subgraphs highly connected?
  – One subgraph is marked “Highly Connected!” and becomes Cluster 1
• Repeat process on non-highly connected subgraphs
• Find and remove a minimum cut
• Are the resulting subgraphs highly connected?
  – Both subgraphs are marked “Highly Connected!”
• Resulting clusters: Cluster 1, Cluster 2, Cluster 3

HCS: Algorithm

HCS( G ) {
    MINCUT( G ) = { H_1, …, H_t }
    for each H_i, i = [ 1, t ] {
        if k( H_i ) > n ÷ 2
            return H_i
        else
            HCS( H_i )
    }
}

• MINCUT( G ) = { H_1, …, H_t }
  – Find a minimum cut in graph G. This returns a set of subgraphs { H_1, …, H_t } resulting from the removal of the cut set.
• for each H_i, i = [ 1, t ]
  – For each subgraph…
• if k( H_i ) > n ÷ 2 return H_i
  – If the subgraph is highly connected, then return that subgraph as a cluster. (Note: k( H_i ) denotes edge connectivity of graph H_i, n denotes number of nodes)
• else HCS( H_i )
  – Otherwise, repeat the algorithm on the subgraph (recursive function). This continues until there are no more subgraphs, and all clusters have been found.
• Running time
  – Bounded by 2N × f( n, m ), where N is the number of clusters found, and f( n, m ) is the time complexity of computing a minimum cut in a graph with n nodes and m edges
  – Deterministic minimum cut: takes O(nm) steps, where n is the number of nodes and m is the number of edges
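The recursion can be sketched in Python as follows. This is one possible reading of the pseudocode, not the presenter's implementation, and it makes several assumptions: the minimum cut is found with the Stoer-Wagner algorithm (swapped in because it is short, not the O(nm) method mentioned above); each graph is tested for high connectivity before it is cut; clusters are collected in a list rather than returned one at a time; and single nodes left over are dropped rather than reported as clusters. Graphs are dicts mapping each node to its set of neighbours, and all function names are illustrative.

```python
def stoer_wagner(adj):
    """Global minimum cut of a graph with at least 2 nodes.

    Returns (cut_size, side), where `side` is the node set on one side.
    """
    w = {u: {v: 1 for v in adj[u]} for u in adj}  # edge weights between groups
    members = {u: {u} for u in adj}               # original nodes in each group
    best_size, best_side = float("inf"), None
    while len(w) > 1:
        # One phase: grow a set from `start`, always adding the node most
        # tightly connected to the set; the last node added defines a cut.
        start = next(iter(w))
        order = [start]
        weights = {v: w[start].get(v, 0) for v in w if v != start}
        while weights:
            nxt = max(weights, key=weights.get)
            cut_of_phase = weights.pop(nxt)
            order.append(nxt)
            for v, wt in w[nxt].items():
                if v in weights:
                    weights[v] += wt
        s, t = order[-2], order[-1]
        if cut_of_phase < best_size:
            best_size, best_side = cut_of_phase, set(members[t])
        # Merge group t into group s.
        members[s] |= members.pop(t)
        for v, wt in w.pop(t).items():
            del w[v][t]
            if v != s:
                w[s][v] = w[s].get(v, 0) + wt
                w[v][s] = w[s][v]
    return best_size, best_side


def hcs(adj):
    """Return a list of clusters (node sets) found by recursive minimum cuts."""
    nodes = set(adj)
    if len(nodes) < 2:
        return []  # assumption: singletons are not reported as clusters
    cut_size, side = stoer_wagner(adj)
    if cut_size > len(nodes) / 2:
        return [nodes]  # highly connected: the whole graph is one cluster
    clusters = []
    for part in (side, nodes - side):
        sub = {u: adj[u] & part for u in part}  # subgraph induced by this part
        clusters += hcs(sub)
    return clusters


# Hypothetical input: two triangles joined by the single edge 3-4.
graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
clusters = hcs(graph)
```

On this input the first minimum cut removes the edge between nodes 3 and 4, and each triangle then passes the test (edge connectivity 2 exceeds 3 ÷ 2), so the two triangles are returned as clusters.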

HCS: Properties

• Homogeneity
  – Each cluster has a diameter of at most 2
    • Distance is the length of the shortest path between two nodes
      – Determined by number of EDGES traveled between nodes
    • Diameter is the longest distance in the graph
  – Each cluster is at least half as dense as a clique
    • Clique is a graph with maximum possible edge connectivity

[Figure: example graph on nodes a to f, with Dist( a, d ) = 2, Dist( a, e ) = 3, Diam( G ) = 4; and a clique.]
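A short derivation of why both homogeneity claims follow from the definition of a highly connected subgraph. The notation is introduced here and is not from the slides: δ(G) is the minimum degree, N(u) is the set of neighbours of u, V is the node set, and m is the number of edges.

```latex
% Assumption: G is a highly connected cluster on n >= 2 nodes, so k(G) > n/2.
\begin{align*}
\delta(G) &\ge k(G) > \tfrac{n}{2}
  && \text{(removing all edges at one node disconnects it)} \\
|N(u)| + |N(v)| &> n > n - 2
  && \text{for non-adjacent } u, v,\ \text{where } N(u), N(v) \subseteq V \setminus \{u, v\} \\
\Rightarrow\ N(u) \cap N(v) &\ne \emptyset
  \ \Rightarrow\ \mathrm{Dist}(u, v) = 2
  \ \Rightarrow\ \mathrm{Diam}(G) \le 2 \\
m &\ge \tfrac{n\,\delta(G)}{2} > \tfrac{n^2}{4} > \tfrac{1}{2} \cdot \tfrac{n(n-1)}{2}
  && \text{(more than half the edges of a clique on } n \text{ nodes)}
\end{align*}
```

The second line is a counting argument: two neighbour sets that together hold more than n - 2 entries cannot fit into the n - 2 remaining nodes without sharing one.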

HCS: Properties

• Separation
  – Any non-trivial split is unlikely to have diameter of two
  – Number of edges removed by each iteration is linear in the size of the underlying subgraph
    • Compared to quadratic number of edges within final clusters
    • Indicates separation unless sizes are small
    • Does not imply a bound on the number of edges removed overall

HCS: Improvements

[Figure sequence: example graph.] Choosing between cut sets

HCS: Improvements

• Iterated HCS
  – Sometimes there are multiple minimum cuts to choose from
    • Some cuts may create “singletons”, or nodes that become disconnected from the rest of the graph
  – Performs several iterations of HCS until no new cluster is found (to find best final clusters)
    • Theoretically adds another O(n) factor to running time, but typically only needs 1 to 5 more iterations

HCS: Improvements

• Remove low degree nodes first
  – If a node has low degree, it will likely just be separated from the rest of the graph
  – Calculating separation for those nodes is expensive
  – Removal helps eliminate unnecessary iterations and significantly reduces running time
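A rough sketch of such a pre-processing pass. The slides do not say what counts as "low" degree or whether removal is repeated, so both the `min_degree` threshold and the choice to repeat until no low-degree node remains are assumptions made here; the function name is illustrative. Graphs are dicts mapping each node to its set of neighbours.

```python
def remove_low_degree(adj, min_degree):
    """Repeatedly drop nodes with fewer than `min_degree` neighbours.

    Returns (pruned_graph, removed_nodes). The input graph is not modified.
    """
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    removed = set()
    while True:
        low = [u for u, vs in adj.items() if len(vs) < min_degree]
        if not low:
            return adj, removed
        for u in low:
            for v in adj.pop(u):
                if v in adj:
                    adj[v].discard(u)
            removed.add(u)


# Hypothetical input: triangle 1-2-3 with a tail 3-4-5 hanging off it.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
pruned, removed = remove_low_degree(graph, min_degree=2)
```

Here node 5 is removed first, which drops node 4 to degree 1, so it is removed on the next pass; only the triangle is left for the clustering step.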

Conclusion

• Performance
  – With improvements, can handle problems with up to thousands of elements in reasonable computing time
  – Generates clusters with high homogeneity and separation
  – More robust (responds better when noise is introduced) than other approaches based on connectivity

References

“A Clustering Algorithm based on Graph Connectivity”

By Erez Hartuv and Ron Shamir, March 1999 (Revised December 1999)

http://www.math.tau.ac.il/~rshamir/papers.html
