Using Sparse Matrix Reordering Algorithms for Cluster

Using Sparse Matrix
Reordering Algorithms for
Cluster Identification
Chris Mueller
Dec 9, 2004
Visualizing a Graph as a Matrix
Each row and column in the matrix corresponds to
a node in the graph. The nodes are ordered the
same in the rows and columns, so node 10 is
represented by row=10 and col=10.
Each edge between two nodes (a,b) is rendered
as a dot at (i,j) where i is the row for a and j is the
column for b.
The solid diagonal shows the identity relationship
for each node.
Undirected graphs can be rendered as
lower triangles, with each edge
displayed so that i >= j.
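The rendering described above can be sketched in a few lines of Python. This is only an illustration: the function name `render_lower_triangle`, the `*` and `.` characters, and the sample edges are assumptions, not part of the original talk.

```python
def render_lower_triangle(num_nodes, edges):
    """Return text rows for an undirected graph drawn as a lower triangle."""
    # Start with an empty grid; '.' means "no dot".
    grid = [["."] * num_nodes for _ in range(num_nodes)]
    # The solid diagonal shows the identity relationship for each node.
    for node in range(num_nodes):
        grid[node][node] = "*"
    for a, b in edges:
        # Put the larger index on the row so the dot lands in the lower triangle.
        row, col = max(a, b), min(a, b)
        grid[row][col] = "*"
    return ["".join(line) for line in grid]


# Hypothetical 4-node graph with three undirected edges.
rows = render_lower_triangle(4, [(0, 1), (3, 1), (2, 3)])
print("\n".join(rows))
```

Because each edge is stored once with the larger index as the row, the order in which the two endpoints are given does not matter.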
Visually Identifying Clusters
Dense areas in the matrix reveal potential
clusters.
Some dense areas may be in the same
row or column as others, suggesting a
relationship.
Reordering the nodes (rows/cols)
can reduce the noise in the display
and highlight clusters.
(Some) Previous Work
The basic idea of visualizing relational data as a reordered matrix has been
around since the early days of computer science. Some examples are:
The Reorderable Matrix (Bertin, 1981)
GAP Generalized Association Plots (Chen, 2002)
Block Clustering (Hartigan, 1972)
Bertin (1981), Graphics and Graphic Information Processing. From http://www.math.yorku.ca/SCS/Gallery/bright-ideas.html
www.stat.sinica.edu.tw/SLR/PDF/ 吳漢銘-Cluster_Lecture_040206-new.pdf
Sparse Matrices
Matrices are the basic data structure for
most numerical computations:
    0 1 2 3
    9 3 0 3
    4 8 0 1
    3 5 8 0
Sparse matrices are matrices that do not
need explicit values for each element
(. marks an element with no value):
    . 1 . .
    9 3 0 .
    4 . 0 1
    . . 8 0
Sparse matrices can be stored in memory
in data structures that are more compact
than 2D arrays.
The bandwidth is the number
of diagonals required to store
the matrix. In this example,
the bandwidth is 4.
The banded representation stores only the
diagonals that have values
(n marks an empty slot):
[ 1 0 1 n 3 0 0 9 n 8 4 n ]
Note that zeros may be
important and cannot always
be excluded from the matrix.
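As a rough illustration of banded storage, the Python sketch below walks the diagonals from the top superdiagonal down to the lowest subdiagonal. The names `to_banded` and `EMPTY` are made up for this example, and the 4x4 matrix literal is an assumption reconstructed to match the banded array shown above.

```python
EMPTY = None  # stands in for the "n" (no stored value) marker


def to_banded(matrix):
    """Return (bandwidth, diagonals) for a square matrix with EMPTY gaps.

    Diagonals run from the highest superdiagonal to the lowest subdiagonal,
    and every diagonal between the outermost stored ones is kept in full.
    """
    size = len(matrix)
    # Offset k = col - row: k > 0 lies above the main diagonal, k < 0 below.
    offsets = [c - r for r in range(size) for c in range(size)
               if matrix[r][c] is not EMPTY]
    highest, lowest = max(offsets), min(offsets)
    diagonals = []
    for k in range(highest, lowest - 1, -1):
        diagonals.append([matrix[r][r + k] for r in range(size)
                          if 0 <= r + k < size])
    return len(diagonals), diagonals


n = EMPTY
example = [
    [n, 1, n, n],
    [9, 3, 0, n],
    [4, n, 0, 1],
    [n, n, 8, 0],
]
bandwidth, diagonals = to_banded(example)
print(bandwidth)   # 4
print(diagonals)   # [[1, 0, 1], [None, 3, 0, 0], [9, None, 8], [4, None]]
```

Explicit zeros are treated as stored values here, which matches the note that zeros cannot always be dropped.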
Sparse matrix reordering algorithms reorder
the elements in the matrix to achieve better
use of memory or computational resources:
    1 . . .
    3 9 0 .
    . 4 0 1
    . . 8 0
Swapping columns 1 and 2
reduced the bandwidth to 3,
decreased the amount of
storage required by 2
elements, and removed 2
empty elements.
[ n 0 1 1 9 0 0 3 4 8 ]
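The effect of the column swap can be checked with a small Python sketch. The helper names `swap_columns` and `band_cost` are hypothetical, and the matrix literal is an assumption reconstructed from the banded arrays in the slides. Columns are 0-based in the code, so the slide's columns 1 and 2 are indices 0 and 1.

```python
def swap_columns(matrix, a, b):
    """Return a copy of matrix with columns a and b exchanged (0-based)."""
    swapped = [row[:] for row in matrix]
    for row in swapped:
        row[a], row[b] = row[b], row[a]
    return swapped


def band_cost(matrix):
    """Return (number of diagonals, stored slots) for banded storage."""
    size = len(matrix)
    offsets = [c - r for r in range(size) for c in range(size)
               if matrix[r][c] is not None]
    highest, lowest = max(offsets), min(offsets)
    # A diagonal at offset k holds size - |k| slots.
    slots = sum(size - abs(k) for k in range(lowest, highest + 1))
    return highest - lowest + 1, slots


n = None  # no stored value
before = [
    [n, 1, n, n],
    [9, 3, 0, n],
    [4, n, 0, 1],
    [n, n, 8, 0],
]
after = swap_columns(before, 0, 1)
print(band_cost(before))  # (4, 12)
print(band_cost(after))   # (3, 10)
```

The second result reflects the slide's claim: one fewer diagonal and two fewer stored slots.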
Sparse Matrix Reordering Algorithms
Bandwidth Minimization: Reverse Cuthill-McKee and King’s Algorithm
RCM(matrix):
    Represent the matrix as a graph
    Choose a suitable starting node
    For each node reachable from the current node:
        Output the node
        Find all unvisited neighbors
        Order them based on increasing degree
        Visit them in that order
    Reverse the output order
[Figure: example graph with nodes numbered 1 through 7]
King’s algorithm is similar but it orders based on edges out of the current cluster rather than total edges.
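A minimal Python sketch of the breadth-first procedure above, including the final reversal that gives Reverse Cuthill-McKee its name. The adjacency-dictionary input, the tie-break by node label, and the `bandwidth` helper (largest |i - j| over all edges, a common graph-side definition) are assumptions made for illustration, not details taken from the talk.

```python
from collections import deque


def reverse_cuthill_mckee(adjacency, start):
    """Order the nodes reachable from start; adjacency maps node -> neighbor set."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)  # "Output the node"
        # Unvisited neighbors, lowest degree first (ties broken by label).
        unvisited = [nbr for nbr in adjacency[node] if nbr not in visited]
        unvisited.sort(key=lambda nbr: (len(adjacency[nbr]), nbr))
        for nbr in unvisited:
            visited.add(nbr)
            queue.append(nbr)
    order.reverse()  # the "reverse" in Reverse Cuthill-McKee
    return order


def bandwidth(adjacency, order):
    """Largest distance |i - j| between the positions of two adjacent nodes."""
    position = {node: index for index, node in enumerate(order)}
    return max(abs(position[a] - position[b])
               for a in adjacency for b in adjacency[a])


# Hypothetical path graph 0-3-1-4-2 whose labels hide its simple structure.
adjacency = {0: {3}, 3: {0, 1}, 1: {3, 4}, 4: {1, 2}, 2: {4}}
order = reverse_cuthill_mckee(adjacency, start=0)
print(order)                                   # [2, 4, 1, 3, 0]
print(bandwidth(adjacency, [0, 1, 2, 3, 4]))   # 3
print(bandwidth(adjacency, order))             # 1
```

The sketch orders only the component containing the starting node, so a disconnected graph needs one call per component.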
Minimizing Non-Zero Structure: Modified Minimum Degree
MMD(matrix):
    Represent the matrix as a graph
    Order nodes based on degree
[Figure: example graph with nodes numbered 1 through 9]
Note that these algorithms are stochastic in the choice of starting nodes and ordering for nodes
with the same degree.
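For comparison, here is a sketch of the plain greedy minimum-degree ordering in Python. It is not the MMD variant itself, which adds refinements omitted here. The function name, the adjacency-dictionary input, and the deterministic tie-break by node label are all assumptions made for illustration.

```python
def minimum_degree_order(adjacency):
    """Return an elimination order; adjacency maps node -> set of neighbors."""
    # Work on a copy because elimination rewrites the graph.
    graph = {node: set(nbrs) for node, nbrs in adjacency.items()}
    order = []
    while graph:
        # Pick the node of lowest current degree; ties go to the smallest label.
        node = min(graph, key=lambda n: (len(graph[n]), n))
        neighbors = graph.pop(node)
        for nbr in neighbors:
            graph[nbr].discard(node)
            # Eliminating a node links its neighbors to each other (fill-in).
            graph[nbr] |= neighbors - {nbr}
        order.append(node)
    return order


# Hypothetical star graph: node 0 in the center, leaves 1, 2 and 3.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(minimum_degree_order(star))  # [1, 2, 0, 3]
```

Replacing the label tie-break with a random choice would reproduce the run-to-run variation mentioned in the note above.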
Reordering the COG Database
Basic Protocol:
1. Filter edges based on FASTA score
   1. cmp2 is original data, cmp90, cmp200 are filtered
2. Shuffle the data
3. For each sorted and shuffled graph
   1. Identify the connected components
   2. Apply RCM and King’s algorithm to each component
   3. Apply MMD to the entire graph
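The preparation steps of the protocol might look like the Python sketch below. Several points are assumptions rather than details from the talk: that filtering keeps edges whose score is at least a threshold, that shuffling means randomly relabeling the nodes, and all function names and sample data.

```python
import random


def filter_edges(scored_edges, min_score):
    """Keep (a, b) pairs whose score is at least min_score."""
    return [(a, b) for a, b, score in scored_edges if score >= min_score]


def shuffle_labels(nodes, edges, seed=0):
    """Relabel nodes with a random permutation to remove any input ordering."""
    shuffled = list(nodes)
    random.Random(seed).shuffle(shuffled)
    relabel = dict(zip(nodes, shuffled))
    return [(relabel[a], relabel[b]) for a, b in edges]


def connected_components(nodes, edges):
    """Return a list of node sets, one per connected component."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical scored edges: (node, node, score).
nodes = [0, 1, 2, 3, 4]
scored_edges = [(0, 1, 250), (1, 2, 95), (3, 4, 40)]
kept = filter_edges(scored_edges, min_score=90)
print(len(connected_components(nodes, kept)))                         # 3
print(len(connected_components(nodes, shuffle_labels(nodes, kept))))  # 3
```

Shuffling changes the labels but not the structure, so the component count is the same for both graphs. Each component could then be passed to an ordering routine on its own.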
Results by the Numbers
                     Stats                     Bandwidth
Graph                Edges  Vertices  CC       Starting  CC    RCM   King  MMD
cmp200.49            13789  1891      471      86        68    64    64    1655
cmp90.49             29314  1930      124      807       94    90    89    1919
cmp2.49              91178  1931      2        1696      1813  1338  1329  1755
cmp200.49.shuffle    13789  1891      471      1750      62    64    63    1869
cmp90.49.shuffle     29314  1930      124      1786      116   90    95    1913
cmp2.49.shuffle      91178  1931      2        1893      1828  1316  1468  1874
(but the pictures show sooo much more…)
Visualization Key
Green lines show the
extent of a COG family.
Black dots show the
elements in the family.
Red dots are edges
Blue dots are the
COG families for the
node in column j.
Both axes have the nodes in the same order
Discussion
• All algorithms worked as expected
• However, the matrix ordering goals were too simple
to yield good clusters.
• Possible Future Work
– Extended algorithms that allow more information
to be used
– Exploit features of ordering strategies to do a
second pass that generates better clusters?
– Hypergraph reordering
• (demo of reordering by hand)