Transcript Slide 1

Lecture 3
1. Different centrality measures of nodes
2. Hierarchical Clustering
3. Line graphs
1. Centrality measures
Within graph theory and network analysis, there are
various measures of the centrality of a vertex within a
graph that determine the relative importance of a
vertex within the graph.
We will discuss the following centrality measures:
•Degree centrality
•Betweenness centrality
•Closeness centrality
•Eigenvector centrality
•Subgraph centrality
Degree centrality
Degree centrality is defined as the number of links
incident upon a node, i.e., the degree of the node.
Degree centrality is often interpreted in terms of the
immediate risk of the node for catching whatever is
flowing through the network (such as a virus, or some
information).
The blue nodes have the highest
degree centrality
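As a minimal sketch, degree centrality can be read straight off an adjacency-list representation of the graph. The small graph below is a hypothetical example, not the one in the figure.

```python
def degree_centrality(adj):
    """Return {node: degree} for an undirected graph given as adjacency lists."""
    return {v: len(neighbors) for v, neighbors in adj.items()}

# Hypothetical example graph
graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}
print(degree_centrality(graph))  # node b has the highest degree (3)
```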
Betweenness centrality
The vertex betweenness centrality BC(v) of a vertex v is
defined as:

BC(v) = Σ_{u ≠ v ≠ w} σuw(v) / σuw

Here σuw is the total number of shortest paths between
nodes u and w, and σuw(v) is the number of shortest paths
between u and w that pass through v.
Vertices that occur on many shortest paths between other
vertices have higher betweenness than those that do not.
Betweenness centrality

[Figure: an example graph on vertices a, b, c, d, e, f]

Calculation for node c:

(u,w)   σuw   σuw(c)   σuw(c)/σuw
(a,b)    1      0         0
(a,d)    1      1         1
(a,e)    1      1         1
(a,f)    1      1         1
(b,d)    1      1         1
(b,e)    1      1         1
(b,f)    1      1         1
(d,e)    1      0         0
(d,f)    1      0         0
(e,f)    1      0         0

Betweenness centrality of node c = 6
Betweenness centrality of node a = 0
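The calculation above can be sketched in code. The graph below is an assumed reconstruction consistent with the slide's table (two triangles {a, b, c} and {d, e, f}, with c additionally adjacent to d, e and f); the sketch assumes a connected graph and uses the fact that a shortest u–w path passes through v exactly when d(u,v) + d(v,w) = d(u,w).

```python
from collections import deque

def bfs_counts(adj, s):
    """BFS from s: distances and counts of shortest paths from s to every node."""
    dist, sigma = {s: 0}, {s: 1}
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                q.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj, v):
    """BC(v): sum over pairs u < w (u, w != v) of sigma_uw(v) / sigma_uw."""
    nodes = sorted(adj)
    dist, sigma = {}, {}
    for s in nodes:
        dist[s], sigma[s] = bfs_counts(adj, s)
    bc = 0.0
    for i, u in enumerate(nodes):
        for w in nodes[i + 1:]:
            if v in (u, w):
                continue
            # a shortest u-w path passes through v iff d(u,v) + d(v,w) = d(u,w)
            if dist[u][v] + dist[w][v] == dist[u][w]:
                # shortest paths through v = (paths u->v) * (paths v->w)
                bc += sigma[u][v] * sigma[w][v] / sigma[u][w]
    return bc

# Assumed reconstruction of the slide's graph
graph = {
    "a": ["b", "c"], "b": ["a", "c"],
    "c": ["a", "b", "d", "e", "f"],
    "d": ["c", "e", "f"], "e": ["c", "d", "f"], "f": ["c", "d", "e"],
}
print(betweenness(graph, "c"), betweenness(graph, "a"))  # 6.0 0.0
```

This reproduces the table's result: six pairs contribute 1 each for node c, and nothing routes through node a.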
Betweenness centrality
•Nodes of high betweenness centrality are important
for transport.
•If they are blocked, transport becomes less efficient;
on the other hand, if their capacity is improved,
transport becomes more efficient.
•Edge betweenness is calculated using a similar
concept.
Hue (from red=0 to blue=max)
shows the node betweenness.
http://en.wikipedia.org/wiki/Betweenness_centrality#betweenness
Closeness centrality
The farness of a vertex is the sum of the shortest-path
distances from the vertex to every other vertex in the graph.
The reciprocal of farness is the closeness centrality (CC):

CC(v) = 1 / Σ_{t ∈ V \ {v}} d(v,t)

Here, d(v,t) is the shortest-path distance between vertex v and
vertex t
Closeness centrality can be viewed as the efficiency of a
vertex in spreading information to all other vertices
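For an unweighted graph the distances in the formula come from breadth-first search, so closeness can be sketched as below (assuming a connected graph; the path graph a–b–c is a hypothetical example).

```python
from collections import deque

def closeness(adj, v):
    """CC(v) = 1 / farness, where farness = sum of shortest-path distances from v."""
    dist = {v: 0}
    q = deque([v])
    while q:  # breadth-first search gives shortest-path distances
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return 1.0 / sum(dist.values())  # dist[v] = 0 contributes nothing

# Path graph a - b - c: the middle vertex is closest to everything.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(closeness(path, "b"), closeness(path, "a"))  # 0.5 for b, 1/3 for a
```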
Eigenvector centrality
Let A be the N×N adjacency matrix of a graph, let λ be the largest
eigenvalue of A (the eigenvalues are the roots of |A − λI| = 0, where
I is the identity matrix), and let x be the corresponding N×1
eigenvector. Then

λx = Ax -----(1)

The ith component of the eigenvector x gives the eigenvector
centrality score of the ith node in the network.
From (1),

xi = (1/λ) Σ_{j=1}^{N} Aij xj
•Therefore, for any node, the eigenvector centrality score is
proportional to the sum of the scores of all nodes which are
connected to it.
•Consequently, a node has a high EC value either if it is
connected to many other nodes or if it is connected to nodes that
themselves have high EC
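The eigenvector for the largest eigenvalue can be found by power iteration: repeatedly apply x ← Ax and renormalize. A minimal sketch, assuming a connected graph and a hypothetical hub-shaped example:

```python
def eigenvector_centrality(adj, iterations=200):
    """Power iteration: x <- Ax, renormalized each step; for a connected graph
    this converges to the eigenvector of the largest eigenvalue of A."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # multiply by the adjacency matrix: each node sums its neighbors' scores
        x_new = {v: sum(x[w] for w in adj[v]) for v in adj}
        norm = max(x_new.values())
        x = {v: s / norm for v, s in x_new.items()}
    return x

# Hypothetical graph: hub "h" joined to a, b, c; a and b are also adjacent.
g = {"h": ["a", "b", "c"], "a": ["h", "b"], "b": ["h", "a"], "c": ["h"]}
scores = eigenvector_centrality(g)
print(max(scores, key=scores.get))  # the hub "h" has the highest score
```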
Subgraph centrality
The number of closed
walks of length k starting
and ending on vertex i in
the network is given by
the local spectral
moments μk(i), which
are simply defined as the
ith diagonal entry of the
kth power of the
adjacency matrix A:

μk(i) = (A^k)ii

The subgraph centrality of
vertex i is the weighted sum
SC(i) = Σ_{k≥0} μk(i)/k!, so
shorter closed walks count
more than longer ones.
Subgraph Centrality in Complex
Networks, Physical Review E 71,
056103(2005)
Closed walks can be trivial or
nontrivial and are directly related to
the subgraphs of the network.
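The cited paper's definition SC(i) = Σ_k μk(i)/k! can be sketched by brute force: accumulate diagonal entries of successive matrix powers and truncate the (rapidly converging) series. The triangle graph below is a hypothetical example.

```python
import math

def subgraph_centrality(A, kmax=20):
    """SC(i) = sum_{k>=0} (A^k)_{ii} / k!, with the series truncated at kmax."""
    n = len(A)
    # running matrix power P = A^k, starting from A^0 = I
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    sc = [0.0] * n
    for k in range(kmax + 1):
        for i in range(n):
            sc[i] += P[i][i] / math.factorial(k)  # add mu_k(i) / k!
        P = [[sum(P[i][t] * A[t][j] for t in range(n)) for j in range(n)]
             for i in range(n)]  # P <- P A
    return sc

# Triangle graph: all three vertices are equivalent, so all scores are equal.
A = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
sc = subgraph_centrality(A)
print(sc)
```

For the triangle the eigenvalues of A are 2, −1, −1, so each score equals (e² + 2e⁻¹)/3 ≈ 2.708, which the truncated series matches to high precision.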
Subgraph centrality

Adjacency matrix: Muv = 1 if there is an edge between
nodes u and v, and 0 otherwise.

M =
01000000000000
10110100000000
01011100000000
01101101000000
00110100000000
01111010000000
00000100001000
00010000100000
00000001010011
00000000101011
00000010010000
00000000000010
00000000110101
00000000110010
Subgraph centrality

M² =
10110100000000
04223211000000
12432311000000
12352310100000
03223211000000
12332501001000
01111020010000
01101102010011
00010000421122
00000011240122
00000100102011
00000000110101
00000001221042
00000001221123

For u ≠ v, (M²)uv is the number of common neighbors of
nodes u and v. The diagonal entry (M²)uu is the local spectral
moment μ2(u), which equals the degree of node u.
Table 2. Summary of results of eight real-world complex networks.
Hierarchical Clustering
Data are not always
available as binary
relations, as in the case of
protein-protein
interactions, where we
can directly apply
network clustering
algorithms.
[Figure: a list of interacting protein pairs involving AtpA, AtpB,
AtpE, AtpG and AtpH, alongside the corresponding interaction graph]
In many cases, for
example in microarray
gene expression
analysis, the data are
of multivariate type.
An Introduction to Bioinformatics Algorithms by Jones & Pevzner
Hierarchical Clustering
We can convert multivariate data into networks and then apply the
network clustering algorithms that we will discuss in the next class.
If the dimension of the multivariate data is 3 or less, we can cluster
the data by plotting them directly.
An Introduction to Bioinformatics Algorithms by Jones & Pevzner
Hierarchical Clustering
Some data reveal good cluster structure when plotted, but some
data do not.
Data plotted in 2
dimensions
However, when the dimension is more than 3, we can apply
hierarchical clustering to the multivariate data.
In hierarchical clustering the data are not partitioned into a
particular cluster in a single step. Instead, a series of partitions
takes place.
Hierarchical Clustering
Hierarchical clustering is a technique that organizes
elements into a tree.
A tree is a connected graph that has no cycles.
A tree with n nodes has exactly n − 1 edges.
A Graph
A tree
Hierarchical Clustering
Hierarchical clustering is subdivided into 2 types:
1. agglomerative methods, which proceed by a series of fusions of the n objects
into groups, and
2. divisive methods, which separate the n objects successively into finer
groupings.
Agglomerative techniques are more commonly used.
The data can thus be viewed at one extreme as a single cluster
containing all objects, and at the other extreme as n clusters,
each containing a single object.
Hierarchical Clustering
Distance measurements
The Euclidean distance between points p = (p1, p2, ..., pn) and
q = (q1, q2, ..., qn), in Euclidean n-space, is defined as:

d(p, q) = √((p1 − q1)² + (p2 − q2)² + ... + (pn − qn)²)

Euclidean distance between g1 and g2:

√((10 − 10)² + (8 − 0)² + (10 − 9)²) = √(0 + 64 + 1) = 8.0622
Hierarchical Clustering
An Introduction to Bioinformatics Algorithms by Jones & Pevzner
Instead of Euclidean distance, correlation can also be used as
a distance measurement.
For biological analyses involving genes and proteins, nucleotide
and/or amino acid sequence similarity can also be used as the
distance between objects
Hierarchical Clustering
•An agglomerative hierarchical clustering procedure produces
a series of partitions of the data, Pn, Pn-1, ....... , P1. The first, Pn,
consists of n single-object 'clusters'; the last, P1, consists of a
single group containing all n cases.
•At each particular stage the method joins together the two
clusters which are closest together (most similar). (At the first
stage, of course, this amounts to joining together the two
objects that are closest together, since at the initial stage each
cluster has one object.)
Hierarchical Clustering
An Introduction to Bioinformatics Algorithms by Jones & Pevzner
Differences between methods arise because of the
different ways of defining distance (or similarity)
between clusters.
Hierarchical Clustering
How can we measure distances between clusters?
Single linkage clustering
Distance between two clusters A and B, D(A,B) is computed as
D(A,B) = Min { d(i,j) : object i is in cluster A and
object j is in cluster B }
Hierarchical Clustering
Complete linkage clustering
Distance between two clusters A and B, D(A,B) is computed as
D(A,B) = Max { d(i,j) : object i is in cluster A and
object j is in cluster B }
Hierarchical Clustering
Average linkage clustering
Distance between two clusters A and B, D(A,B) is computed as
D(A,B) = TAB / (NA × NB)
where TAB is the sum of all pairwise distances between objects
of cluster A and cluster B, and NA and NB are the sizes of clusters
A and B respectively.
Total NA * NB edges
Hierarchical Clustering
Average group linkage clustering
Distance between two clusters A and B, D(A,B) is computed as
D(A,B) = Average { d(i,j) : observations i and j are in
cluster t, the cluster formed by merging clusters A and B }
Total n(n-1)/2 edges
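The linkage definitions above can be combined into one agglomerative sketch: precompute pairwise distances, then repeatedly merge the two closest clusters under the chosen linkage. This is a naive O(n³)-style sketch on hypothetical 2-D points, not an optimized implementation.

```python
import math

def single_link(d, A, B):
    return min(d[i][j] for i in A for j in B)

def complete_link(d, A, B):
    return max(d[i][j] for i in A for j in B)

def average_link(d, A, B):
    # T_AB / (N_A * N_B): mean of all pairwise distances between the clusters
    return sum(d[i][j] for i in A for j in B) / (len(A) * len(B))

def agglomerate(points, linkage=single_link):
    """Repeatedly merge the two closest clusters; return the merge history."""
    d = [[math.dist(p, q) for q in points] for p in points]
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest linkage distance
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(d, clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Two obvious groups: points 0, 1 near the origin and points 2, 3 far away.
pts = [(0.0, 0.0), (0.0, 1.0), (9.0, 0.0), (9.0, 1.0)]
merges = agglomerate(pts)
print(merges)  # pairs 0+1 and 2+3 merge first, then the two groups join
```

Swapping `linkage=complete_link` or `average_link` changes only the inter-cluster distance rule; for well-separated data like this the merge order is the same.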
Hierarchical Clustering
Alizadeh et al.
Nature 403: 503-511
(2000).
Classifying bacteria
based on 16S rRNA
sequences.
Line Graphs
Given a graph G, its line graph L(G) is a graph such that
each vertex of L(G) represents an edge of G; and
two vertices of L(G) are adjacent if and only if their corresponding
edges share a common endpoint ("are adjacent") in G.
Graph G
Vertices in L(G)
constructed
from edges in G
Added
edges in
L(G)
http://en.wikipedia.org/wiki/Line_graph
The line
graph L(G)
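The definition above translates directly into code: make one L(G)-vertex per edge of G, then connect two of them whenever the edges share an endpoint. The path graph a-b-c-d is a hypothetical example; its line graph is the path ab-bc-cd.

```python
def line_graph(edges):
    """Build L(G): one vertex per edge of G; two L(G)-vertices are adjacent
    iff their corresponding edges share an endpoint in G."""
    edges = [tuple(sorted(e)) for e in edges]
    L = {e: set() for e in edges}
    for i, e in enumerate(edges):
        for f in edges[i + 1:]:
            if set(e) & set(f):  # common endpoint in G
                L[e].add(f)
                L[f].add(e)
    return L

# Hypothetical example: the path a-b-c-d
lg = line_graph([("a", "b"), ("b", "c"), ("c", "d")])
print(sorted(lg[("b", "c")]))  # [('a', 'b'), ('c', 'd')]
```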
Line Graphs
RASCAL: Calculation of Graph Similarity using Maximum Common Edge
Subgraphs
by John W. Raymond, Eleanor J. Gardiner and Peter Willett
The Computer Journal, Vol. 45, No. 6, 2002
The above paper has introduced a new graph similarity
calculation procedure for comparing labeled graphs.
The chemical graphs G1
and G2 are shown in
Figure a,
and their respective line
graphs are depicted in
Figure b.
Line Graphs
Detection of Functional Modules From
Protein Interaction Networks
by Jose B. Pereira-Leal, Anton J. Enright, and Christos A. Ouzounis
PROTEINS: Structure, Function, and Bioinformatics 54:49–57 (2004)
A star is transformed into a clique
Transforming a network of proteins
to a network of interactions. a:
Schematic representation
illustrating a graph representation
of protein interactions: nodes
correspond to proteins and edges to
interactions. b: Schematic
representation illustrating the
transformation of the protein graph
connected by interactions to an
interaction graph connected by
proteins. Each node represents a
binary interaction and edges
represent shared proteins. Note that
labels that are not shared
correspond to terminal nodes in (a).