Comparative Genomics
(Network Biology)
Today’s lecture will cover the following three topics
1. On finding clusters in undirected simple graphs: application
to protein complex detection
2. DPClus software tool
3. Introduction to DPClusO
On finding clusters in undirected simple graphs:
application to protein complex detection
Outline
•Introduction
•Some basic concepts
•The proposed algorithm
•The DPClus software
•Results & Discussion
•Conclusions
Introduction
•There is no universal definition of a cluster.
•But clustering is an important issue.
•Consequently there are diverse definitions and various
methods.
•The major purpose of clustering is finding cohesive groups.
•Here, we are going to discuss a graph clustering algorithm.
Introduction
Regarding a graph, a cluster is a subgraph whose nodes are
densely connected with each other compared to their connections
with other nodes in the graph.
This is a flexible definition of a cluster.
Intuitively, we can recognize two clusters in this arbitrary graph.
But it is difficult to draw a big graph revealing its clusters.
Introduction
An E. coli protein-protein interaction network consisting of 3007 proteins and 11531 interactions (from Mori Lab, NAIST, Japan).
An algorithm is needed to detect locally dense regions.
Introduction
Md. Altaf-Ul-Amin, Yoko Shinbo, Kenji Mihara, Ken Kurokawa
and Shigehiko Kanaya, “Development and implementation of an
algorithm for detection of protein complexes in large interaction
networks”, BMC Bioinformatics 7:207, April 2006.
Some basic concepts
Two nodes that belong to the same cluster are likely to have more common neighbors than two nodes that do not.
Some basic concepts
•The density d of a cluster is the ratio of the number of edges present in it to the maximum possible number of edges in it.
•For a cluster with |N| nodes and |E| edges, d = |E|/|E|max = 2|E| / (|N|(|N|-1)).
•d is a real number ranging from 0 to 1.
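The density formula above can be sketched in a few lines of Python. The function name `density` and its arguments are illustrative choices, not part of DPClus.

```python
def density(num_nodes: int, num_edges: int) -> float:
    """Density d = 2|E| / (|N|(|N|-1)) of a simple undirected graph."""
    if num_nodes < 2:
        # The formula is undefined for fewer than two nodes.
        raise ValueError("density needs at least two nodes")
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# A graph with 8 nodes and 14 edges has density 0.5;
# a clique of 3 nodes (3 edges) has density 1.0.
d_half = density(8, 14)
d_clique = density(3, 3)
```

Because a simple graph on |N| nodes has at most |N|(|N|-1)/2 edges, the result always lies between 0 and 1.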
Some basic concepts
d=0.9
d=1.0
Density of the total graph = 0.241
The densities of the complexes are relatively high
Some basic concepts
Considering density alone is not enough
•Both graphs consist of 8 nodes and both have density 0.5.
•But one of them seems to be a single cluster while the other is divided into two clusters.
[Figure: two graphs, each with nodes a to h]
Such situations can be tackled by keeping track of the periphery
Some basic concepts
The cluster property of any node n with respect to any cluster k
of density dk and size Nk is defined as follows:
cpnk=|Enk|/(dk* |Nk|)
Here, |Enk| is the total number of edges between the node n and
each of the nodes of cluster k.
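As a minimal sketch, the cluster property can be computed directly from its definition. The function and parameter names below are illustrative assumptions.

```python
def cluster_property(edges_to_cluster: int, cluster_density: float,
                     cluster_size: int) -> float:
    """cp_nk = |E_nk| / (d_k * |N_k|).

    edges_to_cluster: number of edges between node n and the nodes of cluster k
    cluster_density:  density d_k of cluster k
    cluster_size:     number of nodes |N_k| in cluster k
    """
    return edges_to_cluster / (cluster_density * cluster_size)


# A node with 3 edges into a 4-node cluster of density 1.0:
cp_well_connected = cluster_property(3, 1.0, 4)   # 0.75
# A node with 1 edge into a 5-node cluster of density 0.9:
cp_peripheral = cluster_property(1, 0.9, 5)       # about 0.22
```

A node that is connected to the cluster as densely as the cluster's own members gets a value near 1, while a peripheral node gets a small value.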
[Figure: the two 8-node graphs (nodes a to h). The cluster property of node f is 0.57 in one graph and 0.2 in the other.]
The proposed Algorithm
The proposed algorithm is a sequential constructive algorithm:
It initializes the complex/cluster by choosing a seed node.
It then repeatedly adds other nodes on the basis of priority and
some conditions.
The major methods of the algorithm
•Choosing a seed node.
•Selecting a priority node.
•Checking necessary conditions before adding a node to a
complex.
The proposed Algorithm
Inputs to the algorithm are:
•The associated matrix of the network.
•A minimum threshold density for the generated clusters.
•A parameter to determine how we separate a complex from
its periphery.
Outputs of the algorithm are:
Overlapping/non-overlapping complexes whose densities are
greater than or equal to the given density.
The proposed Algorithm
Flowchart of the proposed algorithm
1. Input & initialization: Input an undirected simple graph G. Set thresholds din and cpin and initialize cluster ID k = 1.
2. Termination check: Generate degrees of the nodes of G and determine the highest node degree (Dh). If Dh = 0, end.
3. Seed selection: Generate the weight of each node of G. If the highest node weight is 0, start at the highest degree node of G as the kth cluster; otherwise start at the highest weight node of G as the kth cluster.
4. Cluster formation: Generate the neighbors of the kth cluster in G and sort them according to priority. Add the highest priority neighbor (p) to the cluster. If dk > din and cpp(k-p) > cpin, repeat step 4. Otherwise deduct the last added node from the kth cluster; if not all neighbors of the kth cluster have been checked, add the next priority neighbor (p) to the kth cluster and check again.
5. Output & update: When all neighbors of the kth cluster have been checked, print the kth cluster, set G ← G – kth cluster and k ← k+1, and return to step 2.
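The flowchart can be illustrated with a simplified Python sketch for the non-overlapping case. This is not the DPClus implementation: the function name `dpclus_sketch`, the tie-breaking by smallest node label, the use of "greater than or equal" for both thresholds, and treating a single node as having density 1 are all assumptions made for illustration. The edge list is the 14-node example graph of these slides with nodes numbered 0 to 13, and the thresholds 0.7 and 0.5 are chosen only for the demonstration.

```python
def dpclus_sketch(edges, d_in, cp_in):
    """Greedy density-periphery clustering, simplified and non-overlapping."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def edge_weight(u, v):
        # Number of common neighbors of u and v in the remaining graph.
        return len(adj[u] & adj[v])

    def density(nodes):
        members = set(nodes)
        if len(members) < 2:
            return 1.0  # assumption: a single node counts as density 1
        n_edges = sum(len(adj[u] & members) for u in members) // 2
        return 2 * n_edges / (len(members) * (len(members) - 1))

    clusters = []
    while any(adj.values()):  # termination check: highest degree is 0
        nodes = sorted(adj)
        weight = {u: sum(edge_weight(u, v) for v in adj[u]) for u in nodes}
        if max(weight.values()) > 0:
            seed = max(nodes, key=lambda u: weight[u])    # highest weight node
        else:
            seed = max(nodes, key=lambda u: len(adj[u]))  # highest degree node
        cluster = [seed]
        grown = True
        while grown:
            grown = False
            members = set(cluster)
            d_k = density(cluster)
            neighbors = set().union(*(adj[u] for u in cluster)) - members

            def priority(p):
                # Sum of edge weights to the cluster, then number of edges.
                links = adj[p] & members
                return (-sum(edge_weight(p, u) for u in links), -len(links), p)

            for p in sorted(neighbors, key=priority):
                cp = len(adj[p] & members) / (d_k * len(cluster))
                if cp >= cp_in and density(cluster + [p]) >= d_in:
                    cluster.append(p)
                    grown = True
                    break
        clusters.append(cluster)
        for u in cluster:  # G <- G - kth cluster
            for v in adj.pop(u):
                if v in adj:
                    adj[v].discard(u)
    return clusters


# The 14-node example graph of the slides, nodes numbered 0..13.
EXAMPLE_EDGES = [
    (0, 1), (1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4),
    (3, 5), (3, 7), (4, 5), (5, 6), (6, 10), (7, 8), (8, 9), (8, 12),
    (8, 13), (9, 10), (9, 12), (9, 13), (11, 12), (12, 13),
]
clusters = dpclus_sketch(EXAMPLE_EDGES, d_in=0.7, cp_in=0.5)
```

With these settings the first cluster grows from node 2 to five nodes with density 0.9, after which every remaining neighbor has a cluster property of about 0.22 and is rejected.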
The proposed Algorithm
M =
01000000000000
10110100000000
01011100000000
01101101000000
00110100000000
01111010000000
00000100001000
00010000100000
00000001010011
00000000101011
00000010010000
00000000000010
00000000110101
00000000110010
Muv = 1 if there is an edge between nodes u and v and 0 otherwise.
The proposed Algorithm
M2 =
10110100000000
04223211000000
12432311000000
12352310100000
03223211000000
12332501001000
01111020010000
01101102010011
00010000421122
00000011240122
00000100102011
00000000110101
00000001221042
00000001221123
(M2)uv for u ≠ v represents the number of common neighbors of the nodes u and v.
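The squaring step can be checked with a short Python sketch that uses the matrix M of the example graph. The variable names are illustrative; the edge and node weights follow the definitions used on the slides (edge weight = number of common neighbors of its two end nodes, node weight = sum of the weights of its edges).

```python
# Rows of the adjacency matrix M of the 14-node example graph.
ROWS = [
    "01000000000000", "10110100000000", "01011100000000", "01101101000000",
    "00110100000000", "01111010000000", "00000100001000", "00010000100000",
    "00000001010011", "00000000101011", "00000010010000", "00000000000010",
    "00000000110101", "00000000110010",
]
M = [[int(c) for c in row] for row in ROWS]
n = len(M)

# (M2)[u][v] = sum over w of M[u][w] * M[w][v]:
# the number of common neighbors of u and v (the degree of u when u == v).
M2 = [[sum(M[u][w] * M[w][v] for w in range(n)) for v in range(n)]
      for u in range(n)]

# Weight of each existing edge (u, v) with u < v.
edge_weight = {(u, v): M2[u][v]
               for u in range(n) for v in range(u + 1, n) if M[u][v]}

# Weight of each node: sum of the weights of its edges.
node_weight = [sum(M2[u][v] for v in range(n) if M[u][v]) for u in range(n)]
```

For this graph three nodes reach the highest weight of 10, so one of them is taken as the first seed.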
The proposed Algorithm
[Figure: the example graph with a weight (0, 2 or 3) on each edge]
The weights of edges are derived by squaring the associated matrix of the graph
The proposed Algorithm
[Figure: the example graph with edge weights and node weights (0, 6 or 10)]
The weights of nodes (sum of the weights of the connecting edges)
The proposed Algorithm
[Figure: the example graph with the seed marked; the seed is a node of weight 10]
Neighbors | Sum of edge weights | # of edges
P1 | 2 | 1
P3 | 3 | 1
P4 | 2 | 1
P5 | 3 | 1
The proposed Algorithm
[Figure: the example graph; the neighbors of the seed sorted by priority]
Neighbors | Sum of edge weights | # of edges
P3 | 3 | 1
P5 | 3 | 1
P1 | 2 | 1
P4 | 2 | 1
cp of P3 = 1
The proposed Algorithm
[Figure: the growing cluster, d=1.0]
Neighbors | Sum of edge weights | # of edges
P1 | 4 | 2
P4 | 4 | 2
P5 | 6 | 2
P7 | 0 | 1
The proposed Algorithm
[Figure: the growing cluster, d=1.0; neighbors sorted by priority]
Neighbors | Sum of edge weights | # of edges
P5 | 6 | 2
P1 | 4 | 2
P4 | 4 | 2
P7 | 0 | 1
cp of P5 = 1
The proposed Algorithm
[Figure: the growing cluster, d=1.0, with a table of the sum of edge weights and # of edges for each neighbor]
Neighbors: P1, P4, P6, P7
cp of P1 = 1
The proposed Algorithm
[Figure: the growing cluster, d=1.0, with a table of the sum of edge weights and # of edges for each neighbor]
Neighbors: P0, P4, P6, P7
The proposed Algorithm
[Figure: the growing cluster, d=1.0; neighbors sorted by priority, with a table of the sum of edge weights and # of edges for each neighbor]
Neighbors: P4, P0, P6, P7
cp of P4 = 0.75
The proposed Algorithm
[Figure: the cluster after the last addition, d=0.9]
Neighbors | Sum of edge weights | # of edges | cp value
P0 | 0 | 1 | ~0.22
P6 | 0 | 1 | ~0.22
P7 | 0 | 1 | ~0.22
The proposed Algorithm
[Figure: the remaining graph after removal of the first cluster, with recomputed weights and the next seed marked]
The remaining graph
The proposed Algorithm
[Figures: three successive steps of cluster formation in the remaining graph; d=1.0 at each step]
The proposed Algorithm
The remaining graph
The proposed Algorithm
Clustering by the proposed algorithm: Example
[Figure (i): example graph with nodes A to L]
1. Input and initialization: cpin = 0.4, din = 0.6
[Figure (i): the input graph with nodes A to L]
1. Seed selection-1: Calculation of weights of edges
[Figure: the graph with a weight (0 to 3) on each edge]
1. Seed selection-2: Calculation of weights of nodes
Selected seed: D (node weight 10)
[Figure (iii): seed selection for cluster 1; the graph with edge weights and node weights]
2. Cluster formation-1: Calculation of weights of nodes
Cluster 1: d1 = 1
Candidate merged to Cluster 1
[Figure (iv): formation of cluster 1]
2. Cluster formation-2
cpC1 = 1/(1*1) = 1 > 0.4 (cpin)
Check thresholds: OK, d1 = 1/1 = 1 > 0.6
Candidate merged to Cluster 1
[Figure (v): formation of cluster 1]
2. Cluster formation-3
Cluster 1: d1 = 3/3 = 1
cpA1 = 2/(1x2) = 1 > 0.4
[Figure (vi): formation of cluster 1]
2. Cluster formation-4
cpB1 = 3/(1x3) = 1 > 0.4 (cpin)
Check thresholds: OK, d1 = 1/1 = 1 > 0.6
Candidate merged to Cluster 1
[Figure (vii): formation of cluster 1]
2. Cluster formation-5
cpL1 = 2/(1*4) = 0.5 > 0.4 (cpin)
Check thresholds: OK, d1 = 8/10 = 0.8 > 0.6
Candidate merged to Cluster 1
[Figure (viii): formation of cluster 1]
2. Cluster formation-6
cpE1 = 2/(0.8*5) = 0.5 > 0.4 (cpin)
Check thresholds: OK, d1 = 10/15 = 0.67 > 0.6
Candidate merged to Cluster 1
[Figure (ix): search for cluster 1]
2. Cluster formation-7
cpE1 = 1/(0.52*6) = 0.32 < 0.4 (cpin)
Check thresholds: Out, d1 = 11/21 = 0.52 < 0.6
[Figure (ix): search for cluster 1]
2. Cluster formation-8
cpF1 = 1/(0.52*6) = 0.32 < 0.4 (cpin)
Check thresholds: Out, d1 = 11/21 = 0.52 < 0.6
[Figure (ix): search for cluster 1]
2. Cluster formation-8
cpF1=1/(0.52*6)=0.0 < 0.4 (cpin )
Check thresholds: Out, d1 = 11/21 = 0.52 < 0.6
[Figure (ix): search for cluster 1]
2. Cluster formation-9: Remove the edges and nodes belonging to Cluster 1
[Figure (x): cluster 1 removed; the remaining graph contains nodes F, G, H, I, J and K]
Results of Density Periphery Clustering
Cluster 1: d1 = 10/15 = 0.67
Cluster 2: d2 = 3/3 = 1
Cluster 3: d3 = 3/3 = 1
[Figure (x): end; the graph with nodes A to L divided into the three clusters]
Results: Complexes in the E. coli PPI Network
http://dip.mbi.ucla.edu/
Interacting protein pairs (DIP ID and protein name):
DIP:339N GroEL | DIP:1081N PrnP
DIP:1025N CarB | DIP:1026N CarA
DIP:539N MalG | DIP:508N MalE
DIP:124N XerD | DIP:726N XerC
DIP:367N PntB | DIP:366N PntA
DIP:342N SbcC | DIP:572N Gam
...
The network of E. coli proteins consists of 363 interactions involving a total of 336 proteins.
Results: Complexes in the E. coli PPI Network
•Components of RNA polymerase (RpoA, RpoB, RpoC, Rsd, RpoZ, RpoD, RpoN, FliA)
•Components of ATP synthetase (AtpA, AtpB, AtpE, AtpF, AtpG, AtpH, AtpL)
•Proteins involved in cell division (FtsQ, FtsI, FtsW, FtsN, FtsK and FtsL)
•Components of DNA polymerase (DnaX, HolA, HolB, HolD and HolC)
Results: Complexes in the S. cerevisiae PPI Network
We extracted a set of 12487 unique binary interactions involving 4648 proteins by discarding self-interactions from the PPI data obtained from ftp://ftpmips.gsf.de/yeast/PPI/.
Results: Details of a Group of Predicted Complexes
Information on the complexes of size ≥ 6 of the set generated using din=0.7, cpin=0.50 and non-overlapping mode.
We considered 15 functional classes: (1) Cell cycle and DNA processing, (2) Protein with binding function or cofactor requirement (structural or catalytic), (3) Protein fate (folding, modification, destination), (4) Biogenesis of cellular components, (5) Cellular transport, transport facilitation and transport routes, (6) Metabolism, (7) Interaction with the cellular environment, (8) Transcription, (9) Energy, (10) Cell rescue, defense and virulence, (11) Cell type differentiation, (12) Cellular communication/signal transduction mechanism, (13) Protein activity regulation, (14) Protein synthesis, and (15) Transposable elements, viral and plasmid proteins.
[The "Function Class" column (classes 1 to 15) of the slide is graphical and is not reproduced here.]
ID | N | d | Corrected P-value | Gene Name
1 | 28 | 0.71 | 3.9x10^-17 | CTF4,CTF8,CTF18,CTF19,CIN1,CIN2,CIN8,GIM3,GIM4,GIM5,MAD1,MAD2,MAD3,BUB1,BUB3,PAC2,PAC10,ARP6,BIK1,BIM1,CHL1,CSM3,DCC1,HTZ1,KAR3,SCC1-73,TUB3,YKE2
2 | 17 | 0.72 | 9.0x10^-13 | CHS3,CHS5,CHS7,BNI1,BNI4,RVS161,RVS167,ARC40,ARP2,BCK1,CLA4,FKS1,KRE1,SKT5,SLT2,SMI1,SWI4
3 | 14 | 1.00 | 1.7x10^-11 | TAF17,TAF25,TAF60,TAF61,TAF90,SPT3,SPT7,SPT8,SPT20,ADA2,GCN5,HFI1,NGG1,TRA1
4 | 14 | 0.83 | 1.1x10^-6 | LSM1,LSM2,LSM3,LSM4,LSM5,LSM6,LSM7,LSM8,DCP1,KEM1,MRNa,PAT1,SNRNa,U6
5 | 13 | 0.71 | 3.7x10^-4 | RAD27,RAD50,CDC45-1,ELG1,ESC2,HPR5,MMS4,MRC1,POL32,RRM3,SGS1,TOF1,TOP3
6 | 12 | 0.94 | 3.4x10^-11 | TRS20,TRS23,TRS31,TRS33,TRS65,TRS85,TRS120,TRS130,BET3,BET5,GSG1,KRE11
7 | 12 | 0.71 | 4.0x10^-6 | COG5,COG6,COG7,COG8,ARL1,ARL3,GOS1,GYP1,RIC1,SWF1,TLG2,YPT6
8 | 11 | 0.98 | 2.1x10^-10 | APC1,APC2,APC4,APC5,APC9,APC11,CDC16,CDC23,CDC26,CDC27,DOC1
9 | 9 | 0.72 | 1.9x10^-5 | CDC73,CTI6,DEP1,LEO1,SAP30,SET2,SIF2,SWR1,VPS71
10 | 8 | 0.93 | 4.8x10^-7 | CFT1,CFT2,FIP1,PAP1,PFS2,PTA1,YSH1,YTH1
11 | 8 | 0.72 | 3.4x10^-5 | MED2,MED4,MED7,MED8,PGD1,RPB3,SOH1,SRB4
12 | 8 | 0.71 | 3.1x10^-9 | BEM1,BEM2,BOI1,BOI2,CDC24,CDC42,MSB1,STE20
13 | 8 | 0.71 | 4.5x10^-7 | ARP1,ASE1,CLB4,JNM1,KAR9,KIP3,NIP100,PAC11
14 | 8 | 0.71 | 6.8x10^-7 | CDC4,CDC34,CDC53,CLN1,CLN2,CLN3,SIC1,SKP1
15 | 7 | 0.95 | 3.5x10^-6 | CDC3,CDC10,CDC11,CDC12,GIN4,SEP7,SHS1
16 | 7 | 0.76 | 5.4x10^-3 | CKA1,CKA2,CKB1,CKB2,CDC7-1,RHO3,TOP2
17 | 7 | 0.71 | 1.3x10^-4 | SNR3,SNR10,SNR11,SNR189,GAR1,NHP2,NOP10
18 | 7 | 0.71 | 3.5x10^-6 | SPC19,SPC24,NNF1,NUF2,SMC1,TID3,YDR295c
19 | 6 | 0.80 | 9.5x10^-4 | YGL161c,YGL198w,GCS1,YDR425w,YIP1,YPL095c
20 | 6 | 0.80 | 1.3x10^-7 | PRP5,PRP9,PRP11,PRP21,NOG2,YNR053c
21 | 6 | 0.73 | 6.3x10^-10 | NUP49,NUP57,APG17,NIC96,NSP1,SEC35
22 | 6 | 0.73 | 1.0x10^-4 | KTR3,LAS17,SLA1,YFR024c,YOR284w,YSC84
23 | 6 | 0.73 | 4.8x10^-1 | ECM31,GCD7,NIP29,TEM1,YJL199c,YPL070w
24 | 6 | 0.73 | 2.3x10^-3 | ERB1,HAS1,NIP7,NOP7,NUG1,SSF1
25 | 6 | 0.73 | 2.4x10^-5 | SEC2,SEC4,SEC10,SEC15,MYO2,SMY1
26 | 6 | 0.73 | 1.0x10^-4 | MYO3,MYO5,BBC1,BZZ1,UBP7,VRP1
27 | 6 | 0.73 | 1.2x10^-3 | DBF2,DBF20,CDC15,LTE1,MOB1,SPO12
28 | 6 | 0.73 | 1.8x10^-5 | HHF1,HHF2,HHT1,HHT2,SPT6,STH1
29 | 6 | 0.73 | 2.3x10^-5 | CBF1,CEP3,CHL4,CTF13,MCM21,MIF2
[Figure: complex 19, panels (a) and (b), with nodes YIP1, GCS1, YDR425w, YGL161c, YGL198w and YPL095c]
Results: Hypergeometric distribution
$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{F}{i}\binom{N-F}{C-i}}{\binom{N}{C}}$$
N= Total number of proteins in the network
F= Number of proteins of a functional group in the network
C= Number of proteins in a cluster
k= Number of proteins of a functional group in a cluster
The p-value of a cluster is the probability that a randomly selected group of C proteins contains at least k proteins of the functional group
The lower the p-value the higher the statistical significance
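As a minimal sketch, the p-value formula can be evaluated exactly with Python's standard library. The function name `hypergeom_p_value` is an illustrative choice, and the numbers in the example (a population of 7 items, 4 of them of the group of interest, 3 drawn) are toy values rather than protein data.

```python
from fractions import Fraction
from math import comb


def hypergeom_p_value(N: int, F: int, C: int, k: int) -> Fraction:
    """P = 1 - sum_{i=0}^{k-1} C(F,i) * C(N-F,C-i) / C(N,C).

    N: total number of proteins in the network
    F: number of proteins of the functional group in the network
    C: number of proteins in the cluster
    k: number of proteins of the functional group in the cluster
    """
    total = comb(N, C)
    # Probability of seeing fewer than k group members in a random draw of C.
    below_k = sum(Fraction(comb(F, i) * comb(N - F, C - i), total)
                  for i in range(k))
    return 1 - below_k


# Toy example: N = 7, F = 4, C = 3, at least k = 2 group members drawn.
p_at_least_two = hypergeom_p_value(7, 4, 3, 2)
```

Using `Fraction` keeps the result exact; for large networks a floating-point version would be more practical.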
P-value & Hypergeometric distribution
3 green and 4 red balls
Put them in a box
Randomly choose any 3
$$P_0(\text{\# of red balls is } 0) = \frac{\binom{4}{0}\binom{3}{3}}{\binom{7}{3}} = \frac{1}{35}$$
$$P_1(\text{\# of red balls is } 1) = \frac{\binom{4}{1}\binom{3}{2}}{\binom{7}{3}} = \frac{12}{35}$$
$$P_2(\text{\# of red balls is } 2) = \frac{\binom{4}{2}\binom{3}{1}}{\binom{7}{3}} = \frac{18}{35}$$
$$P_3(\text{\# of red balls is } 3) = \frac{\binom{4}{3}\binom{3}{0}}{\binom{7}{3}} = \frac{4}{35}$$
Notice that P0 + P1 + P2 + P3 = 1
P-value & Hypergeometric distribution
P0 = 1/35, P1 = 12/35, P2 = 18/35, P3 = 4/35
[Figure: bar chart of the probabilities P0 to P3 (vertical axis 0 to 0.6) against the number of red balls (0 to 3)]
P-value & Hypergeometric distribution
P0 = 1/35, P1 = 12/35, P2 = 18/35, P3 = 4/35
P(# of red balls ≤ 1) = P0 + P1
P(# of red balls ≥ 2) = 1 - (P0 + P1)
P(# of red balls ≥ k) = 1 - (P0 + P1 + … + Pk-1)
$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{F}{i}\binom{N-F}{C-i}}{\binom{N}{C}}$$
N=7, F=4, C=3
Results: Details of a Group of Predicted Complexes
Information on the complexes of size ≥ 6 of the set generated using din=0.7, cpin=0.50 and non-overlapping mode.
Protein YDR425w of complex 19 is related to cellular transport, and YIP1, YGL198w, YGL161c and GCS1 are related to vesicular transport. Hence, we predict that the function-unknown protein YPL095c of this complex is a transport-related protein, most likely related to vesicular transport.
[Figure: complex 19, panels (a) and (b), with nodes YIP1, GCS1, YDR425w, YGL161c, YGL198w and YPL095c]
Conclusions
•In this work, we present an algorithm to detect locally
dense regions in undirected simple graphs.
•The algorithm can be used to detect protein complexes in
large protein-protein interaction networks or co-expressed
gene clusters based on microarray data.
•It can also be used for protein/gene function prediction by
way of finding complexes/clusters in networks consisting of
function known and function unknown proteins.
•Also, DPClus can be applied to other networks where
finding cohesive groups is the goal.
The DPClus software is available at
http://kanaya.naist.jp/DPClus/
2. The DPClus Software
The DPClus software has been developed based on the
proposed algorithm.
Md. Altaf-Ul-Amin, Hisashi Tsuji, Ken Kurokawa, Hiroko Asahi,
Yoko Shinbo, Shigehiko Kanaya, “DPClus: A Density-periphery
Based Graph Clustering Software Mainly Focused on Detection of
Protein Complexes in Interaction Networks”, Journal of Computer
Aided Chemistry , Vol.7, 150-156, 2006.
The DPClus software is available at
http://kanaya.naist.jp/DPClus/
The DPClus Software
The main window of DPClus
The DPClus Software
The input file format
[Figure: the corresponding network with nodes AtpA, AtpB, AtpE, AtpG and AtpH]
List of edges:
AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH
Adjacency matrix:
00101
00011
10001
01001
11110
Adjacency list (node, then its listed neighbors):
AtpA | AtpB, AtpH
AtpB | AtpA, AtpH
AtpH | AtpB, AtpA, AtpG, AtpE
AtpG | AtpH, AtpE
AtpE | AtpG
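The relation between the edge-list and adjacency-matrix formats can be illustrated with a small Python sketch. The node order AtpB, AtpG, AtpA, AtpE, AtpH is an assumption (the slide does not state which row belongs to which protein); it is one order for which the edge list reproduces the matrix shown. The function name is illustrative and is not part of the DPClus software.

```python
def adjacency_matrix(edges, order):
    """Build a symmetric 0/1 adjacency matrix from an edge list."""
    index = {name: i for i, name in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1  # undirected: mirror each edge
    return matrix


EDGES = [("AtpB", "AtpA"), ("AtpG", "AtpE"), ("AtpA", "AtpH"),
         ("AtpB", "AtpH"), ("AtpG", "AtpH"), ("AtpE", "AtpH")]
ORDER = ["AtpB", "AtpG", "AtpA", "AtpE", "AtpH"]  # assumed row order

# One string of 0/1 characters per row, as in the matrix format.
rows = ["".join(str(x) for x in row) for row in adjacency_matrix(EDGES, ORDER)]
```

The last row has four 1s because AtpH is connected to every other node of this small network.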
The DPClus Software
Output file format
ClusterLength of cluster 1 is: 8
RpoA
RpoB
RpoC
Rsd
RpoZ
RpoD
RpoN
FliA
ClusterLength of cluster 2 is: 8
AtpH
AtpG
AtpB
AtpA
AtpF
AtpL
AtpE
AtpB(A)
ClusterLength of cluster 3 is: 5
---------------------------------------------------------------------------
The DPClus Software
Intra cluster edges are green and inter cluster edges are red
Nodes have been arranged by dragging
The DPClus Software
Hierarchical graph of the clusters
The DPClus Software
Clustering of microarray data
Sample microarray data
To apply DPClus, we need to convert this data to a network
The DPClus Software
[Figure: expression data matrix with genes as rows and experiment IDs as columns]
Gene-Gene correlation (m = number of experiments):
$$R_{ij} = \frac{\sum_{k=1}^{m}(x_{ik}-\bar{x}_i)(x_{jk}-\bar{x}_j)}{\sqrt{\sum_{k=1}^{m}(x_{ik}-\bar{x}_i)^2 \sum_{k=1}^{m}(x_{jk}-\bar{x}_j)^2}}$$
Select highly correlated gene pairs
Edges of a Network:
At3g10060 At3g54150
At3g10060 At3g63140
At3g10060 At5g07020
...
The DPClus Software
# of experiments 626
Threshold correlation 0.95
cp value 0.5
density value 0.9
Minimum cluster size 3
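The conversion from expression data to a network can be sketched as follows: compute the correlation for every gene pair and keep the pairs above the threshold as edges. The gene names `g1` to `g3` and their values are made-up illustration data, and the function names are assumptions rather than part of DPClus.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles.

    Note: a constant profile has zero variance and would cause a
    division by zero; such profiles should be filtered out beforehand.
    """
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = sqrt(sum((a - mean_x) ** 2 for a in x) *
               sum((b - mean_y) ** 2 for b in y))
    return num / den


def correlation_edges(expression, threshold):
    """Return gene pairs whose correlation is at least the threshold."""
    return [(g, h) for g, h in combinations(sorted(expression), 2)
            if pearson(expression[g], expression[h]) >= threshold]


# Hypothetical expression profiles over four experiments.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.1],   # nearly proportional to g1
    "g3": [4.0, 3.0, 2.0, 1.0],   # anti-correlated with g1
}
edges = correlation_edges(expression, threshold=0.95)
```

The resulting edge list can then be given to a graph clustering tool in the list-of-edges format.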
The DPClus Software
Electron transport clusters
Ribosomal protein clusters
Photosynthesis clusters
The DPClusO Algorithm
Md. Altaf-Ul-Amin, Masayoshi Wada and Shigehiko Kanaya, “Partitioning a PPI Network into Overlapping Modules Constrained by High-Density and Periphery Tracking”, ISRN Biomathematics, Volume 2012 (2012), Article ID 726429.
DPClusO has been developed on concepts similar to those of DPClus, but
DPClusO is more general and advantageous.
•each node goes to at least one cluster
•no two clusters are completely the same
•density of each cluster is more than or equal to user given density
• clusters are constrained by periphery if that exists
Major differences with DPClus
•each node goes to at least one cluster as big as possible
•Memory efficient
•Faster computation
Example showing difference in clustering by DPClus and
DPClusO
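[Figure: example network with nodes A to T]
Clustering by DPClus: {A, B, C, D, E, F, G}, {H, I, K, L, M}, {O, Q, R, S, T}, {J, N, P}
Clustering by DPClusO: {A, B, C, D, E, F, G}, {O, Q, R, S, T}, {H, I, K, L, M}, {G, I, J}, {J, M, N}, {E, F, H}, {M, N, O}, {J, N, P}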
In both cases clustering was done using din = 0.6 and cpin = 0.5
Evaluation of DPClusO
Measures used for Evaluation
•Overlapping score: how two clusters match with each other
•How a set of predicted clusters matches with a set of known clusters
•How rich a cluster is with similar function proteins
DPClusO generated clusters are not too overlapping
Plot of the number of clusters generated by DPClusO with respect to
maximum overlapping. OVmax=0 means all modules are completely
non-overlapping. For other points OVmax indicates the maximum
overlapping score between any two modules.
DPClusO detected more known protein complexes
[Figure: panels (a) Union, (b) Krogan, (c) DIP, (d) Gavin, (e) MIPS; horizontal axis OV (0 to 1), vertical axis # of matched clusters (0 to 500); series DPClusO, Coach, Core, DPClusO/3, Coach/3, Core/3]
Plots showing how many and to what extent the known protein complexes (all complexes and size 3 or more complexes shown separately) of yeast matched with modules predicted by DPClusO, COACH and CORE corresponding to five different datasets.
By adding simple filtering DPClusO achieved the best F-measure
Variation of F-measure with maximum overlapping score (used as a filtering parameter) for modules of size 3 or more generated by DPClusO, COACH and CORE. The marked horizontal lines indicate F-measures for three algorithms in case of no filtering.
DPClusO is a robust algorithm
[Figure: panels (a) for 5% changes, (b) for 5% changes (S3), (c) for 10% changes, (d) for 10% changes (S3); horizontal axis OV (0.0 to 1.0), vertical axis # of matched clusters (0 to 500); series Original, Add, Remove, Rearrange]
Verifying robustness of DPClusO by comparing generated modules from real and
randomly altered PPI networks in the context of matching with known complexes.
(a) & (b) In case of addition, removal and rearrangement of 5% edges in the
context of all and size 3 or more known complexes respectively. (c) & (d) In case
of addition, removal and rearrangement of 10% edges in the context of all and
size 3 or more known complexes respectively.
DPClusO detected modules are rich with similar function proteins
[Figure: histograms of # of clusters (0 to 1000) against -log(p-value) (1 to 150); panels (a) BP, (b) BP (random), (c) CC, (d) CC (random), (e) MF, (f) MF (random)]
Comparison between the distributions of the high density modules and randomly selected protein groups with respect to -log(p-value) in the contexts of three types of gene ontology terms: (a), (b) biological process (BP), (c), (d) cellular compartment (CC), (e), (f) molecular function (MF).
Also, as a consequence of DPClusO clustering, it was learnt that a PPI network is a combination of mainly high density and star-like modules.
[Figure: the same six panels for star and star-like modules]
Comparison between the distributions of the star and star-like modules and randomly selected protein groups with respect to -log(p-value) in the contexts of three types of gene ontology terms: (a), (b) biological process (BP), (c), (d) cellular compartment (CC), (e), (f) molecular function (MF).
DPClusO is a network clustering algorithm
We can easily convert multivariate data into
networks and apply DPClusO for clustering
DPClusO is freely available at:
http://kanaya.naist.jp/DPClusO