An Efficient Algorithm for
Enumerating Pseudo Cliques
Takeaki Uno
National Institute of Informatics
&
The Graduate University for Advanced Studies
Dec/18/2007 ISAAC, Sendai
Introducing Pseudo Cliques
Analyzing Large Scale Database
• With the rapid growth of database sizes, we have to analyze databases
by computational means
• Finding cliques in similarity/relation graphs is a popular way to
classify the data, or get characterizations of the data
Group of similar
or related objects
• Thanks to good properties such as monotonicity, (maximal) cliques
can be enumerated very quickly (up to 1,000,000/sec)
• Now, we are motivated to find richer objects: dense structures
such as pseudo cliques
Def. Pseudo Clique
• For a vertex set K, the density of K is
$$\mathrm{density}(K) = \frac{\#\text{edges connecting vertices in } K}{|K|(|K|-1)/2}$$
(the denominator is the maximum #edges in K; the density is the ave. ratio of vertices adjacent to a vertex)
- K is a clique ⇔ density is 1
- K is an independent set ⇔ density is 0
- if density is high, K is nearly a clique
For given θ, K is a pseudo clique ⇔ (density of K) ≧ θ
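The definition above can be checked directly on a small graph. The following is a minimal sketch, assuming the graph is stored as a dictionary `adj` mapping each vertex to its set of neighbours; the names `density` and `is_pseudo_clique` and the convention that sets with fewer than two vertices have density 1 are assumptions made for illustration, not part of the slides.

```python
from itertools import combinations

def density(adj, K):
    """Density of vertex set K: #edges inside K divided by |K|(|K|-1)/2."""
    K = set(K)
    if len(K) < 2:
        return 1.0  # assumed convention: sets of size < 2 count as cliques
    edges = sum(1 for u, v in combinations(K, 2) if v in adj[u])
    return edges / (len(K) * (len(K) - 1) / 2)

def is_pseudo_clique(adj, K, theta):
    """K is a pseudo clique iff its density is at least theta."""
    return density(adj, K) >= theta

# Example graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(density(adj, [0, 1, 2]))                 # a clique
print(density(adj, [0, 3]))                    # an independent set
print(is_pseudo_clique(adj, [0, 1, 2, 3], 0.6))
```

In this example the four-vertex set has 4 of the 6 possible edges, so it is a pseudo clique for θ = 0.6 but not for θ = 0.7.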
We want to solve the problem of
enumerating all pseudo cliques of the given graph
Existing Results
• Easy to find one pseudo clique
two connected vertices always form a pseudo clique
• Finding a pseudo clique of size k is NP-complete
(reduction from the k-clique problem by setting θ = 1)
• Approximation algorithms for maximizing the density for size k
- O(|V|^{1/3-ε}) approximation algorithm
- O((n/k)^ε) approx. if optimal solution is dense [Tokuyama et al.]
- PTAS if Ω(n^2) edges [Arora et al.]
• Many heuristic algorithms in data mining, data engineering, natural sciences
• However, no algorithm for "complete" enumeration
Hardness for Branch-and-Bound
• A straightforward approach is branch and bound
• In each iteration, divide the problem into two non-empty
problems by the inclusion of a vertex
• The existence of a pseudo clique in a subproblem is NP-complete
[Figure: branch-and-bound tree; nodes labeled by the vertices decided so far (v1; v1, v2)]
Proof of the Hardness
Theorem 1 For given graph G, threshold θ, and vertex set U,
the problem of checking the existence of a pseudo clique
including U is NP-complete
Proof: reduction from the problem of finding a clique of k vertices
• Add 2|V|^2 vertices as U to the input graph G = (V, E), with
density = (|V|^2 - 1) / |V|^2 and θ = (|V|^2 - 1) / |V|^2 + ε
• only (U + clique) is a pseudo clique
• density increases as the clique size increases
• set ε s.t. a clique of size at least k
induces a pseudo clique
Is This Really Hard?
• We proved NP-hardness for "very dense graphs"
- unclear for graphs of middle density
- possibility of polynomial time enumeration
[Figure: difficulty versus θ, with labels: θ = 1 easy, θ = 0 easy, hard, ?????]
Polynomial Time Enumeration
Reverse Search Approach
• Introduce an acyclic parent-child relation on all objects (pseudo cliques)
• Enumeration by traversing the tree induced by the relation
• Need an algorithm for listing all children
Parent of Pseudo Clique
• v*(K): the min. degree vertex in G[K] (ties broken by min. index)
• The parent of pseudo clique K := K\v*(K)
[Figure: K and the parent of K]
• Density of K = ave. degree of G[K] / (|K|-1)
• The parent is obtained by removing the most "sparse" vertex from K, thus
is a pseudo clique
• The parent is smaller than its child ⇒ acyclic relation
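The parent definition on this slide can be sketched as follows. This is an illustrative sketch only, assuming an adjacency dictionary `adj` of neighbour sets; the helper names `degree_in`, `v_star`, `parent` and `density` are hypothetical. Removing a vertex whose degree is at most the average degree of G[K] cannot lower the density, which the example checks numerically.

```python
def degree_in(adj, K, v):
    """Number of neighbours of v inside K, i.e. the degree of v in G[K]."""
    return len(adj[v] & K)

def v_star(adj, K):
    """Minimum-degree vertex of G[K]; ties broken by minimum index."""
    return min(K, key=lambda v: (degree_in(adj, K, v), v))

def parent(adj, K):
    """Parent of a non-empty vertex set K: remove v*(K)."""
    return K - {v_star(adj, K)}

def density(adj, K):
    """#edges in G[K] divided by |K|(|K|-1)/2."""
    k = len(K)
    if k < 2:
        return 1.0  # assumed convention for sets of size < 2
    edges = sum(degree_in(adj, K, v) for v in K) // 2
    return edges / (k * (k - 1) / 2)

# Example graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
K = frozenset({0, 1, 2, 3})
P = parent(adj, K)           # removes vertex 3, the only vertex of degree 1
print(sorted(P), density(adj, P) >= density(adj, K))
```

Since each parent has exactly one vertex fewer than its child, following parents always ends at the empty set, which is why the relation is acyclic.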
Ex. Enumeration Tree
• threshold = 0.7
[Figure: enumeration tree of pseudo cliques for an example graph with vertices 1 to 7]
Finding Children
• A child is obtained by adding a vertex to the parent
• degK(v): #vertices in K adjacent to v
(can be maintained in O(Δ) time for vertex addition)
• K∪v is a child of K ⇔
① K∪v is a pseudo clique ⇒ lower bound for degK(v)
② v*(K∪v) = v ⇒ upper bound for degK(v)
- degK(v) < min. deg. of K
⇒ K∪v is always a child (if ① holds)
- degK(v) > min. deg. of K +1 ⇒ K∪v is never a child
• degK(v) = min. deg. of K or +1 ⇒ next slide…
Detailed Condition
• S(K): sequence of vertices in K in the order of (degree, index)
• K∪v is a child ⇔ v is the top of S(K∪v)
(the top of S(K) is v*(K))
• K∪v is a child only if v is adjacent to all vertices preceding v in S(K)
• For each vertex, find the first "non-adjacent vertex" in S(K)
• This can be done in O(Δ^2) time
⇒ Computation time for one iteration is O(Δ^2 + log |V|)
( O(Δk + log |V|) if k-degenerate)
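The reverse search described on these slides can be sketched as below. This is a deliberately naive sketch, not the O(Δ^2 + log |V|) procedure from the slides: it tests conditions ① and ② for every vertex by recomputing degrees from scratch. The function names, the adjacency-dictionary input, the use of `Fraction` for an exact threshold, and the convention that the empty set and single vertices are pseudo cliques (so the empty set is the root) are all assumptions made for illustration.

```python
from fractions import Fraction

def edges_in(adj, K):
    """Number of edges of G[K]."""
    return sum(len(adj[v] & K) for v in K) // 2

def is_pseudo_clique(adj, K, theta):
    """Density test; sets of size < 2 are accepted by assumed convention."""
    k = len(K)
    return k < 2 or 2 * edges_in(adj, K) >= theta * k * (k - 1)

def v_star(adj, K):
    """Minimum-degree vertex of G[K]; ties broken by minimum index."""
    return min(K, key=lambda v: (len(adj[v] & K), v))

def enumerate_pseudo_cliques(adj, theta):
    """Depth-first traversal of the tree induced by the parent relation."""
    found = []

    def visit(K):
        found.append(K)
        for v in adj:
            if v in K:
                continue
            child = K | {v}
            # (1) K+v is a pseudo clique, (2) v is the vertex the parent rule removes
            if is_pseudo_clique(adj, child, theta) and v_star(adj, child) == v:
                visit(child)

    visit(frozenset())  # root of the tree: the empty set
    return found

# Example graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
theta = Fraction(7, 10)
found = enumerate_pseudo_cliques(adj, theta)
print(len(found))  # 10: the empty set, 4 vertices, 4 edges, 1 triangle
```

Because every pseudo clique has exactly one parent, each set is output exactly once and no table of already visited sets is needed; only the current path of the recursion is kept in memory.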
Computational Experiments
Implementation
• Code is a simple version
- update degK(vi) at each addition:
adding u to K takes O(deg(u)) time
- to find children, check each vi satisfying
θ|K|(|K|+1)/2 - (#edges in K) ≦ degK(vi) ≦ d*(K)+1
⇒ O( C d*(K)) = O(|E|) time
+ O(1) time for each
C := #vertices vi with degK(vi) = d*(K) or d*(K)+1
C seems not to be large relative to #children
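The bookkeeping on this slide can be sketched as follows. This is a hedged illustration, not the author's C code: `add_vertex` and `candidates` are hypothetical names, `deg_K` is assumed to be a dictionary holding degK(v) for every vertex, and the lower bound is written with the factor 1/2 that follows from the density definition, since K∪v has |K|(|K|+1)/2 vertex pairs.

```python
from fractions import Fraction

def add_vertex(adj, K, deg_K, u):
    """Add u to K, updating deg_K in O(deg(u)) time.
    Returns the number of new edges inside K."""
    new_edges = deg_K[u]
    K.add(u)
    for w in adj[u]:
        deg_K[w] += 1
    return new_edges

def candidates(adj, K, deg_K, edges_K, theta):
    """Vertices v outside K whose deg_K(v) lies between the two bounds.
    This is a superset of the children: boundary cases still need the tie check."""
    k = len(K)
    lower = theta * k * (k + 1) / 2 - edges_K          # K+v must be a pseudo clique
    d_star = min((deg_K[v] for v in K), default=0)     # min. degree of G[K]
    upper = d_star + 1                                 # v must be able to be v*(K+v)
    return [v for v in adj if v not in K and lower <= deg_K[v] <= upper]

# Example graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
K, deg_K, edges_K = set(), {v: 0 for v in adj}, 0
for u in (0, 1):
    edges_K += add_vertex(adj, K, deg_K, u)
print(candidates(adj, K, deg_K, edges_K, Fraction(7, 10)))
```

For K = {0, 1} and θ = 0.7 the bounds are 1.1 ≦ degK(v) ≦ 2, so only vertex 2 survives the filter. It is still not a child here, because after adding it all three vertices have degree 2 and vertex 0 wins the index tie, which is exactly the boundary case handled by the detailed condition.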
Problem Instances
• Pentium M 1.1GHz, 256MB memory, Cygwin, C, gcc
• Test instances are:
- random graphs
(make edge with probability p),
- locally dense random graphs
(vertex i is adjacent to vertices from i-k to i+k with probability 1/2)
- graphs generated from real-world data
(co-author graph)
Random Graphs
• p= 0.1, #vertices = 200 to 2000, threshold 0.8, 0.9
[Figure: "random graph p=0.1". x-axis: #vertices; y-axis: time (sec) & #cliques (log scale). Series: #clique, time clique, time per 1M clique; #p-clique, time, time per 1M for thresholds 0.9 and 0.8]
Computation time increases linearly with the ave. degree
Locally Dense Random Graph
• make edge from a vertex to its neighbors with p=0.5
• #vertices 100 to 25600, threshold 0.8, 0.9
[Figure: "locally dense random graph". x-axis: #vertices; y-axis: time (sec) & #cliques (log scale). Series: #clique, time clique, time per 1M clique; #p-clique, time, time per 1M for thresholds 0.9 and 0.8]
• 10 times slower than clique enumeration
• computation time per one clique does not change
Randomly Generated Scale Free Graph
• Add vertices of degree 10 iteratively, to a clique of 10 vertices
• Vertices to be connected are chosen according to their current degrees
[Figure: scale free graph. x-axis: #vertices; y-axis: time & #cliques (log scale). Series: #clique, time clique, time per 1M clique; #p-clique, time, time per 1M for thresholds 0.9 and 0.8]
Computation time increases quite slowly
Real-world Instance
• co-author graph of academic paper database
• #vertices = 30,000, #edges = 125,000, scale free
[Figure: "real-world data". x-axis: threshold (0.83 to 1); y-axis: time & #p-cliques (log scale). Series: #p-clique, time, time per 1M]
Computation time for one pseudo clique
does not depend on threshold
Bottom-wideness
• Why good in practice?
• The algorithm generates several recursive calls
⇒ recursion tree expands exponentially by going down
⇒ computation time is dominated by the lowest levels
• On lower levels, small degree vertices are added fast!
[Figure: recursion tree, labeled "Long time" … "Short time"]
When pseudo cliques are sufficiently large (over 5?),
min. degree is small on average
⇒ computation time is short on average at lower levels
Conclusion
• First polynomial delay polynomial space algorithm for enumerating
pseudo cliques
• Hardness result for straightforward branch-and-bound
• Evaluate practical efficiency by computational experiments
Future works:
• Explain the gap between theory and practice
• Introduce maximality and their enumeration
• Apply the technique to other structures (pseudo bla bla bla)
(path, tree, bipartite clique, matching …)
• What is crucial for the computation (enumeration) of structures with
ambiguity?