An Efficient Algorithm for Enumerating Pseudo Cliques
Takeaki Uno
National Institute of Informatics & The Graduate University for Advanced Studies
Dec/18/2007 ISAAC, Sendai

Introducing Pseudo Cliques

Analyzing Large Scale Database
• With the rapid growth of database sizes, we have to analyze databases in some computational way
• Finding cliques in similarity/relation graphs is a popular way to classify the data, or to get characterizations of the data (a clique: a group of similar or related objects)
• Thanks to good properties such as monotonicity, (maximal) cliques can be enumerated very quickly (up to 1,000,000/sec)
• Now, we are motivated to find richer objects: dense structures, such as pseudo cliques

Def. Pseudo Clique
• For a vertex set K, the density of K is
  $\text{density}(K) = \dfrac{\#\text{edges connecting vertices in } K}{(|K|-1)|K|/2}$
  (the denominator is the maximum possible #edges on |K| vertices; the density is the ave. ratio of vertices adjacent to a vertex)
  - K is a clique ⇔ density is 1
  - K is an independent set ⇔ density is 0
  - if the density is high, K is nearly a clique
• For given θ, K is a pseudo clique ⇔ (density of K) ≧ θ
We want to solve the problem of enumerating all pseudo cliques of the given graph

Existing Results
• Easy to find one pseudo clique: two connected vertices always form a pseudo clique
• Finding a pseudo clique of size k is NP-complete: reduce the k-clique problem by setting θ = 1
• Approximation algorithms for maximizing the density for size k
  - O(|V|^(1/3-ε)) approximation algorithm
  - O((n/k)^ε) approx. if the optimal solution is dense [Tokuyama et al.]
  - PTAS if Ω(n^2) edges [Arora et al.]
• Many heuristic algorithms in data mining, data engineering, natural sciences
• However, no algorithm for "complete" enumeration

Hardness for Branch-and-Bound
• A straightforward approach is branch and bound
• In each iteration, divide the problem into two non-empty problems by the inclusion of a vertex
[Figure: branching on the inclusion of v1, then of v2]
⇒ Checking the existence of a pseudo clique (in a subproblem) is NP-comp.
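As a concrete illustration of the density definition on the "Def. Pseudo Clique" slide, here is a minimal Python sketch. The adjacency-set representation, the function names `density` and `is_pseudo_clique`, and the toy graph are assumptions made for this example; they are not taken from the talk (whose implementation is in C).

```python
from itertools import combinations

def density(adj, K):
    """Density of vertex set K: (#edges inside K) / (|K|(|K|-1)/2).

    adj maps each vertex to the set of its neighbours (assumed representation).
    """
    K = set(K)
    k = len(K)
    if k < 2:
        # The slide's formula divides by zero here, so this sketch refuses.
        raise ValueError("density needs at least two vertices")
    edges_inside = sum(1 for u, v in combinations(sorted(K), 2) if v in adj[u])
    return edges_inside / (k * (k - 1) / 2)

def is_pseudo_clique(adj, K, theta):
    """K is a pseudo clique iff density(K) >= theta."""
    return density(adj, K) >= theta

# Hypothetical toy graph: a triangle 1-2-3 plus a pendant edge 2-4.
adj = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2},
    4: {2},
}

print(density(adj, {1, 2, 3}))                # a clique
print(density(adj, {1, 4}))                   # an independent set
print(is_pseudo_clique(adj, {1, 2, 3, 4}, 0.6))
```

The first two calls reproduce the extreme cases from the slide (a clique and an independent set), and any pair of adjacent vertices passes the test for every θ ≤ 1, which is the "easy to find one pseudo clique" observation.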
Proof of the Hardness
Theorem 1: For a given graph G, threshold θ, and vertex set U, the problem of checking the existence of a pseudo clique including U is NP-complete
Proof: reduction from the problem of finding a clique of k vertices
• Add 2|V|^2 vertices as U
[Figure: input graph G = (V, E) with the added vertex set U; labels: $\text{density} = \frac{|V|^2-1}{|V|^2}$, $\theta = \frac{|V|^2-1}{|V|^2} + \varepsilon$]
• only (U + clique) is a pseudo clique
• the density increases as the pseudo clique size increases
• set ε s.t. a clique of size at least k induces a pseudo clique

Is This Really Hard?
• We proved NP-hardness for "very dense graphs"
  - unclear for middle dense graphs
  - possibility of polynomial time enumeration
[Figure: difficulty along the threshold axis between θ = 0 and θ = 1; labels: easy, easy, hard, ?????]

Polynomial Time Enumeration

Reverse Search Approach
• Introduce an acyclic parent-child relation on all pseudo cliques (objects)
  ⇒ Enumeration by traversing the tree induced by the relation
  ⇒ Need an algorithm for listing all children

Parent of Pseudo Clique
• v*(K): the min. degree vertex in G[K] (min. index among ties)
• The parent of pseudo clique K: K \ v*(K)
[Figure: K and the parent of K]
• Density of K = ave. degree of G[K] / (|K|-1)
• The parent is obtained by removing the most "sparse" vertex from K, thus it is a pseudo clique
• The parent is smaller than its child ⇒ acyclic relation

Ex. Enumeration Tree
[Figure: enumeration tree for a graph on vertices 1 to 7, threshold = 0.7]

Finding Children
• A child is obtained by adding a vertex to the parent
• deg_K(v): #vertices in K adjacent to v (can be maintained in O(Δ) time for vertex addition)
• K∪v is a child of K ⇔
  ① K∪v is a pseudo clique ⇒ lower bound for deg_K(v)
  ② v*(K∪v) = v ⇒ upper bound for deg_K(v)
    - deg_K(v) < min. deg. of K ⇒ ② always holds
    - deg_K(v) > min. deg. of K + 1 ⇒ K∪v is never a child
• deg_K(v) = min. deg. of K or +1 ⇒ next slide…

Detailed Condition
• S(K): sequence of vertices in K in the order of (degree, index)
• K∪v is a child ⇔ v is the top of S(K∪v) (the top of S(K) is v*(K))
• K∪v is a child only if v is adjacent to all vertices preceding v in S(K)
• For each vertex, find the first "non-adjacent vertex" in S(K)
• This can be done in O(Δ^2) time
Computation time for one iteration is O(Δ^2 + log |V|) (O(Δk + log |V|) if k-degenerate)

Computational Experiments

Implementation
• Code is a simple version
  - update deg_K(v_i) at each addition: adding u to K takes O(deg(u)) time
  - to find children, look for the v_i satisfying
    $\theta|K|(|K|+1)/2 - (\#\text{edges in } K) \le \deg_K(v_i) \le d^*(K)+1$
    (d*(K): min. degree in G[K])
    ⇒ O(C d*(K)) = O(|E|) time, plus O(1) time for each v_i
  - C := #vertices v_i with deg_K(v_i) = d*(K) or d*(K)+1; C seems not to be large relative to #children

Problem Instances
• Pentium M 1.1GHz, 256MB memory, Cygwin, C, gcc
• Test instances are:
  - random graphs (make each edge with probability p)
  - locally dense random graphs (vertex i is adjacent to each of the vertices from i-k to i+k with probability 1/2)
  - graphs generated from real-world data (co-author graph)

Random Graphs
• p = 0.1, #vertices = 200 to 2000, threshold 0.8, 0.9
[Figure: "random graph p=0.1"; x-axis: #vertices; y-axis: time (sec) & #cliques; series: #clique, time per 1M clique, time clique, #p-clique 0.9, time per 1M 0.9, time 0.9, #p-clique 0.8, time per 1M 0.8, time 0.8]
Computation time increases linearly with the ave. degree

Locally Dense Random Graph
• make an edge from a vertex to each of its neighbors with p = 0.5
• #vertices 100 to 25600, threshold 0.8, 0.9
[Figure: "locally dense random graph"; x-axis: #vertices; y-axis: time (sec) & #cliques; same series as for the random graphs]
• 10 times slower than clique enumeration
• computation time per one clique does not change

Randomly Generated Scale Free Graph
• Add vertices of degree 10 iteratively, to a clique of 10 vertices
• Vertices to be connected are chosen according to their current degrees
[Figure: x-axis: #vertices; y-axis: time & #cliques; same series as for the random graphs]
Computation time increases quite slowly

Real-world Instance
• co-author graph of an academic paper database
• #vertices = 30,000, #edges = 125,000, scale free
[Figure: "real-world data"; x-axis: threshold (0.83 to 1); y-axis: time & #p-cliques; series: #p-clique, time, time per 1M]
Computation time for one pseudo clique does not depend on the threshold

Bottom-wideness
• Why good in practice?
• The algorithm generates several recursive calls
  ⇒ the recursion tree expands exponentially by going down
  ⇒ computation time is dominated by the lowest levels
• On lower levels, small degree vertices are added ⇒ fast!
[Figure: recursion tree; upper levels: long time, lower levels: short time]
When pseudo cliques are sufficiently large (over 5?)
  ⇒ min. degree is small on average
  ⇒ computation time is short on average at lower levels

Conclusion
• First polynomial delay, polynomial space algorithm for enumerating pseudo cliques
• Hardness result for straightforward branch-and-bound
• Evaluated practical efficiency by computational experiments
Future works:
• Explain the gap between theory and practice
• Introduce maximality and the enumeration of maximal pseudo cliques
• Apply the technique to other structures (pseudo bla bla bla) (path, tree, bipartite clique, matching …)
• What is crucial for the computation (enumeration) of structures with ambiguity?
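
The reverse search scheme from the "Parent of Pseudo Clique" and "Finding Children" slides can be sketched in a few lines of Python. This is a naive illustration under stated assumptions, not the talk's implementation: the child test recomputes v*(K∪v) from scratch instead of using the O(Δ^2) method, the names `v_star`, `children`, and `enumerate_pseudo_cliques` are hypothetical, and vertex sets of size 0 or 1 are treated as pseudo cliques by convention so that the empty set can serve as the root of the tree.

```python
def is_pseudo_clique(adj, K, theta):
    """density(K) >= theta; sets of size 0 or 1 pass by assumed convention."""
    k = len(K)
    if k <= 1:
        return True
    m = sum(len(adj[u] & K) for u in K) // 2   # #edges inside K
    return 2 * m >= theta * k * (k - 1)

def v_star(adj, K):
    """Min-degree vertex of G[K], ties broken by the smallest index."""
    return min(K, key=lambda u: (len(adj[u] & K), u))

def children(adj, K, theta):
    """All K+v such that K+v is a pseudo clique and v*(K+v) = v."""
    for v in sorted(set(adj) - K):
        C = K | {v}
        if is_pseudo_clique(adj, C, theta) and v_star(adj, C) == v:
            yield C

def enumerate_pseudo_cliques(adj, theta):
    """Depth-first traversal of the parent-child tree, rooted at the empty set."""
    stack = [frozenset()]
    while stack:
        K = stack.pop()
        if K:                       # the empty root itself is not reported
            yield K
        stack.extend(children(adj, K, theta))

# Hypothetical toy graph: a triangle 1-2-3 plus a pendant edge 2-4.
adj = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2},
    4: {2},
}

for K in sorted(enumerate_pseudo_cliques(adj, 0.8), key=lambda s: (len(s), sorted(s))):
    print(sorted(K))
```

Because every non-empty pseudo clique has exactly one parent, each one is reached exactly once and no table of already-visited sets is needed; this is what keeps the memory use polynomial. Only the memory for the current search path and its pending children is kept in this sketch.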