PowerPoint 演示文稿 - Peking University

Download Report

Transcript PowerPoint 演示文稿 - Peking University

Efficient Cohesive Subgraph
Detection in Parallel
Yingxia Shao
Lei Chen
Bin Cui 
 School
of EECS, Peking University
 Hong Kong University of Science and Technology
1
Outline
• Background
• Preliminaries
• PETA: parallel and efficient truss detection algorithm
•
•
•
•
Basic framework
Triangle complete subgraph
Subgraph-oriented model
Optimization techniques
• Evaluation
• Conclusion
2
Background
Cohesive Subgraph
• Cohesive subgraph
• identifies cohesive subgroups within
social networks.
• helps social network analysts focus
on areas of the network that are
likely to be fruitful.
• E.g., clique, 𝑛-clique, 𝑛-clan, 𝑛-club,
𝑘-plex, 𝑘-core, etc.
Image source: Large Scale Cohesive Subgraph Discovery for Social Network Visual Analysis, VLDB’13
3
Background
𝐾-Truss
l
• 𝑘-truss is a cohesive subgraph, where the
support of each edge is at least 𝑘-2.
• Support of an edge is the number of triangles
that contain the edge.
• The maximal 𝑘-truss.
P1
k
P2
b
c
Problem Statement:
Given a graph 𝐺 = (𝑉, 𝐸) and a threshold 𝑘,
finding the maximal 𝑘-truss in 𝐺 .
o
m
P3
j
a
d
h
i
e
g
f
 The subgraph with black thick edges is a 4-truss.
4
Preliminaries
Fundamental operation
• The operation computes the support of an edge.
• Counting triangles around an edge (𝑢, 𝑣).
• Two solutions
• Classic solution [Cohen ’08, VLDB ’13]
1.
2.
Sorts the neighbors of each vertex in ascending order;
Counts triangles in 𝑂(𝑑 𝑣 + 𝑑(𝑢)) time complexity.
• Index-based solution [Wang ’12]
When processing an edge (𝑢, 𝑣), only enumerate neighbors 𝑤 of vertex 𝑢
which has smaller degree;
2. Test the existence of the third edge (𝑤, 𝑣) with the help of a HashTable.
• Time complexity: 𝑂(min{𝑑 𝑢 , 𝑑(𝑣)})
1.
5
Preliminaries
A straightforward detection framework
• The framework is introduced by J.
Cohen.
1. Enumerate triangles
2. For each edge, record the number of
triangles, containing that edge
3. Keep only edges with the support greater
than 𝑘 − 2
4. If step 3 dropped any edges, return to step 1
5. The remained graph is the maximal 𝑘-truss
• Mapreduce solution [J. Cohen ’09]
• Two MapReduce jobs
• One is for steps 1-2.
• One is for step 3.
• Pregel-based solution [L. Quick ’12]
• Three supersteps.
• Two are for steps 1-2
• One is for step 3.
• Classic solution of the fundamental
operation.
Inefficiency
 high communication cost
 large number of iterations
6
Contributions
• We propose a parallel and efficient truss detection algorithm.
• We introduce a subgraph-oriented programming model to efficiently
implement the algorithm into popular graph computation systems.
• We address the edge-support law in real-world graph.
7
PETA
Basic framework
PETA: Parallel and Efficient Truss detection
Algorithm
• New detection framework behind PETA
1.
2.
3.
4.
Each worker constructs a special-designed subgraph;
Simultaneously detects local 𝑘-truss among workers;
Communicates the update when it is unavoidable;
Goto step 2 until all local 𝑘-trusses are stable.
l
l
n
P1
P2
b
m
a
j
a
d
P3
h
i
①
P2
k
(a) original graph
g
k
i
c
m
m
a
3
i
d
d
e
P3
④
i
m
k
P2
b
d
2
f
g
(b) special subgraphs
New detection framework behind PETA
c
j
d
2
e
m
i
h
i
3
f
②③
f
d
j
2
e
d
k
l
b
c
m
k
m
k
n
P1
o
k
o
1
P1
h
i
f
2
e
P3
3
e
f
g
(c) local 𝑘-trusses
8
PETA
TC Subgraph
Triangle Complete Subgraph
Cross Edge
• Triangle Complete Subgraph (TC-subgraph)
• For internal and cross edges,
• TC-subgraph maintains all their triangles at
local.
• Property
External Edge
k
l
P2
b
a
i
d
• TC-subgraph preserves sufficient knowledge.
• Theorem 1 and Theorem 2 prove the
correctness of new framework in PETA with
TC-subgraph.
m
c
e
f
Internal Edge
9
PETA
Subgraph Model
Subgraph-Oriented Model
• The subgraph-oriented model allows to flexibly process the local
subgraph by designing proper APIs.
Vertex-centric Model
Subgraph-oriented Model
Accessible data
Vertex and one-hop neighbors
Entire local subgraph
Access pattern
Sequence
Sequence/random
Local updates
Require extra supersteps
By-pass
User defined function
Simple
Fruitful
Expressivity
Good
Better
• In PETA, we can use index-based approach to detect local k-truss.
*Refer to the paper for API design.
10
PETA
Local algorithm
Local subgraph algorithm for PETA
• The algorithm contains two phases.
• Initialization phase.
• Constructs TC-subgraph via triangle counting routine
• Require two supersteps
• Detection phase.
• Apply index-based solution to compute the support of an edge.
• First detection iteration, scan over internal and cross edges.
• Successive detection iterations, modify local k-truss based on the
removal of external edges.
seamless
detection!
11
PETA
Efficiency
Efficiency analysis of PETA
• Computation Complexity
• It is the same as the one of best-known serial algorithms, 𝑂(𝑚1.5 ).
• Communication Complexity
• Worst case is bounded by 3|Δ|.
• The number of iteration
• It is minimal when a graph partition is given.
• Space complexity
• Worst case is bounded by 2𝑚 + 3|∆|.
• Drawback
• The worst space cost is achievable in theoretic, thus it may be infeasible
for large scale graphs.
• e.g., clique
12
Optimizations
Optimizations - I
• Edge replicating factor (𝜌)
Cross Edge
• is the average number that an edge is
replicated.
• Small 𝜌 leads to low space cost,
computation cost and communication cost.
• Small 𝜌 implies few number of iterations.
External Edge
k
l
P2
b
m
a
i
d
c
e
f
Internal Edge
𝛾 is edge cut ratio and stands
for cross edge replication.
The third term is external edge
replication, where 𝜃(𝑒) is the
support of an edge.
13
Optimizations
Optimizations - II
• Edge-Support Law in real-world graph
• The frequency distribution of the initial support of edges follows Power-Law.
14
Optimizations
Optimizations - III
• Edge-balanced partition strategy
• Improve the performance of the algorithm further.
• Use METIS to generate a “good” partition.
• Since METIS is unable to balance the core edges directly, we assign each vertex’s degree as its
weights, and balance the degree as an indicator for core edge balance.
Graph
E[θ(e)]
ρest
ρrand
ρmetis
livejournal
20.00
20.74
8.99
1.77
us-patent
2.36
3.25
3.13
1.19
wikitalk
5.93
7.53
4.52
3.31
dbpedia
7.61
9.11
6.77
2.10
Graphs are partitioned into 32 parts.
15
Evaluation
Evaluation
• All experiments are conducted on a cluster with 23 physical nodes.
Graph
|V|
|E|
wikitalk
2.4M
4.7M
us-patent
3.8M
16.5M
livejournal
4.8M
42.9M
dbpedia
17.2M
117.4M
• Baselines
• Cohen-MR [J. Cohen ’09]
• Orig-LQ [L. Quick ’12]
16
Evaluation
Efficiency
livejournal
us-patent
• On different datasets, PETA achieves 5x to 6x speedup compared with original pregelbased solution (i.e., orig-LQ and impr-LQ).
• The performance of Cohen-MR is at least 10X slower than the best one. So, it is not
visualized for figures’ clarity.
17
Evaluation
Scalability
10-truss in dbpedia
40-truss in dbpedia
• The performance of PETA improves gracefully on both random and
METIS-based partition schemes.
18
Conclusions
• We designed an efficient parallel 𝑘-truss detection algorithm, named
PETA and thoroughly prove the advantages of PETA.
• The subgraph-oriented model has potential to improve the
performance of other complex graph analysis tasks.
• In future, we will solve other truss-related problem under this
framework.
19
Q&A
20
Backup Expr. – the number of iterations
K-truss
Orig-LQ
Cohen-MR
PETA
Random
METIS
5-truss
2212(503)
1006(503)
21
9
10-truss
272(68)
136(68)
23
14
40-truss
112(28)
56(28)
14
6
The number of iterations for k-truss detection on dbpedia
21