Parallel Graph Algorithms
Kamesh Madduri
[email protected]

Talk Outline
• Applications
• Parallel algorithm building blocks
  – Kernels
  – Data structures
• Parallel algorithm case studies
  – Connected components
  – BFS/Shortest paths
  – Betweenness centrality
• Performance on current systems
  – Software
  – Architectures
  – Performance trends

Routing in transportation networks
• Road networks, point-to-point shortest paths: from 15 seconds (naïve) to 10 microseconds.
H. Bast et al., "Fast Routing in Road Networks with Transit Nodes", Science, 27 April 2007.

Internet and the WWW
• The world-wide web can be represented as a directed graph
  – Web search and crawl: traversal
  – Link analysis, ranking: PageRank and HITS
  – Document classification and clustering
• Internet topologies (router networks) are naturally modeled as graphs

Scientific Computing
• Reorderings for sparse solvers
  – Fill-reducing orderings: partitioning, traversals, eigenvectors
  – Heavy diagonal to reduce pivoting (matching)
• Data structures for efficient exploitation of sparsity
• Derivative computations for optimization
  – Matroids, graph colorings, spanning trees
• Preconditioning
  – Incomplete factorizations
  – Partitioning for domain decomposition
  – Graph techniques in algebraic multigrid: independent sets, matchings, etc.
  – Support theory: spanning trees and graph embedding techniques
B. Hendrickson, "Graphs and HPC: Lessons for Future Architectures", http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf

Large-scale data analysis
• Graph abstractions are very useful for analyzing complex data sets.
• Sources of data: petascale simulations, experimental devices, the Internet, sensor networks
• Challenges: data size, heterogeneity, uncertainty, data quality
  – Astrophysics: massive datasets, temporal variations
  – Bioinformatics: data quality, heterogeneity
  – Social informatics: new analytics challenges, data uncertainty
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2, 3) www.visualComplexity.com

Data Analysis and Graph Algorithms in Systems Biology
• Study of the interactions between the various components in a biological system
• Graph-theoretic formulations are pervasive:
  – Predicting new interactions: modeling
  – Functional annotation of novel proteins: matching, clustering
  – Identifying metabolic pathways: paths, clustering
  – Identifying new protein complexes: clustering, centrality
Image source: Giot et al., "A Protein Interaction Map of Drosophila melanogaster", Science 302, 1722-1736, 2003.

Graph-theoretic problems in social networks
• Community identification: clustering
• Targeted advertising: centrality
• Information spreading: modeling
Image source: Nexus (Facebook application)

Network Analysis for Intelligence and Surveillance
• [Krebs '04] Post-9/11 terrorist network analysis from public-domain information
• Plot masterminds correctly identified from interaction patterns: centrality
• A global view of entities is often more insightful
• Detect anomalous activities by exact/approximate graph matching
Image sources: http://www.orgnet.com/hijackers.html; T. Coffman, S. Greenblatt, S. Marcus, "Graph-based technologies for intelligence analysis", CACM 47(3), March 2004, pp. 45-47.
Characterizing Graph-theoretic computations
• Input: a graph abstraction
• Problem: find paths, clusters, partitions, matchings, patterns, orderings, ...
• Graph kernels: traversal, shortest path algorithms, flow algorithms, spanning tree algorithms, topological sort, ...
• Factors that influence the choice of algorithm:
  – graph sparsity (m/n ratio)
  – static/dynamic nature
  – weighted/unweighted, weight distribution
  – vertex degree distribution
  – directed/undirected
  – simple/multi/hyper graph
  – problem size
  – granularity of computation at nodes/edges
  – domain-specific characteristics
• Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations.

Talk Outline (repeated from above)

Parallel Computing Models: A Quick PRAM Review
• Objectives
  – Bridge between software and hardware: general-purpose, scalable hardware and transportable software
  – Abstract architecture for algorithm development
• Why is it so important?
  – Uniprocessor: the von Neumann model of computation
  – Parallel processors and multicore: which model?
• Requirements (an inherent tension):
  – Simple enough to make analysis of interesting problems tractable
  – Detailed enough to reveal the important bottlenecks
• Models, e.g.:
  – PRAM: rich collection of parallel graph algorithms
  – BSP: some CGM algorithms (cgmGraph)
  – LogP

PRAM
• An ideal model of a parallel computer for analyzing the efficiency of parallel algorithms.
• A PRAM is composed of:
  – P unmodifiable programs, each composed of optionally labeled instructions
  – a single shared memory composed of a sequence of words, each capable of containing an arbitrary integer
  – P accumulators, one associated with each program
  – a read-only input tape
  – a write-only output tape
• No local memory in each RAM.
• Synchronization, communication, and parallel overhead are zero.

PRAM Data Access Forms
• EREW (Exclusive Read, Exclusive Write)
  – A memory cell can be read or written by at most one processor per cycle
  – Ensures no read or write conflicts
• CREW (Concurrent Read, Exclusive Write)
  – Ensures there are no write conflicts
• CRCW (Concurrent Read, Concurrent Write)
  – Requires some conflict resolution scheme

PRAM Pros and Cons
• Pros
  – Simple and clean semantics
  – The majority of theoretical parallel algorithms are specified with the PRAM model
  – Independent of the communication network topology
• Cons
  – Not realistic: the communication model is too powerful
  – The algorithm designer is misled into using inter-processor communication without hesitation
  – Synchronized processors
  – No local memory
  – Big-O notation is often misleading
Analyzing Parallel Graph Algorithms
• Problem parameters: n (vertices), m (edges), D (graph diameter)
• Worst-case running time: T
• Total number of operations (work): W
• Nick's Class (NC): the complexity class of problems that can be solved in polylogarithmic time using a polynomial number of processors
• P-complete: inherently sequential

The Helman-JaJa model
• An extension of the PRAM model for shared-memory algorithm design and analysis.
• T(n, p) is measured by the triplet (T_M(n, p), T_C(n, p), B(n, p)):
  – T_M(n, p): maximum number of non-contiguous main memory accesses required by any processor
  – T_C(n, p): upper bound on the maximum local computational complexity of any of the processors
  – B(n, p): number of barrier synchronizations

Building blocks of classical PRAM graph algorithms
• Prefix sums
• List ranking
  – Euler tours, pointer jumping, symmetry breaking
• Sorting
• Tree contraction

Prefix Sums
• Input: A, an array of n elements, and an associative binary operation ⊕
• Output: the prefix sums A(1) ⊕ A(2) ⊕ ... ⊕ A(i), for 1 ≤ i ≤ n
• O(n) work, O(log n) time with n processors
[Figure: the balanced binary tree of the work-efficient prefix-sums computation on 8 elements, with up-sweep partial sums B(h, i) and down-sweep prefix values C(h, i).]

Parallel Prefix
• X: an array of n elements stored in arbitrary order.
• For each element i, let X(i).value be its value and X(i).next be the index of its successor.
• For a binary associative operator Θ, compute X(i).prefix such that
  – X(head).prefix = X(head).value, and
  – X(i).prefix = X(i).value Θ X(predecessor).prefix,
  where head is the first element, i is not equal to head, and predecessor is the node preceding i.
• List ranking: the special case of parallel prefix in which all values are initially set to 1 and addition is the associative operator.

List ranking illustration
[Figure: an ordered list, where the X.next values follow array order (2 3 4 5 6 7 8 9), versus a random list with scattered X.next values (4 6 5 7 8 3 2 9).]

List Ranking key idea
1. Chop X randomly into s pieces.
2. Traverse each piece using a serial algorithm.
3. Compute the global rank of each element using the results of the second step.
• Locality (list ordering) determines performance.
• In the Helman-JaJa model, T_M(n, p) = O(n/p).
(A pointer-jumping code sketch of list ranking follows the next slide.)

An example higher-level algorithm
The Tarjan-Vishkin biconnected components algorithm: O(log n) time, O(m+n) work.
1. Compute a spanning tree T for the input graph G.
2. Compute an Eulerian circuit for T.
3. Root the tree at an arbitrary vertex.
4. Compute a preorder numbering of all the vertices.
5. Label edges using the vertex numbering.
6. Find connected components using the Shiloach-Vishkin algorithm.
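To make the list-ranking kernel concrete, the following is a minimal C sketch of list ranking by pointer jumping (Wyllie's PRAM algorithm), written with OpenMP loops. It performs O(n log n) work, unlike the work-optimal splitter-based scheme outlined on the "List Ranking key idea" slide; the function name, the OpenMP pragmas, and the -1 end-of-list sentinel are illustrative assumptions rather than anything taken from the slides.

```c
#include <stdlib.h>
#include <string.h>

/* next[i]: index of the successor of element i, or -1 for the tail.
 * rank[i]: on return, the number of links from i to the tail
 *          (list ranking with all values set to 1 and + as the operator). */
void list_rank_pointer_jumping(int n, const int *next, int *rank)
{
    int *succ  = malloc(n * sizeof(int));
    int *nsucc = malloc(n * sizeof(int));
    int *nrank = malloc(n * sizeof(int));

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        succ[i] = next[i];
        rank[i] = (next[i] == -1) ? 0 : 1;   /* one hop per live link */
    }

    /* ceil(log2 n) rounds; each round doubles the distance spanned by succ[i]. */
    for (int span = 1; span < n; span *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            if (succ[i] != -1) {
                nrank[i] = rank[i] + rank[succ[i]];
                nsucc[i] = succ[succ[i]];
            } else {
                nrank[i] = rank[i];
                nsucc[i] = -1;
            }
        }
        /* Copy back so that every round reads a consistent snapshot (EREW-style). */
        memcpy(rank, nrank, n * sizeof(int));
        memcpy(succ, nsucc, n * sizeof(int));
    }

    free(succ); free(nsucc); free(nrank);
}
```

The double-buffering with nrank/nsucc mirrors the synchronous PRAM rounds. The Helman-JaJa splitter scheme of the slide replaces the log n global rounds with one serial traversal per sublist plus a small amount of global bookkeeping, which is why its T_M(n, p) is O(n/p).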
Data structures: graph representation
• Dense graphs (m = O(n²)): an adjacency matrix is commonly used.
• Sparse graphs: adjacency lists, similar to the CSR sparse matrix format.
• Dynamic sparse graphs: we need to support edge and vertex membership queries, insertions, and deletions.
  – The representation should be space-efficient, with low synchronization overhead.
• Several different representations are possible:
  – Resizable adjacency arrays
  – Adjacency arrays, sorted by vertex identifiers
  – Adjacency arrays for low-degree vertices, heap-based structures for high-degree vertices (for sparse graphs with skewed degree distributions)

Data structures in (Parallel) Graph Algorithms
• A wide range of ADTs appear in graph algorithms: array, list, queue, stack, set, multiset, tree.
• ADT implementations are typically array-based for performance reasons.
• Key data structure considerations in parallel graph algorithm design:
  – Practical parallel priority queues
  – Space efficiency
  – Parallel set/multiset operations, e.g., union, intersection, etc.

Talk Outline (repeated from above)

Connected Components
• A building block for many graph algorithms
  – Minimum spanning tree, spanning tree, planarity testing, etc.
• Representative of the "graft-and-shortcut" approach
• CRCW PRAM algorithms
  – [Shiloach & Vishkin '82]: O(log n) time, O((m+n) log n) work
  – [Gazit '91]: randomized, optimal, O(log n) time
• CREW algorithms
  – [Han & Wagner '90]: O(log² n) time, O((m + n log n) log n) work

Shiloach-Vishkin algorithm
• Input: n isolated vertices and m PRAM processors.
• Each processor P_i grafts a tree rooted at vertex v_i onto the tree containing one of its neighbors u, under the constraint u < v_i.
• Grafting creates k ≥ 1 connected subgraphs; each subgraph is then shortcut so that the depth of its trees is reduced by at least half.
• Repeat graft and shortcut until no more grafting is possible.
• Runs on an arbitrary CRCW PRAM in O(log n) time with O(m) processors.
• Helman-JaJa model: T_M = (3m/p + 2) log n, B = 4 log n.

SV pseudo-code
• Input: (1) a set of m edges (i, j) given in arbitrary order; (2) array D[1..n] with D[i] = i.
• Output: array D[1..n], with D[i] the component to which vertex i belongs.

  begin
    while true do
      1. for (i, j) ∈ E in parallel do
           if D[i] = D[D[i]] and D[j] < D[i] then D[D[i]] = D[j];
      2. for (i, j) ∈ E in parallel do
           if i belongs to a star and D[j] ≠ D[i] then D[D[i]] = D[j];
      3. if all vertices are in rooted stars then exit;
         for all i in parallel do D[i] = D[D[i]];
  end

SV Illustration
[Figure: a four-vertex example. In the first iteration, edges (1,4) and (2,3) trigger grafts and the resulting trees are shortcut; in the second iteration the remaining trees collapse so that each component is a single rooted star.]
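A minimal shared-memory rendering of the SV pseudo-code above is sketched below, assuming the edge list is given as two arrays src[]/dst[] and using OpenMP for the parallel loops. It follows the common practical graft-and-shortcut variant (hook roots toward smaller labels, then fully compress) and omits the explicit star test of the PRAM formulation, so it should be read as Shiloach-Vishkin in spirit rather than a literal transcription.

```c
#include <stdbool.h>

/* D[v] converges to the smallest vertex label in v's component. */
void connected_components_sv(int n, long m, const int *src, const int *dst, int *D)
{
    #pragma omp parallel for
    for (int v = 0; v < n; v++)
        D[v] = v;                                   /* n singleton trees */

    bool grafted = true;
    while (grafted) {
        grafted = false;

        /* Graft step: hook the root of the higher-labelled tree onto the lower
         * label, mirroring "if D[i] = D[D[i]] and D[j] < D[i] then D[D[i]] = D[j]". */
        #pragma omp parallel for reduction(||:grafted)
        for (long e = 0; e < m; e++) {
            int i = src[e], j = dst[e];
            int di = D[i], dj = D[j];
            if (di < dj && dj == D[dj]) { D[dj] = di; grafted = true; }
            if (dj < di && di == D[di]) { D[di] = dj; grafted = true; }
        }

        /* Shortcut step: pointer-jump every vertex to its current root. */
        #pragma omp parallel for
        for (int v = 0; v < n; v++)
            while (D[v] != D[D[v]])
                D[v] = D[D[v]];
    }
}
```

Concurrent grafts may overwrite one another, but every write strictly decreases a root's label, so the loop still terminates with D[v] equal to the minimum label in v's component; the original PRAM algorithm uses the star test and the unconditional shortcut to bound the number of graft-and-shortcut iterations by O(log n).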
Talk Outline (repeated from above)

Parallel Single-source Shortest Paths (SSSP) algorithms
• No PRAM algorithm is known that runs in sub-linear time and O(m + n log n) work.
• Parallel priority queues: relaxed heaps [DGST88], [BTZ98]
• Ullman-Yannakakis randomized approach [UY90]
• Meyer et al.'s ∆-stepping algorithm [MS03]
• Distributed-memory implementations based on graph partitioning
• Heuristics for load balancing and termination detection
K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, "An Experimental Study of a Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances", Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.

∆-stepping algorithm [MS03]
• A label-correcting algorithm: edges may be relaxed from unsettled vertices as well.
• ∆-stepping is an "approximate bucket implementation of Dijkstra's algorithm".
• ∆: the bucket width.
• Vertices are ordered using buckets, each representing a priority range of size ∆.
• Each bucket may be processed in parallel.
• Edges are classified as "heavy" (weight > ∆) and "light" (weight ≤ ∆).
• Light edges are relaxed in phases, repeated until the current bucket B[i] is empty.
• Heavy edges are then relaxed once; there are no reinsertions into the current bucket in this step.

One parallel phase, while the current bucket is non-empty:
  i)   Inspect light edges.
  ii)  Construct a set of "requests" (R).
  iii) Clear the current bucket.
  iv)  Remember the deleted vertices (S).
  v)   Relax the request pairs in R.
After the bucket empties, relax the heavy request pairs (from S), then go on to the next bucket.

∆-stepping illustration
[Figure sequence: a seven-vertex example with ∆ = 0.1 and edge weights such as 0.01, 0.02, 0.05, 0.07, 0.13, 0.15, 0.18, 0.23, and 0.56. Starting from the initialization d(s) = 0 and s inserted into bucket 0, each step shows the tentative-distance array d, the bucket contents, the request set R, and the deleted-vertex set S, ending with the final distances d = (0, .03, .01, .06, .16, .29, .62) for vertices 0 through 6.]
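To pin down the bucket mechanics, here is a compact sequential C sketch of ∆-stepping. It serializes the phases that the parallel algorithm would execute concurrently (all light-edge requests of a bucket, then all heavy-edge requests), uses lazy deletion for stale bucket entries, and assumes edge weights normalized to [0, 1] (as in the experiments that follow) so the bucket array can be sized up front; the struct and function names are illustrative assumptions.

```c
#include <stdlib.h>

#define INF 1e30

/* CSR-style weighted graph: off[] has n+1 entries, adj[]/w[] have one entry
 * per directed edge.  Weights are assumed to lie in [0, 1]. */
typedef struct { int n; const int *off; const int *adj; const double *w; } wgraph_t;

typedef struct { int *v; int size, cap; } bucket_t;

static void bucket_push(bucket_t *b, int x) {
    if (b->size == b->cap) {
        b->cap = b->cap ? 2 * b->cap : 4;
        b->v = realloc(b->v, b->cap * sizeof(int));
    }
    b->v[b->size++] = x;
}

/* A "request" (v, dist): move v to a lower bucket if dist improves d[v].
 * Stale copies left behind in old buckets are skipped when scanned. */
static void relax(int v, double dist, double *d, bucket_t *B, double delta) {
    if (dist < d[v]) {
        d[v] = dist;
        bucket_push(&B[(int)(dist / delta)], v);
    }
}

void delta_stepping(const wgraph_t *g, int src, double delta, double *d) {
    int nb = (int)(g->n / delta) + 2;              /* distances stay below n */
    bucket_t *B = calloc(nb, sizeof(bucket_t));
    char *in_S = calloc(g->n, 1);                  /* membership flags for S */
    int *S = malloc(g->n * sizeof(int));

    for (int v = 0; v < g->n; v++) d[v] = INF;
    relax(src, 0.0, d, B, delta);

    for (int i = 0; i < nb; i++) {
        int ns = 0;
        while (B[i].size > 0) {                    /* one light-edge "phase" per pass */
            bucket_t cur = B[i];
            B[i].v = NULL; B[i].size = B[i].cap = 0;          /* clear the bucket */
            for (int k = 0; k < cur.size; k++) {
                int u = cur.v[k];
                if ((int)(d[u] / delta) != i) continue;       /* stale entry */
                if (!in_S[u]) { in_S[u] = 1; S[ns++] = u; }   /* remember deletion */
                for (int e = g->off[u]; e < g->off[u + 1]; e++)
                    if (g->w[e] <= delta)          /* light edges may re-enter B[i] */
                        relax(g->adj[e], d[u] + g->w[e], d, B, delta);
            }
            free(cur.v);
        }
        for (int k = 0; k < ns; k++) {             /* heavy edges: once, no reinsertion */
            int u = S[k];
            in_S[u] = 0;
            for (int e = g->off[u]; e < g->off[u + 1]; e++)
                if (g->w[e] > delta)
                    relax(g->adj[e], d[u] + g->w[e], d, B, delta);
        }
    }
    free(B); free(S); free(in_S);
}
```

In the parallel formulation of [MS03] and the ALENEX'07 study above, the loops over the current bucket and over the request set R become parallel loops; ∆ trades off extra work (re-relaxations of light edges) against the number of phases measured in the performance counts that follow.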
No. of phases (machine-independent performance count)
[Chart: number of phases for the graph families Rnd-rnd, Rnd-logU, Scale-free, LGrid-rnd, LGrid-logU, SqGrid, USAd NE, and USAt NE; high-diameter families require far more phases than low-diameter families.]

Average shortest path weight for various graph families
(~2^20 vertices, 2^22 edges, directed graphs, edge weights normalized to [0, 1])
[Chart: average shortest path weight across the same graph families, spanning roughly 0.01 to 100000.]

Last non-empty bucket (machine-independent performance count)
[Chart: index of the last non-empty bucket across the graph families; fewer buckets means more parallelism.]

Number of bucket insertions (machine-independent performance count)
[Chart: number of bucket insertions across the graph families, up to roughly 12 million.]

Talk Outline (repeated from above)

Betweenness Centrality
• Centrality: a quantitative measure of the importance of a vertex or edge in a graph
  – degree, closeness, eigenvalue, betweenness
• Betweenness centrality:
  BC(v) = Σ_{s ≠ v ≠ t} σ_st(v) / σ_st
  (σ_st: number of shortest paths between s and t; σ_st(v): number of those paths passing through v)
• Applied to several real-world networks:
  – Social interactions
  – WWW
  – Epidemiology
  – Systems biology

Algorithms for Computing Betweenness
• All-pairs shortest path approach: compute the length and number of shortest paths between all s-t pairs (O(n³) time), then sum up the fractional dependency values (O(n²) space).
• Brandes' algorithm (2001): augment a single-source shortest path computation to count paths; uses the Bellman criterion; O(mn) work and O(m+n) space.
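For reference, the quantities used on the following slides can be written out in full. This is the standard Brandes formulation; the recurrence in the second line is exactly what the dependency-accumulation step later implements.

```latex
\[
  BC(v) \;=\; \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}},
  \qquad
  \delta_s(v) \;=\; \sum_{t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}
  \quad\text{(the dependency of $s$ on $v$)},
\]
\[
  \delta_s(v) \;=\; \sum_{w \,:\, v \in P_s(w)} \frac{\sigma_{sv}}{\sigma_{sw}}\,\bigl(1 + \delta_s(w)\bigr),
  \qquad
  BC(v) \;=\; \sum_{s \neq v} \delta_s(v),
\]
```

where P_s(w) is the set of predecessors of w on shortest paths from s, and σ_sv is the number of shortest paths from s to v.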
Our New Parallel Algorithms
• Madduri and Bader (2006): parallel algorithms for computing exact and approximate betweenness centrality
  – for low-diameter sparse graphs (diameter D = O(log n), m = O(n log n))
  – Exact algorithm: O(mn) work, O(m+n) space, O(nD) time in the PRAM model, or O(nD + nm/p) time.
• Madduri et al. (2009): a new parallel algorithm with lower synchronization overhead and fewer non-contiguous memory references
  – In practice, 2-3x faster than the previous algorithm
  – Lock-free, so it scales better on large parallel systems

Parallel BC Algorithm
• Consider an undirected, unweighted graph.
• High-level idea: a level-synchronous parallel breadth-first search, augmented to compute centrality scores.
• Two steps:
  – traversal and path counting
  – dependency accumulation: δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) (1 + δ(w))

Parallel BC Algorithm Illustration: data structures
• G (size m+n): read-only graph in adjacency-array representation
• BC (size n): centrality score of each vertex
• S (n): stack of visited vertices
• Visited (n): marks used to check whether a vertex has been visited
• D (n): distance of each vertex from the source s
• Sigma (n): number of shortest paths through each vertex
• Delta (n): partial dependency score for each vertex
• P (m+n): multiset of predecessors of each vertex along shortest paths
• Space requirement: 8(m + 6n) bytes
[Figure: the ten-vertex example graph (vertices 0-9) used in the following traversal illustration.]
(A C sketch of these working arrays follows below.)

Parallel BC Algorithm Illustration: traversal step
1. Traversal step: visit adjacent vertices, update distances and path counts.
[Figure sequence: level-synchronous BFS from the source vertex on the example graph. Frontier by frontier, newly visited vertices are pushed onto the stack S, their distances D and path counts Sigma are set, and their predecessor multisets P are filled in; the adjacencies of all vertices in the current frontier can be visited in parallel. At the end of the traversal we have all reachable vertices, their corresponding predecessor multisets, and their D values.]
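As a concrete counterpart to the data-structure list above, the following C sketch declares a CSR-style graph and allocates the per-source working arrays (the successor-list variant described a few slides later replaces the predecessor multiset P with successor storage). The struct and field names are illustrative assumptions; only the array sizes come from the slide.

```c
#include <stdint.h>
#include <stdlib.h>

/* Read-only CSR-style adjacency-array graph, size O(m + n). */
typedef struct {
    int64_t n, m;          /* number of vertices and stored directed edges */
    int64_t *offsets;      /* n + 1 entries */
    int64_t *adj;          /* m entries */
} graph_t;

/* Working arrays for one source vertex, following the slide's list. */
typedef struct {
    int64_t *S;            /* stack of visited vertices, n entries */
    int64_t *Visited;      /* visited marks, n entries */
    int64_t *D;            /* distance from the source, n entries */
    int64_t *sigma;        /* number of shortest paths through each vertex, n entries */
    double  *delta;        /* partial dependency scores, n entries */
    int64_t *Scount;       /* per-vertex successor counts, n entries */
    int64_t *Succ;         /* successor multiset, m + n entries */
} bc_workspace_t;

bc_workspace_t bc_workspace_alloc(const graph_t *g)
{
    bc_workspace_t w;
    w.S       = malloc(g->n * sizeof(int64_t));
    w.Visited = calloc(g->n,  sizeof(int64_t));
    w.D       = malloc(g->n * sizeof(int64_t));
    w.sigma   = calloc(g->n,  sizeof(int64_t));
    w.delta   = calloc(g->n,  sizeof(double));
    w.Scount  = calloc(g->n,  sizeof(int64_t));
    w.Succ    = malloc((g->m + g->n) * sizeof(int64_t));
    return w;
}
```

With 8-byte entries this is the same order of space as the 8(m + 6n) bytes quoted on the slide; before each new source, D is reset to -1 and sigma, delta, and Scount to 0.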
Step 1 (traversal) pseudo-code

  for all vertices u at level d in parallel do
    for all adjacencies v of u in parallel do
      dv = D[v];
      if (dv < 0)                      // v is visited for the first time
        vis = fetch_and_add(&Visited[v], 1);
        if (vis == 0)                  // v is added to a stack only once
          D[v] = d + 1;
          pS[count++] = v;             // add v to the local thread stack
        fetch_and_add(&sigma[v], sigma[u]);
        fetch_and_add(&Pcount[v], 1);  // add u to the predecessor list of v
      if (dv == d + 1)
        fetch_and_add(&sigma[v], sigma[u]);
        fetch_and_add(&Pcount[v], 1);  // add u to the predecessor list of v

Graph traversal step analysis
• Exploit concurrency in visiting adjacencies, since we assume the graph diameter is small: O(log n).
• Upper bound on the size of each predecessor multiset: the in-degree of the vertex.
• Potential performance bottlenecks: atomic updates to the predecessor multisets, atomic increments of the path counts.
• New algorithm: based on the observation that we do not need to store "predecessor" vertices; instead, we store successor edges along shortest paths.
  – simplifies the accumulation step
  – removes one atomic operation from the traversal step
  – cache-friendly!

Modified Step 1 pseudo-code

  for all vertices u at level d in parallel do
    for all adjacencies v of u in parallel do
      dv = D[v];
      if (dv < 0)                      // v is visited for the first time
        vis = fetch_and_add(&Visited[v], 1);
        if (vis == 0)                  // v is added to a stack only once
          D[v] = d + 1;
          pS[count++] = v;             // add v to the local thread stack
        fetch_and_add(&sigma[v], sigma[u]);
        fetch_and_add(&Scount[u], 1);  // add v to the successor list of u
      if (dv == d + 1)
        fetch_and_add(&sigma[v], sigma[u]);
        fetch_and_add(&Scount[u], 1);  // add v to the successor list of u

Graph traversal step: locality analysis
• The vertices of the current level are in a contiguous block (the stack), and all the adjacencies of a vertex are stored compactly (graph representation), so the two parallel loops stream through contiguous memory.
• The read of D[v] and the fetch_and_add operations on Visited[v] and sigma[v] are non-contiguous memory accesses.
• The fetch_and_add on Scount[u] is a store local to the current frontier vertex u.
• Better cache utilization is likely if D[v], Visited[v], and sigma[v] are stored contiguously.
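As a sketch of how the modified Step 1 pseudo-code maps onto a real shared-memory loop nest, the function below expands one BFS level using OpenMP and the GCC/Clang __sync_fetch_and_add builtin in place of fetch_and_add. It reuses the graph_t and workspace arrays from the earlier sketch and, to stay short, gathers the next frontier into a single shared array with an atomic counter instead of the per-thread stacks (pS) of the slides; the successor lists themselves are reduced to the Scount increments.

```c
/* Expand all vertices at distance d (the current frontier); returns the size of
 * the next frontier.  D[] must be -1 for unvisited vertices and sigma[src] = 1. */
static int64_t expand_level(const graph_t *g, int64_t d,
                            const int64_t *frontier, int64_t frontier_size,
                            int64_t *next_frontier,
                            int64_t *D, int64_t *Visited,
                            int64_t *sigma, int64_t *Scount)
{
    int64_t next_size = 0;

    #pragma omp parallel for schedule(dynamic, 64)
    for (int64_t i = 0; i < frontier_size; i++) {
        int64_t u = frontier[i];
        for (int64_t e = g->offsets[u]; e < g->offsets[u + 1]; e++) {
            int64_t v = g->adj[e];
            int64_t dv = D[v];
            if (dv < 0) {                                   /* v seen for the first time? */
                if (__sync_fetch_and_add(&Visited[v], 1) == 0) {
                    D[v] = d + 1;
                    int64_t pos = __sync_fetch_and_add(&next_size, 1);
                    next_frontier[pos] = v;                 /* v enters the stack only once */
                }
                __sync_fetch_and_add(&sigma[v], sigma[u]);  /* u's paths extend to v */
                __sync_fetch_and_add(&Scount[u], 1);        /* v is a successor of u */
            } else if (dv == d + 1) {                       /* another shortest path to v */
                __sync_fetch_and_add(&sigma[v], sigma[u]);
                __sync_fetch_and_add(&Scount[u], 1);
            }
        }
    }
    return next_size;
}
```

The non-contiguous accesses flagged on the locality slide are exactly the D[v], Visited[v], and sigma[v] touches in the inner loop; Scount[u] is a store local to the frontier vertex, which is why switching from predecessor to successor storage turns one scattered atomic per edge into a local one.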
Parallel BC Algorithm Illustration: accumulation step
2. Accumulation step: pop vertices from the stack and update dependency scores using
   δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) (1 + δ(w)).
   The accumulation can also be done in a level-synchronous manner.
[Figure sequence: the example graph with the stack S, the predecessor multisets P, and the Delta values as dependencies are accumulated from the deepest level back toward the source.]

Step 2 (accumulation) pseudo-code

  for level d = GraphDiameter to 2 do
    for all vertices w at level d in parallel do
      for all v in P[w] do
        acquire_lock(v);
        delta[v] = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];
        release_lock(v);
      BC[w] = delta[w]

Modified Step 2 pseudo-code (with successor lists)

  for level d = GraphDiameter-2 to 1 do
    for all vertices v at level d in parallel do
      for all w in S[v] in parallel do   // reduction on delta
        delta[v] = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];
      BC[v] = delta[v]

  i.e., δ(v) = Σ_{w ∈ S(v)} (σ(v) / σ(w)) (1 + δ(w)).

Accumulation step: locality analysis
• The vertices at each level are in a contiguous block (the stack), and each successor list S[v] is a contiguous block.
• The inner loop over S[v] is a parallel reduction into delta[v].
• The multiply-add delta[v] + (1 + delta[w]) * sigma[v]/sigma[w] is the only floating-point operation in the code.

Digression: centrality analysis applied to protein interaction networks
[Scatter plot: degree vs. betweenness centrality for the human genome core protein interaction network. The highlighted protein, Ensembl ID ENSG00000145332.2 (Kelch-like protein 8), has only 43 interactions yet very high betweenness centrality.]

Talk Outline (repeated from above)

Graph topology matters
• Information networks are very different from the graph topologies and computations that arise in scientific computing.
  – Informatics: dynamic, high-dimensional data
  – Scientific computing: static networks, Euclidean topologies
• Classical graph algorithms typically assume a uniform random graph topology.
Image sources: visualcomplexity.com (1, 2), MapQuest (3)

"Small-world" complex networks
• Information networks are typically dynamic graph abstractions drawn from diverse data sources
• High-dimensional data
• Skewed ("power law") degree distribution of the number of neighbors
• Low graph diameter (Kevin Bacon and six degrees of separation)
• Massive networks (billions of entities)
[Chart: degree distribution of the human protein interaction data set (18669 proteins, 43568 interactions), frequency vs. degree on log-log axes.]
Image source: Seokhee Hong

Implementation Challenges
• Execution time is dominated by latency to main memory
  – Large memory footprint
  – Large number of irregular memory accesses
• Essentially no computation to hide memory costs
• Poor performance on current cache-based architectures (< 5-10% of peak achieved)
• Memory access pattern is dependent on the graph topology
• Variable degrees of concurrency in the parallel algorithm

Desirable HPC Architectural Features
• A global shared memory abstraction
  – no need to partition the graph
  – supports dynamic updates
• A high-bandwidth, low-latency network
• Ability to exploit fine-grained parallelism
• Support for light-weight synchronization
• HPC systems with these characteristics:
  – Massively multithreaded architectures
  – Symmetric multiprocessors

Performance Results: Test Platforms
Cray XMT
• Latency tolerance through massive multithreading
  – hardware support for 128 threads on each processor
  – globally hashed address space
  – no data cache
  – single-cycle context switch
  – multiple outstanding memory requests
• Support for fine-grained, word-level synchronization
• 16 x 500 MHz processors, 128 GB RAM
Sun Fire T5120
• Sun Niagara2: cache-based multicore server with chip multithreading
• 1 socket x 8 cores x 8 threads per core
• 8 KB private L1 cache per core, 4 MB shared L2 cache
• 1167 MHz processor, 32 GB RAM

Betweenness Centrality Performance
• Approximate betweenness computation on a synthetic small-world network of 256 million vertices and 2 billion edges.
• TEPS (traversed edges per second): the performance rate for the centrality computation.
[Chart: betweenness TEPS rate (millions of edges per second) vs. number of processors/cores (1-16) for the Cray XMT, Sun UltraSPARC T2, and a 2.0 GHz quad-core Xeon; annotations note that BC-new is 2.2x-2.4x faster than the previous (BC-old) approach.]

SNAP: Small-world Network Analysis and Partitioning
• A new parallel framework for complex graph analysis
• 10-100x faster than existing approaches
• Can process graphs with billions of vertices and edges
• Open source: snap-graph.sourceforge.net
Image source: visualcomplexity.com

SNAP: Compact graph representations for dynamic network analysis
• New graph representations for dynamically evolving small-world networks in SNAP.
• We support fast, parallel structural updates to low-diameter scale-free and small-world graphs.
[Chart: execution time per update (nanoseconds) and relative speedup vs. number of threads (1-32) for a graph with 25M vertices and 200M edges on a Sun Fire T2000.]
SNAP: Induced Subgraphs Performance
• Induced subgraph extraction is a key kernel for dynamic graph computations.
• We reduce the execution time of linear-work kernels from minutes to seconds for massive small-world networks (billions of vertices and edges).
[Chart: execution time (seconds) and relative speedup vs. number of threads (1-32) for a graph with 500M vertices and 2B edges on an IBM p5 570 SMP.]

Large-scale Graph Traversal
• Multithreaded BFS [BM06]: random graph, 256M vertices, 1B edges; 2.3 sec (40 processors), 73.9 sec (1 processor) on the MTA-2; processes all low-diameter graph families.
• External-memory BFS [ADMO06]: random graph, 256M vertices, 1B edges; 8.9 hrs (3.2 GHz Xeon); state-of-the-art external-memory BFS.
• Multithreaded SSSP [MBBC06]: random graph, 256M vertices, 1B edges; 11.96 sec (40 processors) on the MTA-2; works well for all low-diameter graph families.
• Parallel Dijkstra [EBGL06]: random graph, 240M vertices, 1.2B edges; 180 sec on a 96-processor 2.0 GHz cluster; the best known distributed-memory SSSP implementation for large-scale graphs.

Optimizations for real-world graphs
• Preprocessing kernels (connected components, biconnected components, sparsification) significantly reduce computation time.
  – e.g., a high number of isolated and degree-1 vertices
• Store BFS/shortest-path trees from high-degree vertices and reuse them.
• Typically a 3-5x performance improvement.
• Exploit small-world network properties (low graph diameter):
  – Load balancing in the level-synchronous parallel BFS algorithm
  – SNAP data structures are optimized for unbalanced degree distributions

Faster Community Identification Algorithms in SNAP
• Speedups from algorithm engineering (approximate BC) and from parallelization (Sun Fire T2000) are multiplicative.
• 100-300x overall performance improvement over the Girvan-Newman approach.
[Chart: performance improvement over the Girvan-Newman approach, split into algorithm-engineering and parallelization contributions, for the PPI, Citations, DBLP, NDwww, and Actor networks (real-world networks with millions of vertices and edges), on a Sun Fire T2000.]

Large-scale graph analysis: current research
• A synergistic combination of novel graph algorithms, high-performance computing, and algorithm engineering.
[Diagram: graph problems on dynamic network abstractions at the center, surrounded by novel algorithmic approaches (classical graph algorithms, data stream algorithms, spectral techniques, dynamic graph algorithms), complex network analysis, empirical studies, and realistic modeling, and enabling technologies (many-core processors, affordable exascale data storage, stream computing).]

Review of lecture
• Applications: Internet and WWW, scientific computing, data analysis, surveillance
• Parallel algorithm building blocks
  – Kernels: PRAM algorithms, prefix sums, list ranking
  – Data structures: graph representations, priority queues
• Parallel algorithm case studies
  – Connected components: graft and shortcut
  – BFS: level-synchronous approach
  – Shortest paths: parallel priority queues
  – Betweenness centrality: parallel algorithm with pseudo-code
• Performance on current systems
  – Software: SNAP, the Boost Graph Library, igraph, NetworkX, Network Workbench
  – Architectures: Cray XMT, cache-based multicore, SMPs
  – Performance trends

Thank you!
• Questions?