Parallel Graph Algorithms
Kamesh Madduri
[email protected]
Lawrence Berkeley National Laboratory
CS267, Spring 2011
April 7, 2011
Lecture Outline
• Applications
• Designing parallel graph algorithms, performance on
current systems
• Case studies: Graph traversal-based problems,
parallel algorithms
– Breadth-First Search
– Single-source Shortest paths
– Betweenness Centrality
Routing in transportation networks
Road networks, point-to-point shortest paths: 15 seconds (naïve) → 10 microseconds
H. Bast et al., “Fast Routing in Road Networks with Transit Nodes”, Science 27, 2007.
Internet and the WWW
• The world-wide web can be represented as a directed graph
– Web search and crawl: traversal
– Link analysis, ranking: PageRank and HITS
– Document classification and clustering
• Internet topologies (router networks) are naturally modeled
as graphs
Scientific Computing
• Reorderings for sparse solvers
– Fill-reducing orderings
◦ Partitioning, eigenvectors
– Heavy diagonal to reduce pivoting (matching)
• Data structures for efficient exploitation of sparsity
• Derivative computations for optimization
– graph colorings, spanning trees
• Preconditioning
– Incomplete factorizations
– Partitioning for domain decomposition
– Graph techniques in algebraic multigrid
◦ Independent sets, matchings, etc.
– Support theory
◦ Spanning trees & graph embedding techniques
B. Hendrickson, “Graphs and HPC: Lessons for Future Architectures”,
http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf
Image sources: Yifan Hu, “A gallery of large graphs”; Tim Davis, UF Sparse Matrix Collection.
Large-scale data analysis
• Graph abstractions are very useful to analyze complex data
sets.
• Sources of data: petascale simulations, experimental devices,
the Internet, sensor networks
• Challenges: data size, heterogeneity, uncertainty, data quality
Astrophysics: massive datasets, temporal variations
Bioinformatics: data quality, heterogeneity
Social informatics: new analytics challenges, data uncertainty
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com
Data Analysis and Graph Algorithms in Systems Biology
• Study of the interactions between
various components in a
biological system
• Graph-theoretic formulations are
pervasive:
– Predicting new interactions:
modeling
– Functional annotation of novel
proteins: matching, clustering
– Identifying metabolic pathways:
paths, clustering
– Identifying new protein complexes:
clustering, centrality
Image Source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”,
Science 302, 1722-1736, 2003.
Graph-theoretic problems in social networks
– Targeted advertising: clustering and
centrality
– Studying the spread of information
Image Source: Nexus (Facebook application)
Network Analysis for Intelligence and Surveillance
• [Krebs ’04] Post 9/11 Terrorist
Network Analysis from public
domain information
• Plot masterminds correctly
identified from interaction
patterns: centrality
Image Source: http://www.orgnet.com/hijackers.html
• A global view of entities is
often more insightful
• Detect anomalous activities
by exact/approximate
subgraph isomorphism.
Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies
for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47
Research in Parallel Graph Algorithms
• Application areas: social network analysis, WWW, computational biology, scientific computing, engineering
• Methods/problems: finding central entities, community detection, network dynamics (social network analysis); marketing, social search (WWW); gene regulation, metabolic pathways, genomics (computational biology); graph partitioning, matching, coloring (scientific computing); VLSI CAD, route planning (engineering); …
• Graph algorithms: traversal, shortest paths, connectivity, max flow, …
• Drivers: data size, problem complexity
• Architectures: GPUs, FPGAs, x86 multicore servers, massively multithreaded architectures, multicore clusters, clouds
Characterizing Graph-theoretic computations
• Input data → graph kernel → problem: find paths, clusters, partitions, matchings, patterns, orderings, …
• Graph kernels: traversal, shortest path algorithms, flow algorithms, spanning tree algorithms, topological sort, …
• Factors that influence the choice of algorithm:
– graph sparsity (m/n ratio)
– static/dynamic nature
– weighted/unweighted, weight distribution
– vertex degree distribution
– directed/undirected
– simple/multi/hyper graph
– problem size
– granularity of computation at nodes/edges
– domain-specific characteristics
• Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations.
Lecture Outline
• Applications
• Designing parallel graph algorithms, performance on
current systems
• Case studies: Graph traversal-based problems,
parallel algorithms
– Breadth-First Search
– Single-source Shortest paths
– Betweenness Centrality
History
• 1735: “Seven Bridges of Königsberg” problem, resolved by
Euler, one of the first graph theory results.
• …
• 1966: Flynn’s Taxonomy.
• 1968: Batcher’s “sorting networks”
• 1969: Harary’s “Graph Theory”
• …
• 1972: Tarjan’s “Depth-first search and linear graph algorithms”
• 1975: Reghbati and Corneil, Parallel Connected Components
• 1982: Misra and Chandy, distributed graph algorithms.
• 1984: Quinn and Deo’s survey paper on “parallel graph
algorithms”
• …
The PRAM model
• Idealized parallel shared memory system
model
• Unbounded number of synchronous processors; no synchronization or communication cost; no parallel overhead
• EREW (Exclusive Read Exclusive Write), CREW
(Concurrent Read Exclusive Write)
• Measuring performance: space and time
complexity; total number of operations (work)
PRAM Pros and Cons
• Pros
– Simple and clean semantics.
– The majority of theoretical parallel algorithms are designed
using the PRAM model.
– Independent of the communication network topology.
• Cons
– Not realistic, too powerful communication model.
– Algorithm designer is misled to use IPC without hesitation.
– Synchronized processors.
– No local memory.
– Big-O notation is often misleading.
PRAM Algorithms for Connected Components
• Reghbati and Corneil [1975]: O(log^2 n) time and O(n^3) processors
• Wyllie [1979]: O(log^2 n) time and O(m) processors
• Shiloach and Vishkin [1982]: O(log n) time and O(m) processors
• Reif and Spirakis [1982]: O(log log n) time and O(n) processors (expected)
Building blocks of classical PRAM graph algorithms
• Prefix sums
• Symmetry breaking
• Pointer jumping
• List ranking
• Euler tours
• Vertex collapse
• Tree contraction
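A minimal Python sketch of one of these building blocks, list ranking by pointer jumping (not from the lecture; the array names succ and rank are illustrative, and the O(log n) parallel rounds are simulated serially):

    # List ranking by pointer jumping: given a linked list stored as a
    # successor array, compute each node's distance to the end of the list.
    # Each round doubles the span of every pointer, so a PRAM needs only
    # O(log n) synchronous rounds; here each "parallel" round is simulated
    # by computing all new values before writing any of them back.

    def list_rank(succ):
        n = len(succ)
        rank = [0 if succ[i] == i else 1 for i in range(n)]  # tail points to itself
        nxt = list(succ)
        for _ in range(max(1, n).bit_length()):              # ~log2(n) rounds suffice
            rank = [rank[i] + rank[nxt[i]] for i in range(n)]
            nxt = [nxt[nxt[i]] for i in range(n)]
        return rank

    # Example: list 3 -> 1 -> 4 -> 0 -> 2, where 2 is the tail (succ[2] == 2)
    print(list_rank([2, 4, 2, 1, 0]))   # [1, 3, 0, 4, 2]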
Data structures: graph representation
Static case
• Dense graphs (m = O(n^2)): adjacency matrix commonly used.
• Sparse graphs: adjacency lists
Dynamic
• representation depends on common-case query
• Edge insertions or deletions? Vertex insertions or deletions?
Edge weight updates?
• Graph update rate
• Queries: connectivity, paths, flow, etc.
• Optimizing for locality a key design consideration.
Graph representation
• Compressed Sparse Row (CSR)-like: the per-vertex adjacency arrays are flattened into a single adjacency array of size 2*m (each undirected edge stored twice), with an index array of size n+1 recording where each vertex’s adjacencies begin.
[Figure: a 10-vertex example graph (vertices 0–9) with its vertex/degree/adjacency table, the index array (size n+1), and the flattened adjacency array (size 2*m).]
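A small Python sketch of building this CSR-like structure from an edge list (the function and array names build_csr, index, adj are illustrative, and the 5-vertex edge list is not the slide’s example graph):

    # Build a CSR-like representation of an undirected graph:
    # index has n+1 entries; adj has 2*m entries (each edge stored twice).
    # index[v]..index[v+1] is the slice of adj holding v's adjacencies.

    def build_csr(n, edges):
        degree = [0] * n
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        index = [0] * (n + 1)
        for v in range(n):
            index[v + 1] = index[v] + degree[v]   # prefix sum of degrees
        adj = [0] * (2 * len(edges))
        offset = list(index)                      # next free slot per vertex
        for u, v in edges:
            adj[offset[u]] = v; offset[u] += 1
            adj[offset[v]] = u; offset[v] += 1
        return index, adj

    index, adj = build_csr(5, [(0, 1), (0, 2), (1, 2), (3, 4)])
    print(index)                     # [0, 2, 4, 6, 7, 8]
    print(adj[index[0]:index[1]])    # adjacencies of vertex 0 -> [1, 2]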
Distributed Graph representation
• Each processor stores the entire graph (“full
replication”)
• Each processor stores n/p vertices and all adjacencies
out of these vertices (“1D partitioning”)
• How to create these “p” vertex partitions?
– Graph partitioning algorithms: recursively optimize for
conductance (edge cut/size of smaller partition)
– Randomly shuffling the vertex identifiers ensures that the edge count per processor is roughly the same
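A short Python sketch of the 1D option with randomly shuffled vertex identifiers (the contiguous block assignment and fixed seed are illustrative assumptions, not part of the lecture):

    import random

    # 1D partitioning: after randomly relabeling vertices, each of the p
    # processors owns a contiguous block of ~n/p vertex IDs together with
    # all adjacencies out of those vertices.

    def make_1d_owner(n, p, seed=0):
        perm = list(range(n))
        random.Random(seed).shuffle(perm)      # random relabeling of vertices
        new_id = [0] * n
        for new, old in enumerate(perm):
            new_id[old] = new
        block = (n + p - 1) // p
        return lambda v: new_id[v] // block    # processor that owns vertex v

    owner = make_1d_owner(n=1000, p=8)
    print(owner(42), owner(999))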
2D graph partitioning
• Consider a logical 2D processor grid (pr * pc = p) and
the matrix representation of the graph
• Assign each processor a sub-matrix (i.e., the edges within the sub-matrix)
[Figure: 9 vertices, 9 processors, 3x3 processor grid. The sparse adjacency matrix is tiled into sub-matrices, each processor holding one sub-matrix as its per-processor local graph representation.]
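A corresponding sketch of the 2D mapping: the adjacency matrix is tiled over a pr x pc grid and edge (u, v) lives with the processor owning the sub-matrix containing row u and column v (Python; the equal block sizes and row-major processor numbering are assumptions for illustration):

    # 2D partitioning over a pr x pc logical processor grid.

    def edge_owner_2d(u, v, n, pr, pc):
        row_block = (n + pr - 1) // pr
        col_block = (n + pc - 1) // pc
        grid_row = u // row_block              # block row holding vertex u
        grid_col = v // col_block              # block column holding vertex v
        return grid_row * pc + grid_col        # row-major rank in the grid

    # 9 vertices, 9 processors, 3x3 processor grid (as in the figure)
    print(edge_owner_2d(u=4, v=7, n=9, pr=3, pc=3))   # -> processor 5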
Data structures in (parallel) graph
algorithms
• A wide range seen in graph algorithms: array,
list, queue, stack, set, multiset, tree
• Implementations are typically array-based for
performance considerations.
• Key data structure considerations in parallel
graph algorithm design
– Practical parallel priority queues
– Space-efficiency
– Parallel set/multiset operations, e.g., union,
intersection, etc.
Graph Algorithms on current systems
• Concurrency
– Simulating PRAM algorithms: hardware limits of memory
bandwidth, number of outstanding memory references;
synchronization
• Locality
– Try to improve cache locality, but avoid “too much”
superfluous computation
• Work-efficiency
– Is (Parallel time) * (# of processors) = (Serial work)?
The locality challenge
“Large memory footprint, low spatial and temporal locality
impede performance”
Serial performance of “approximate betweenness centrality” on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8 MB L3 cache). Input: synthetic R-MAT graphs (# of edges m = 8n).
[Plot: performance rate (million traversed edges/s, 0–60) vs. problem size (log2 # of vertices, 10–26). Small problems incur no last-level cache (LLC) misses; large problems incur O(m) LLC misses, for a ~5X drop in performance.]
The parallel scaling challenge
“Classical parallel graph algorithms perform poorly on current
parallel systems”
• Graph topology assumptions in classical algorithms do not
match real-world datasets
• Parallelization strategies at loggerheads with techniques for
enhancing memory locality
• Classical “work-efficient” graph algorithms may not fully
exploit new architectural features
– Increasing complexity of memory hierarchy, processor heterogeneity,
wide SIMD.
• Tuning implementation to minimize parallel overhead is nontrivial
– Shared memory: minimizing overhead of locks, barriers.
– Distributed memory: bounding message buffer sizes, bundling
messages, overlapping communication w/ computation.
Lecture Outline
• Applications
• Designing parallel graph algorithms, performance on
current systems
• Case studies: Graph traversal-based problems,
parallel algorithms
– Breadth-First Search
– Single-source Shortest paths
– Betweenness Centrality
Graph traversal (BFS) problem definition
• Input: a graph and a source vertex.
• Output: the distance of every vertex from the source vertex.
[Figure: an example graph with each vertex labeled by its distance from the source vertex.]
Memory requirements (# of machine words):
• Sparse graph representation: m+n
• Stack of visited vertices: n
• Distance array: n
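A serial BFS sketch over the CSR arrays from the earlier representation slide, producing the distance array described above (Python; using -1 for unreachable vertices is an assumption of this sketch):

    from collections import deque

    # Serial BFS on a CSR graph (index, adj): distance of every vertex from
    # the source, or -1 if unreachable. Memory use matches the slide:
    # graph (m+n words), queue of visited vertices (n), distance array (n).

    def bfs(index, adj, source):
        n = len(index) - 1
        dist = [-1] * n
        dist[source] = 0
        q = deque([source])
        while q:
            u = q.popleft()
            for v in adj[index[u]:index[u + 1]]:
                if dist[v] == -1:              # first time v is reached
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist

    index = [0, 2, 4, 6, 7, 8]                 # 5-vertex example from the CSR sketch
    adj = [1, 2, 0, 2, 0, 1, 4, 3]
    print(bfs(index, adj, 0))                  # [0, 1, 1, -1, -1]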
Optimizing BFS on cache-based multicore platforms, for networks with “power-law” degree distributions
Problem spec. assumptions:
• No. of vertices/edges: 10^6 ~ 10^9
• Edge/vertex ratio: 1 ~ 100
• Static/dynamic? Static
• Diameter: O(1) ~ O(log n)
• Weighted/unweighted? Unweighted
• Vertex degree distribution: unbalanced (“power law”)
• Directed/undirected? Both
• Simple/multi/hypergraph? Multigraph
• Granularity of computation at vertices/edges? Minimal
• Exploiting domain-specific characteristics? Partially
Test data: synthetic R-MAT networks; real networks (data: Mislove et al., IMC 2007).
Parallel BFS Strategies
1. Expand the current frontier (level-synchronous approach, suited for low-diameter graphs)
• O(D) parallel steps
• Adjacencies of all vertices in the current frontier are visited in parallel
2. Stitch multiple concurrent traversals (Ullman-Yannakakis approach, suited for high-diameter graphs)
• path-limited searches from “super vertices”
• APSP between “super vertices”
[Figures: the example graph traversed from the source vertex under each strategy.]
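A frontier-based sketch of strategy 1 (Python, written serially; the two inner loops, over the current frontier and over each vertex’s adjacencies, are the parts a level-synchronous implementation would run in parallel, with an atomic test-and-set on the distance/visited array):

    # Level-synchronous BFS: the graph is processed one frontier (BFS level)
    # at a time, giving O(D) steps for a graph of diameter D.

    def bfs_level_sync(index, adj, source):
        n = len(index) - 1
        dist = [-1] * n
        dist[source] = 0
        frontier = [source]
        level = 0
        while frontier:
            next_frontier = []
            for u in frontier:                          # parallel over frontier vertices
                for v in adj[index[u]:index[u + 1]]:    # parallel over adjacencies
                    if dist[v] == -1:                   # atomic test-and-set in parallel code
                        dist[v] = level + 1
                        next_frontier.append(v)
            frontier = next_frontier
            level += 1
        return dist

    index = [0, 2, 4, 6, 7, 8]
    adj = [1, 2, 0, 2, 0, 1, 4, 3]
    print(bfs_level_sync(index, adj, 0))                # [0, 1, 1, -1, -1]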
A deeper dive into the “level synchronous” strategy
Locality (where are the random accesses originating from?)
[Figure: example graph with relabeled vertices 0, 11, 26, 31, 44, 53, 63, 74, 84, 93; the current frontier is {53, 31, 74, 26}.]
1. Ordering of vertices in the “current
frontier” array, i.e., accesses to
adjacency indexing array,
cumulative accesses O(n).
2. Ordering of adjacency list of each
vertex, cumulative O(m).
3. Sifting through adjacencies to
check whether visited or not,
cumulative accesses O(m).
1. Access Pattern: idx array -- 53, 31, 74, 26
2,3. Access Pattern: d array -- 0, 84, 0, 84, 93, 44, 63, 0, 0, 11
Performance Observations
[Plots: for the Flickr and Youtube social networks, the percentage of total accounted for by “# of vertices in frontier array” and by “execution time” in each BFS phase (phases 1–15), with the graph-expansion and edge-filtering regimes marked.]
Improving locality: Vertex relabeling
[Figure: sparse matrix nonzero pattern of a graph before and after vertex relabeling.]
• Well-studied problem, slight differences in problem formulations
– Linear algebra: sparse matrix column reordering to reduce bandwidth, reveal
dense blocks.
– Databases/data mining: reordering bitmap indices for better compression;
permuting vertices of WWW snapshots, online social networks for
compression
• NP-hard problem, several known heuristics
– We require fast, linear-work approaches
– Existing ones: BFS or DFS-based, Cuthill-McKee, Reverse Cuthill-McKee, exploit
overlap in adjacency lists, dimensionality reduction
Improving locality: Optimizations
• Recall: potential O(m) non-contiguous memory references in edge traversal (to check whether a vertex is visited).
– e.g., access order: 53, 31, 31, 26, 74, 84, 0, …
• Objective: reduce TLB misses and private cache misses, exploit the shared cache.
• Optimizations:
1. Sort the adjacency list of each vertex – helps order memory accesses, reduces TLB misses.
2. Permute vertex labels – enhances spatial locality.
3. Cache-blocked edge visits – exploit temporal locality.
[Figure: the example graph with relabeled vertices 53, 84, 93, 0, 31, 44, 74, 26, 63, 11.]
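Optimization 1 amounts to sorting each vertex’s slice of the CSR adjacency array, as in this small preprocessing sketch (Python; reuses the index/adj layout from the representation slide):

    # Sort every vertex's adjacency list in place within the CSR adjacency
    # array so that edge inspections touch the visited/distance arrays in
    # increasing vertex order (better spatial locality, fewer TLB misses).

    def sort_adjacencies(index, adj):
        for u in range(len(index) - 1):
            adj[index[u]:index[u + 1]] = sorted(adj[index[u]:index[u + 1]])

    index = [0, 2, 4, 6, 7, 8]
    adj = [2, 1, 0, 2, 1, 0, 4, 3]
    sort_adjacencies(index, adj)
    print(adj)                        # [1, 2, 0, 2, 0, 1, 4, 3]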
Improving locality: Cache blocking
[Figure: cache-blocked edge visits. The adjacency structure (d) is split into blocks tuned to the L2 cache size, with metadata denoting the blocking pattern; the frontier is processed block by block, while high-degree vertices are processed separately in linear order.]
New: cache-blocked approach
• Instead of processing adjacencies of each vertex
serially, exploit sorted adjacency list structure w/
blocked accesses
• Requires multiple passes through the frontier array,
tuning for optimal block size.
• Note: frontier array size may be O(n)
Vertex relabeling heuristic
Similar to older heuristics, but tuned for small-world networks:
1. A high percentage of vertices in social and information networks have (out-)degree 0, 1, or 2 => store their adjacencies explicitly (in the indexing data structure).
– Augment the adjacency indexing data structure (with two additional words) and the frontier array (with one bit).
2. Process “high-degree vertex” adjacencies in linear order, but other vertices with d-array cache blocking.
3. Form dense blocks around high-degree vertices.
– Reverse Cuthill-McKee, removing degree-1 and degree-2 vertices.
Architecture-specific Optimizations
1. Software prefetching on the Intel Core i7 (supports 32 loads and 20 stores
in flight)
– Speculative loads of index array and adjacencies of frontier vertices will
reduce compulsory cache misses.
2. Aligning adjacency lists to optimize memory accesses
– 16-byte aligned loads and stores are faster.
– Alignment helps reduce cache misses due to fragmentation
– 16-byte aligned non-temporal stores (during creation of new frontier) are fast.
3. SIMD SSE integer intrinsics to process “high-degree vertex” adjacencies.
4. Fast atomics (BFS is lock-free w/ low contention, and CAS-based intrinsics
have very low overhead)
5. Hugepage support (significant TLB miss reduction)
6. NUMA-aware memory allocation exploiting first-touch policy
Experimental Setup
Networks (n, m, max. out-degree, % of vertices w/ out-degree 0, 1, 2):
• Orkut: 3.07M vertices, 223M edges, max. out-degree 32K, 5%
• LiveJournal: 5.28M vertices, 77.4M edges, max. out-degree 9K, 40%
• Flickr: 1.86M vertices, 22.6M edges, max. out-degree 26K, 73%
• Youtube: 1.15M vertices, 4.94M edges, max. out-degree 28K, 76%
• R-MAT: 8M–64M vertices, m = 8n, max. out-degree n^0.6
Intel Xeon 5560 (Core i7, “Nehalem”):
• 2 sockets x 4 cores x 2-way SMT
• 12 GB DRAM, 8 MB shared L3
• 51.2 GBytes/sec peak bandwidth
• 2.66 GHz proc.
Performance averaged over 10 different source vertices, 3 runs each.
Impact of optimization strategies (generality; impact*; tuning required?):
• (Preproc.) Sort adjacency lists: high; --; no
• (Preproc.) Permute vertex labels: medium; --; yes
• Preproc. + binning frontier vertices + cache blocking: medium; 2.5x; yes
• Lock-free parallelization: medium; 2.0x; no
• Low-degree vertex filtering: low; 1.3x; no
• Software prefetching: medium; 1.10x; yes
• Aligning adjacencies, streaming stores: medium; 1.15x; no
• Fast atomic intrinsics: high; 2.2x; no
* Optimization speedup (performance on 4 cores) w.r.t. the baseline parallel approach, on a synthetic R-MAT graph (n = 2^23, m = 2^26).
Cache locality improvement
Performance count: # of non-contiguous memory accesses (assuming a cache line size of 16 words).
[Plot: fraction of the theoretical peak (theoretical count of non-contiguous memory accesses: m+3n) achieved by the Baseline, Adj. sorted, and Permute+cache-blocked variants on the Orkut, LiveJournal, Flickr, and Youtube data sets.]
Parallel performance (Orkut graph)
[Plot: performance rate (million traversed edges/s, up to 1000) vs. number of threads (1, 2, 4, 8) for cache-blocked BFS and the baseline, on a single socket of the Intel Xeon 5560 (Core i7). Graph: 3.07 million vertices, 220 million edges.]
• Execution time: 0.28 seconds (8 threads)
• Parallel speedup: 4.9
• Speedup over baseline: 2.9
Performance Analysis
• How well does my implementation match theoretical
bounds?
• When can I stop optimizing my code?
– Begin with asymptotic analysis
– Express work performed in terms of machine-independent
performance counts
– Add input graph characteristics in analysis
– Use simple kernels or micro-benchmarks to provide an
estimate of achievable peak performance
– Relate observed performance to machine characteristics
Graph 500 “Search” Benchmark (graph500.org)
• BFS (from a single vertex) on a
static, undirected R-MAT network
with average vertex degree 16.
• Evaluation criteria: largest problem
size that can be solved on a system,
minimum execution time.
• Reference MPI, shared memory
implementations provided.
• NERSC Franklin system is ranked #2
on current list (Nov 2010).
– BFS using 500 nodes of Franklin
Lecture Outline
• Applications
• Designing parallel graph algorithms, performance on
current systems
• Case studies: Graph traversal-based problems,
parallel algorithms
– Breadth-First Search
– Single-source Shortest paths
– Betweenness Centrality
Parallel Single-source Shortest Paths
(SSSP) algorithms
• Edge weights: concurrency primary challenge!
• No known PRAM algorithm that runs in sub-linear time and O(m + n log n) work
• Parallel priority queues: relaxed heaps [DGST88], [BTZ98]
• Ullman-Yannakakis randomized approach [UY90]
• Meyer and Sanders, ∆-stepping algorithm [MS03]
• Distributed memory implementations based on graph
partitioning
• Heuristics for load balancing and termination detection
K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of A Parallel
Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm
Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.
∆-stepping algorithm [MS03]
• Label-correcting algorithm: can relax edges from unsettled vertices also
• ∆-stepping: an “approximate bucket implementation of Dijkstra’s algorithm”
• ∆: bucket width
• Vertices are ordered using buckets representing priority ranges of size ∆
• Each bucket may be processed in parallel
∆-stepping algorithm: illustration (∆ = 0.1, say)
[Figure sequence: a 7-vertex example graph (vertices 0–6, source vertex 0) with edge weights between 0.01 and 0.56. The tentative-distance array d starts as [0, ∞, ∞, ∞, ∞, ∞, ∞] after the initialization “insert s into the bucket, d(s) = 0”; the animation then steps through bucket processing, tracking the request set R, the deleted-vertex set S, and the bucket contents, until the final d array [0, .03, .01, .06, .16, .29, .62] is reached.]
One parallel phase, while the current bucket is non-empty:
i) Inspect light edges
ii) Construct a set of “requests” (R)
iii) Clear the current bucket
iv) Remember deleted vertices (S)
v) Relax request pairs in R
Then relax heavy request pairs (from S) and go on to the next bucket.
• Classify edges as “heavy” and “light”
• Relax light edges (phase); repeat until B[i] is empty
• Relax heavy edges: no reinsertions in this step
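A compact Python sketch of ∆-stepping following the phases above (written serially; the bucket dictionary, the relax helper, and the small example graph are illustrative choices, not the lecture’s code or its example weights):

    import math

    # Delta-stepping SSSP: light edges (weight <= delta) are relaxed in
    # repeated phases within the current bucket; heavy edges are relaxed
    # once per settled vertex, with no reinsertion into the current bucket.
    # graph[u] = [(v, w), ...]; in the parallel algorithm, each phase's
    # relaxation requests are processed concurrently.

    def delta_stepping(graph, source, delta):
        dist = [math.inf] * len(graph)
        buckets = {}                             # bucket id -> set of vertices

        def relax(v, d):
            if d < dist[v]:
                old = int(dist[v] / delta) if dist[v] < math.inf else None
                if old in buckets:
                    buckets[old].discard(v)
                    if not buckets[old]:
                        del buckets[old]
                dist[v] = d
                buckets.setdefault(int(d / delta), set()).add(v)

        relax(source, 0.0)
        while buckets:
            i = min(buckets)                     # smallest non-empty bucket
            settled = set()
            while buckets.get(i):                # light-edge phases (may reinsert into bucket i)
                frontier = buckets.pop(i)
                settled |= frontier
                requests = [(v, dist[u] + w) for u in frontier
                            for v, w in graph[u] if w <= delta]
                for v, d in requests:
                    relax(v, d)
            for u in settled:                    # heavy edges: relaxed once, no reinsertions
                for v, w in graph[u]:
                    if w > delta:
                        relax(v, dist[u] + w)
        return dist

    # Hypothetical 5-vertex example with weights in [0,1], delta = 0.1
    graph = [
        [(1, 0.02), (2, 0.01)],                  # vertex 0
        [(0, 0.02), (3, 0.05)],                  # vertex 1
        [(0, 0.01), (3, 0.56)],                  # vertex 2
        [(1, 0.05), (2, 0.56), (4, 0.13)],       # vertex 3
        [(3, 0.13)],                             # vertex 4
    ]
    print(delta_stepping(graph, source=0, delta=0.1))   # approximately [0, 0.02, 0.01, 0.07, 0.20]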
No. of phases (machine-independent performance count)
[Plot: number of phases (log scale, 10 to 10^6) for the graph families Rnd-rnd, Rnd-logU, Scale-free, LGrid-rnd, LGrid-logU, SqGrid, USAd NE, USAt NE. Low-diameter families need far fewer phases than high-diameter families.]
Average shortest path weight for various graph families
~2^20 vertices, 2^22 edges, directed graphs, edge weights normalized to [0,1]
[Plot: average shortest path weight (log scale, 0.01 to 10^5) for the graph families Rnd-rnd, Rnd-logU, Scale-free, LGrid-rnd, LGrid-logU, SqGrid, USAd NE, USAt NE.]
Last non-empty bucket (machine-independent performance count)
[Plot: index of the last non-empty bucket (log scale, 1 to 10^6) for the same graph families. Fewer buckets => more parallelism.]
Number of bucket insertions (machine-independent performance count)
[Plot: number of bucket insertions (0 to 12,000,000) for the same graph families.]
Lecture Outline
• Applications
• Designing parallel graph algorithms, performance on
current systems
• Case studies: Graph traversal-based problems,
parallel algorithms
– Breadth-First Search
– Single-source Shortest paths
– Betweenness Centrality
Betweenness Centrality
• Centrality: Quantitative measure to capture the
importance of a vertex/edge in a graph
– degree, closeness, eigenvalue, betweenness
• Betweenness Centrality
BC(v) = Σ_{s ≠ v ≠ t} σ_st(v) / σ_st
(σ_st: no. of shortest paths between s and t; σ_st(v): no. of those paths passing through v)
• Applied to several real-world networks
– Social interactions
– WWW
– Epidemiology
– Systems biology
Algorithms for Computing Betweenness
• All-pairs shortest path approach: compute the length and number of shortest paths between all s-t pairs (O(n^3) time), sum up the fractional dependency values (O(n^2) space).
• Brandes’ algorithm (2003): Augment a single-source shortest
path computation to count paths; uses the Bellman criterion;
O(mn) work and O(m+n) space.
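A serial Python sketch in the spirit of Brandes’ approach for unweighted graphs: one BFS per source counts shortest paths, then dependencies are accumulated in reverse BFS order (a minimal sketch, not the parallel algorithm of the next slides; the adjacency-list input and the small path-graph example are illustrative):

    from collections import deque

    # Betweenness centrality via per-source BFS + dependency accumulation:
    # O(mn) work and O(m+n) space per source for an unweighted graph.
    # Each ordered pair (s, t) is counted, matching the definition above.

    def betweenness(n, adj_list):
        bc = [0.0] * n
        for s in range(n):
            sigma = [0] * n; sigma[s] = 1        # number of shortest s-v paths
            dist = [-1] * n; dist[s] = 0
            order = []                           # vertices in non-decreasing distance
            pred = [[] for _ in range(n)]        # predecessors along shortest paths
            q = deque([s])
            while q:
                u = q.popleft()
                order.append(u)
                for v in adj_list[u]:
                    if dist[v] == -1:
                        dist[v] = dist[u] + 1
                        q.append(v)
                    if dist[v] == dist[u] + 1:
                        sigma[v] += sigma[u]
                        pred[v].append(u)
            delta = [0.0] * n
            for w in reversed(order):            # dependency accumulation
                for v in pred[w]:
                    delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w])
                if w != s:
                    bc[w] += delta[w]
        return bc

    # Path graph 0-1-2-3: the two interior vertices carry all the betweenness
    print(betweenness(4, [[1], [0, 2], [1, 3], [2]]))   # [0.0, 4.0, 4.0, 0.0]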
Our New Parallel Algorithms
• Madduri, Bader (2006): parallel algorithms for computing
exact and approximate betweenness centrality
– low-diameter sparse graphs (diameter D = O(log n), m = O(n log n))
– Exact algorithm: O(mn) work, O(m+n) space, O(nD+nm/p) time.
• Madduri et al. (2009): New parallel algorithm with lower
synchronization overhead and fewer non-contiguous
memory references
– In practice, 2-3X faster than previous algorithm
– Lock-free => better scalability on large parallel systems
Parallel BC Algorithm
• Consider an undirected, unweighted graph
• High-level idea: level-synchronous parallel Breadth-First Search augmented to compute centrality scores
• Two steps
– traversal and path counting
– dependency accumulation:
δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) · (1 + δ(w))
Parallel BC Algorithm Illustration
1. Traversal step: visit adjacent vertices, update distance and path counts. Level-synchronous approach: the adjacencies of all vertices in the current frontier can be visited in parallel.
At the end of the traversal we have all reachable vertices, their corresponding predecessor multisets (P), and D (distance) values.
[Figure sequence: the 10-vertex example graph traversed level by level from the source vertex, showing the stack S, the distance array D, the path counts, and the predecessor multisets P as they are filled in.]
Graph traversal step analysis
• Exploit concurrency in visiting adjacencies, as we assume that
the graph diameter is small: O(log n)
• Upper bound on size of each predecessor multiset: In-degree
• Potential performance bottlenecks: atomic updates to
predecessor multisets, atomic increments of path counts
• New algorithm: Based on observation that we don’t need to
store “predecessor” vertices. Instead, we store successor
edges along shortest paths.
– simplifies the accumulation step
– reduces an atomic operation in traversal step
– cache-friendly!
Graph Traversal Step locality analysis
    for all vertices u at level d in parallel do          // all the vertices are in a contiguous block (stack)
      for all adjacencies v of u in parallel do           // all the adjacencies of a vertex are stored compactly (graph rep.)
        dv = D[v];                                        // non-contiguous memory access
        if (dv < 0)
          vis = fetch_and_add(&Visited[v], 1);            // non-contiguous memory access
          if (vis == 0)
            D[v] = d + 1;
            pS[count++] = v;
          fetch_and_add(&sigma[v], sigma[u]);             // non-contiguous memory access
          fetch_and_add(&Scount[u], 1);                   // store to S[u]
        if (dv == d + 1)
          fetch_and_add(&sigma[v], sigma[u]);
          fetch_and_add(&Scount[u], 1);
Better cache utilization likely if D[v], Visited[v], sigma[v] are stored contiguously
Parallel BC Algorithm Illustration
2. Accumulation step: pop vertices from the stack and update dependence scores:
δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) · (1 + δ(w))
The accumulation can also be done in a level-synchronous manner.
[Figure: the same example, showing the stack S, the predecessor multisets P, and the delta values as vertices are popped.]
Accumulation step locality analysis
    for level d = GraphDiameter-2 to 1 do
      for all vertices v at level d in parallel do                        // all the vertices are in a contiguous block (stack)
        for all w in S[v] in parallel do reduction(delta)                 // each S[v] is a contiguous block
          delta_sum_v = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];    // the only floating point operation in the code
        BC[v] = delta[v] = delta_sum_v;
Centrality Analysis applied to Protein Interaction Networks
Human genome core protein interactions: degree vs. betweenness centrality.
[Plot: betweenness centrality (10^-7 to 1, log scale) vs. degree (1 to 100, log scale) for each protein; one highlighted protein, Kelch-like protein 8 (Ensembl ID ENSG00000145332.2), has 43 interactions.]
Designing parallel algorithms for large sparse graph analysis
System requirements: high (on-chip memory, DRAM, network, I/O) bandwidth.
Solution: efficiently utilize the available memory bandwidth.
[Figure: problem complexity (from “RandomAccess”-like to “Stream”-like) vs. problem size (n: # of vertices/edges, 10^4 to 10^12 and beyond, peta+), with work performed ranging from O(n) through O(n log n) to O(n^2). Levers: algorithmic innovation to avoid corner cases, improving locality, data reduction/compression, and faster methods.]
Review of lecture
• Applications: Internet and WWW, Scientific
computing, Data analysis, Surveillance
• Overview of parallel graph algorithms
– PRAM algorithms
– graph representation
• Parallel algorithm case studies
– BFS: locality, level-synchronous approach, multicore tuning
– Shortest paths: exploiting concurrency, parallel priority
queue
– Betweenness centrality: importance of locality, atomics
Thank you!
• Questions?