Parallel Graph Algorithms
Kamesh Madduri
[email protected]
Lawrence Berkeley National Laboratory
CS267/EngC233 Spring 2010
April 15, 2010
Lecture Outline
• Applications
• Review of key results
• Case studies: Graph traversal-based problems, parallel algorithms
  – Breadth-First Search
  – Single-source Shortest paths
  – Betweenness Centrality
  – Community Identification
Routing in transportation networks
Road networks, point-to-point shortest paths: 15 seconds (naïve) → 10 microseconds
H. Bast et al., “Fast Routing in Road Networks with Transit Nodes”, Science 27, 2007.
Internet and the WWW
• The world-wide web can be represented as a directed graph
– Web search and crawl: traversal
– Link analysis, ranking: Page rank and HITS
– Document classification and clustering
• Internet topologies (router networks) are naturally modeled
as graphs
Scientific Computing
• Reorderings for sparse solvers
  – Fill-reducing orderings: partitioning, eigenvectors
  – Heavy diagonal to reduce pivoting: matching
• Data structures for efficient exploitation of sparsity
• Derivative computations for optimization
  – Matroids, graph colorings, spanning trees
• Preconditioning
  – Incomplete factorizations
  – Partitioning for domain decomposition
  – Graph techniques in algebraic multigrid: independent sets, matchings, etc.
  – Support theory: spanning trees & graph embedding techniques
B. Hendrickson, “Graphs and HPC: Lessons for Future Architectures”,
http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf
Image sources: Yifan Hu, “A gallery of large graphs”; Tim Davis, UF Sparse Matrix Collection.
Large-scale data analysis
• Graph abstractions are very useful to analyze complex data
sets.
• Sources of data: petascale simulations, experimental devices,
the Internet, sensor networks
• Challenges: data size, heterogeneity, uncertainty, data quality
Examples: astrophysics (massive datasets, temporal variations); bioinformatics (data quality, heterogeneity); social informatics (new analytics challenges, data uncertainty).
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com
Data Analysis and Graph Algorithms in Systems Biology
• Study of the interactions between
various components in a
biological system
• Graph-theoretic formulations are
pervasive:
– Predicting new interactions:
modeling
– Functional annotation of novel
proteins: matching, clustering
– Identifying metabolic pathways:
paths, clustering
– Identifying new protein complexes:
clustering, centrality
Image Source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”,
Science 302, 1722-1736, 2003.
Graph-theoretic problems in social networks
– Community identification: clustering
– Targeted advertising: centrality
– Information spreading: modeling
Image Source: Nexus (Facebook application)
Network Analysis for Intelligence and Surveillance
• [Krebs ’04] Post 9/11 Terrorist
Network Analysis from public
domain information
• Plot masterminds correctly
identified from interaction
patterns: centrality
Image Source: http://www.orgnet.com/hijackers.html
• A global view of entities is
often more insightful
• Detect anomalous activities
by exact/approximate
subgraph isomorphism.
Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies
for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47
Research in Parallel Graph Algorithms
Application areas and representative methods/problems:
• Social network analysis: find central entities, community detection, network dynamics
• WWW: marketing, social search
• Computational biology: gene regulation, metabolic pathways, genomics
• Scientific computing: graph partitioning, matching, coloring, …
• Engineering: VLSI CAD, route planning, …
Underlying graph algorithms: traversal, shortest paths, connectivity, max flow, …
Challenges: data size, problem complexity
Architectures: GPUs, FPGAs, x86 multicore servers, massively multithreaded architectures, multicore clusters, clouds
Characterizing graph-theoretic computations
• Input data
• Problem: find paths, clusters, partitions, matchings, patterns, orderings, …
• Graph kernel: traversal, shortest path algorithms, flow algorithms, spanning tree algorithms, topological sort, …
• Factors that influence the choice of algorithm:
  – graph sparsity (m/n ratio)
  – static/dynamic nature
  – weighted/unweighted, weight distribution
  – vertex degree distribution
  – directed/undirected
  – simple/multi/hyper graph
  – problem size
  – granularity of computation at nodes/edges
  – domain-specific characteristics
• Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations
Lecture Outline
• Applications
• Review of key results
• Graph traversal-based parallel algorithms, case studies
  – Breadth-First Search
  – Single-source Shortest paths
  – Betweenness Centrality
  – Community Identification
History
• 1735: “Seven Bridges of Königsberg” problem, resolved by
Euler, first result in graph theory.
• …
• 1966: Flynn’s Taxonomy.
• 1968: Batcher’s “sorting networks”
• 1969: Harary’s “Graph Theory”
• …
• 1972: Tarjan’s “Depth-first search and linear graph algorithms”
• 1975: Reghbati and Corneil, Parallel Connected Components
• 1982: Misra and Chandy, distributed graph algorithms.
• 1984: Quinn and Deo’s survey paper on “parallel graph
algorithms”
• …
The PRAM model
• Idealized parallel shared memory system
model
• Unbounded number of synchronous processors; no synchronization or communication cost; no parallel overhead
• EREW (exclusive read, exclusive write), CREW (concurrent read, exclusive write)
• Measuring performance: space and time
complexity; total number of operations (work)
The Helman-JaJa model
• Extension to the PRAM model for shared memory
algorithm design and analysis.
• T(n, p) is measured by the triplet (T_M(n, p), T_C(n, p), B(n, p)):
  – T_M(n, p): maximum number of non-contiguous main memory accesses required by any processor
  – T_C(n, p): upper bound on the maximum local computational complexity of any of the processors
  – B(n, p): number of barrier synchronizations.
PRAM Pros and Cons
• Pros
– Simple and clean semantics.
– The majority of theoretical parallel algorithms are specified
with the PRAM model.
– Independent of the communication network topology.
• Cons
  – Not realistic; too powerful a communication model.
  – Algorithm designer is misled to use IPC without hesitation.
  – Synchronized processors.
  – No local memory.
  – Big-O notation is often misleading.
Building blocks of classical PRAM graph algorithms
• Prefix sums
• List ranking
– Euler tours, Pointer jumping, Symmetry breaking
• Vertex collapse
• Tree contraction
Prefix Sums
• Input: A, an array of n elements; associative binary operation ⊕
• Output: the prefix sums A(1) ⊕ A(2) ⊕ … ⊕ A(i), for 1 ≤ i ≤ n
• O(n) work, O(log n) time, n processors
(Figure: the balanced-binary-tree construction over 8 elements, computing partial sums B(h, i) in an upward sweep and prefix values C(h, i) in a downward sweep.)
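Not from the slides: a minimal sequential sketch (in Python) of the two-sweep prefix-sum computation the figure illustrates. In a PRAM setting each inner loop runs in parallel across processors, which gives the O(n) work, O(log n) time bounds quoted above. The function name and the power-of-two restriction are simplifications of the sketch, not part of the original algorithm statement.

    # Sketch: work-efficient prefix sums via an up-sweep of partial sums (the B values
    # in the figure) followed by a down-sweep that fills in the prefixes (the C values).
    def prefix_sums(a, op=lambda x, y: x + y):
        n = len(a)
        assert n & (n - 1) == 0, "assume n is a power of two for simplicity"
        b = list(a)
        stride = 1
        while stride < n:                       # up-sweep: O(n) total work over log n levels
            for i in range(2 * stride - 1, n, 2 * stride):
                b[i] = op(b[i - stride], b[i])
            stride *= 2
        stride = n // 4
        while stride >= 1:                      # down-sweep: propagate partial results
            for i in range(2 * stride - 1, n - stride, 2 * stride):
                b[i + stride] = op(b[i], b[i + stride])
            stride //= 2
        return b                                # b[i] = a[0] op a[1] op ... op a[i]

    # Example: prefix_sums([1] * 8) returns [1, 2, 3, 4, 5, 6, 7, 8].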
Parallel Prefix
• X: array of n elements stored in arbitrary order.
• For each element i, let X(i).value be its value and X(i).next be the index of
its successor.
• For binary associative operator Θ, compute X(i).prefix such that
– X(head).prefix = X (head).value, and
– X(i).prefix = X(i).value Θ X(predecessor).prefix
where
– head is the first element
– i is not equal to head, and
– predecessor is the node preceding i.
• List ranking: special case of parallel prefix, values initially set
to 1, and addition is the associative operator.
List ranking illustration
• Ordered list (X.next values): 2 3 4 5 6 7 8 9
• Random list (X.next values): 4 6 5 7 8 3 2 9
List Ranking key idea
1. Chop X randomly into s pieces
2. Traverse each piece using a serial algorithm.
3. Compute the global rank of each element using the
result computed from the second step.
• In the Helman-JaJa model, T_M(n, p) = O(n/p).
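Not from the slides: a minimal sequential Python sketch of the three-step idea, with hypothetical names (list_rank, next_idx). The per-sublist walks in step 2 are independent, which is exactly the work a parallel implementation assigns to the p processors; step 3 is a prefix sum over the sublist totals.

    import random

    def list_rank(next_idx, head):
        # Rank of each node = its position from the head (the head gets rank 1).
        # next_idx[i] is the successor of node i, or -1 at the tail.
        n = len(next_idx)
        s = max(1, int(n ** 0.5))              # number of sublists (splitters)
        splitters = set(random.sample(range(n), s))
        splitters.add(head)                    # the true head must start a sublist

        local = [0] * n                        # rank within the owning sublist
        owner = [-1] * n                       # splitter whose sublist owns each node
        sublist_len = {}
        next_sublist = {}                      # splitter -> splitter of the following sublist

        # Step 2: walk each sublist serially (independent walks -> parallel).
        for sp in splitters:
            r, v = 0, sp
            while v != -1 and (v == sp or v not in splitters):
                r += 1
                local[v] = r
                owner[v] = sp
                v = next_idx[v]
            sublist_len[sp] = r
            next_sublist[sp] = v if v != -1 else None

        # Step 3: prefix-sum the sublist lengths in list order to get global offsets.
        offset, rank = {}, [0] * n
        sp, acc = head, 0
        while sp is not None:
            offset[sp] = acc
            acc += sublist_len[sp]
            sp = next_sublist[sp]
        for v in range(n):
            rank[v] = offset[owner[v]] + local[v]
        return rank

    # Example: list_rank([1, 2, -1], head=0) returns [1, 2, 3].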
Connected Components
• Building block for many graph algorithms
– Minimum spanning tree, biconnected components,
planarity testing, etc.
• Representative of the “graft-and-shortcut” approach
• CRCW PRAM algorithms
  – [Shiloach & Vishkin ’82]: O(log n) time, O((m+n) log n) work
  – [Gazit ’91]: randomized, optimal, O(log n) time.
• CREW algorithms
  – [Han & Wagner ’90]: O(log^2 n) time, O((m + n log n) log n) work.
Shiloach-Vishkin algorithm
• Input: n isolated vertices and m PRAM processors.
• Each processor P_i grafts a tree rooted at vertex v_i onto the tree that contains one of its neighbors u, under the constraint u < v_i.
• Grafting creates k ≥ 1 connected subgraphs; each subgraph is then shortcut so that the depth of its tree is reduced by at least half.
(Figure: graft and shortcut steps on a four-vertex example, producing components {1,4} and {2,3}.)
• Repeat graft and shortcut until no more grafting is possible.
• Runs on an arbitrary CRCW PRAM in O(log n) time with O(m) processors.
• Helman-JaJa model: T_M = (3m/p + 2) log n, B = 4 log n.
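Not from the slides: a small sequential Python sketch of the graft-and-shortcut pattern, using a parent array D. In the PRAM algorithm every edge and every vertex is handled by its own processor in each round, and shortcutting only halves the tree depth per round rather than fully compressing it as done here.

    def connected_components(n, edges):
        # D[v] converges to a component representative (the smallest label in the component).
        D = list(range(n))                     # every vertex starts as its own root
        changed = True
        while changed:
            changed = False
            # Graft: hook a root onto the root of a smaller-labeled neighbor.
            for (u, v) in edges:
                for (a, b) in ((u, v), (v, u)):
                    ra, rb = D[a], D[b]
                    if D[ra] == ra and rb < ra:    # ra is a root, so graft it
                        D[ra] = rb
                        changed = True
            # Shortcut: pointer jumping (compressed fully here; the PRAM version
            # halves the depth per round).
            for v in range(n):
                while D[v] != D[D[v]]:
                    D[v] = D[D[v]]
        return D

    # Example: connected_components(5, [(0, 1), (1, 2), (3, 4)]) returns [0, 0, 0, 3, 3].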
An example higher-level algorithm
Typically composed of several low-level efficient
algorithms.
Tarjan-Vishkin’s biconnected components algorithm:
O(log n) time, O(m+n) work.
1. Compute spanning tree T for the input graph G.
2. Compute Eulerian circuit for T.
3. Root the tree at an arbitrary vertex.
4. Preorder numbering of all the vertices.
5. Label edges using vertex numbering.
6. Connected components using the Shiloach-Vishkin
algorithm.
Data structures: graph representation
Static case
• Dense graphs (m = O(n^2)): adjacency matrix commonly used.
• Sparse graphs: adjacency lists
Dynamic
• representation depends on common-case query
• Edge insertions or deletions? Vertex insertions or deletions?
Edge weight updates?
• Graph update rate
• Queries: connectivity, paths, flow, etc.
• Optimizing for locality is a key design consideration.
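Not from the slides: a minimal Python sketch of the array-based sparse representation (compressed sparse row style) that the later BFS and centrality case studies assume, i.e., an index array of length n+1 plus a flat adjacency array (m entries for a directed graph, 2m for an undirected one). The function name and edge-list input format are assumptions of the sketch.

    def build_csr(n, edges, directed=False):
        # Neighbors of vertex v end up in adj[idx[v]:idx[v + 1]].
        deg = [0] * n
        for u, v in edges:
            deg[u] += 1
            if not directed:
                deg[v] += 1
        idx = [0] * (n + 1)
        for v in range(n):
            idx[v + 1] = idx[v] + deg[v]
        adj = [0] * idx[n]
        cursor = list(idx[:n])
        for u, v in edges:
            adj[cursor[u]] = v
            cursor[u] += 1
            if not directed:
                adj[cursor[v]] = u
                cursor[v] += 1
        return idx, adj

    # Example: idx, adj = build_csr(3, [(0, 1), (1, 2)])
    # adj[idx[1]:idx[2]] is [0, 2], the neighbors of vertex 1.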
Data structures in (parallel) graph
algorithms
• A wide range seen in graph algorithms: array,
list, queue, stack, set, multiset, tree
• Implementations are typically array-based for
performance considerations.
• Key data structure considerations in parallel
graph algorithm design
– Practical parallel priority queues
– Space-efficiency
– Parallel set/multiset operations, e.g., union,
intersection, etc.
Lecture Outline
• Applications
• Review of key results
• Case studies: Graph traversal-based problems, parallel algorithms
  – Breadth-First Search
  – Single-source Shortest paths
  – Betweenness Centrality
  – Community Identification
Graph Algorithms on today’s systems
• Concurrency
– Simulating PRAM algorithms: hardware limits of
memory bandwidth, number of outstanding
memory references; synchronization
• Locality
• Work-efficiency
– Try to improve cache locality, but avoid “too
much” superfluous computation
The locality challenge
“Large memory footprint, low spatial and temporal locality
impede performance”
Serial performance of “approximate betweenness centrality” on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8 MB L3 cache); input: synthetic R-MAT graphs (# of edges m = 8n).
(Figure: performance rate in millions of traversed edges per second vs. problem size (log2 # of vertices, 10 to 26); throughput drops roughly 5X going from the regime with no last-level cache (LLC) misses to the regime with O(m) LLC misses.)
The parallel scaling challenge
“Classical parallel graph algorithms perform poorly on current
parallel systems”
• Graph topology assumptions in classical algorithms do not
match real-world datasets
• Parallelization strategies at loggerheads with techniques for
enhancing memory locality
• Classical “work-efficient” graph algorithms may not fully
exploit new architectural features
– Increasing complexity of memory hierarchy (x86), DMA support (Cell),
wide SIMD, floating point-centric cores (GPUs).
• Tuning implementation to minimize parallel overhead is nontrivial
– Shared memory: minimizing overhead of locks, barriers.
– Distributed memory: bounding message buffer sizes, bundling
messages, overlapping communication w/ computation.
Optimizing BFS on cache-based multicore platforms,
for networks with “power-law” degree distributions
Problem spec.                                   Assumptions
No. of vertices/edges                           10^6 ~ 10^9
Edge/vertex ratio                               1 ~ 100
Static/dynamic?                                 Static
Diameter                                        O(1) ~ O(log n)
Weighted/unweighted?                            Unweighted
Vertex degree distribution                      Unbalanced (“power law”)
Directed/undirected?                            Both
Simple/multi/hypergraph?                        Multigraph
Granularity of computation at vertices/edges?   Minimal
Exploiting domain-specific characteristics?     Partially
Test data: synthetic R-MAT networks and real social networks (data: Mislove et al., IMC 2007).
Graph traversal (BFS) problem definition
• Input: a graph and a source vertex.
• Output: the distance of each vertex from the source vertex.
(Figure: a ten-vertex example with source vertex 0; each reachable vertex is labeled with its BFS distance 1, 2, 3, or 4 from the source.)
Memory requirements (# of machine words):
• Sparse graph representation: m+n
• Stack of visited vertices: n
• Distance array: n
Parallel BFS Strategies
1. Expand the current frontier (level-synchronous approach, suited for low-diameter graphs)
   • O(D) parallel steps
   • Adjacencies of all vertices in the current frontier are visited in parallel
2. Stitch multiple concurrent traversals (Ullman-Yannakakis approach, suited for high-diameter graphs)
   • path-limited searches from “super vertices”
   • APSP between “super vertices”
(Figure: the same ten-vertex example graph, with source vertex 0, illustrating both strategies.)
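Not from the slides: a minimal sequential Python sketch of strategy 1, level-synchronous BFS, over the idx/adj arrays from the earlier CSR sketch. The loop over the current frontier is what a shared-memory implementation distributes across threads, with atomics or per-thread buffers guarding insertions into the next frontier.

    def bfs_level_sync(idx, adj, n, source):
        # Process one frontier (distance level) at a time; O(D) such steps.
        dist = [-1] * n                        # -1 marks an unvisited vertex
        dist[source] = 0
        frontier = [source]
        level = 0
        while frontier:
            next_frontier = []
            for u in frontier:                 # parallel loop in a threaded version
                for v in adj[idx[u]:idx[u + 1]]:
                    if dist[v] == -1:
                        dist[v] = level + 1
                        next_frontier.append(v)
            frontier = next_frontier
            level += 1
        return dist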
A deeper dive into the “level synchronous” strategy
Locality (where are the random accesses originating from?)
(Figure: a frontier of vertices labeled 53, 84, 93, 0, 31, 44, 74, 26, 63, 11.)
1. Ordering of vertices in the “current frontier” array, i.e., accesses to the adjacency indexing array: cumulative accesses O(n).
2. Ordering of the adjacency list of each vertex: cumulative accesses O(m).
3. Sifting through adjacencies to check whether visited or not: cumulative accesses O(m).
Access patterns in the example:
1. idx array: 53, 31, 74, 26
2, 3. d array: 0, 84, 0, 84, 93, 44, 63, 0, 0, 11
Performance Observations
(Figure: for the Flickr and Youtube social networks, the number of vertices in the frontier array and the execution time of each BFS phase, as a percentage of the total, for phases 1–15, with regions annotated “graph expansion” and “edge filtering”.)
Improving locality: Vertex relabeling
(Figure: spy plots of a sparse adjacency matrix before and after vertex relabeling; relabeling clusters the nonzeros into dense blocks.)
• Well-studied problem, slight differences in problem formulations
– Linear algebra: sparse matrix column reordering to reduce bandwidth, reveal
dense blocks.
– Databases/data mining: reordering bitmap indices for better compression;
permuting vertices of WWW snapshots, online social networks for
compression
• NP-hard problem, several known heuristics
– We require fast, linear-work approaches
– Existing ones: BFS or DFS-based, Cuthill-McKee, Reverse Cuthill-McKee, exploit
overlap in adjacency lists, dimensionality reduction
Improving locality: Optimizations
• Recall: potential O(m) non-contiguous memory references in edge traversal (to check if a vertex is visited).
  – e.g., access order: 53, 31, 31, 26, 74, 84, 0, …
• Objective: reduce TLB misses and private cache misses, exploit the shared cache.
• Optimizations:
  1. Sort the adjacency list of each vertex – helps order memory accesses, reduces TLB misses.
  2. Permute vertex labels – enhances spatial locality.
  3. Cache-blocked edge visits – exploit temporal locality.
Improving locality: Cache blocking
(Figure: the frontier array and the adjacency structure d are traversed in blocks, with metadata denoting the blocking pattern; high-degree vertices are processed separately with linear scans, and block sizes are tuned to the L2 cache.)
New: cache-blocked approach
• Instead of processing adjacencies of each vertex
serially, exploit sorted adjacency list structure w/
blocked accesses
• Requires multiple passes through the frontier array,
tuning for optimal block size.
• Note: frontier array size may be O(n)
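Not from the slides: a rough Python sketch of the cache-blocked idea for a single BFS level. Each pass over the frontier only touches distance entries whose vertex IDs fall in one block, so that portion of dist[] stays resident in cache; a real implementation would also use the sorted adjacency lists to scan only the relevant sub-range of each list per pass instead of re-reading every adjacency, and would tune block_size to the L2 cache as noted above.

    def bfs_step_cache_blocked(idx, adj, dist, frontier, level, block_size):
        # One BFS level, processed in multiple passes; pass k updates only
        # vertices v with k*block_size <= v < (k+1)*block_size.
        n = len(dist)
        next_frontier = []
        for lo in range(0, n, block_size):
            hi = lo + block_size
            for u in frontier:
                for v in adj[idx[u]:idx[u + 1]]:
                    if lo <= v < hi and dist[v] == -1:
                        dist[v] = level + 1
                        next_frontier.append(v)
        return next_frontier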
Vertex relabeling heuristic
Similar to older heuristics, but tuned for small-world networks:
1. High percentage of vertices with (out-)degrees 0, 1, and 2 in social and information networks => store their adjacencies explicitly (in the indexing data structure).
   – Augment the adjacency indexing data structure (with two additional words) and the frontier array (with one bit)
2. Process “high-degree vertices” adjacencies in linear order, but other vertices with d-array cache blocking.
3. Form dense blocks around high-degree vertices
   – Reverse Cuthill-McKee, removing degree-1 and degree-2 vertices
Architecture-specific Optimizations
1. Software prefetching on the Intel Core i7 (supports 32 loads and 20 stores
in flight)
– Speculative loads of index array and adjacencies of frontier vertices will
reduce compulsory cache misses.
2. Aligning adjacency lists to optimize memory accesses
– 16-byte aligned loads and stores are faster.
– Alignment helps reduce cache misses due to fragmentation
– 16-byte aligned non-temporal stores (during creation of new frontier) are fast.
3. SIMD SSE integer intrinsics to process “high-degree vertex” adjacencies.
4. Fast atomics (BFS is lock-free w/ low contention, and CAS-based intrinsics
have very low overhead)
5. Hugepage support (significant TLB miss reduction)
6. NUMA-aware memory allocation exploiting first-touch policy
Experimental Setup
Network      n        m       Max. out-degree   % of vertices w/ out-degree 0,1,2
Orkut        3.07M    223M    32K               5
LiveJournal  5.28M    77.4M   9K                40
Flickr       1.86M    22.6M   26K               73
Youtube      1.15M    4.94M   28K               76
R-MAT        8M-64M   8n      n^0.6             –
Intel Xeon 5560 (Core i7, “Nehalem”)
• 2 sockets x 4 cores x 2-way SMT
• 12 GB DRAM, 8 MB shared L3
• 51.2 GBytes/sec peak bandwidth
• 2.66 GHz proc.
Performance averaged over 10 different source vertices, 3 runs each.
Impact of optimization strategies
Optimization                                          Generality   Impact*   Tuning required?
(Preproc.) Sort adjacency lists                       High         --        No
(Preproc.) Permute vertex labels                      Medium       --        Yes
Preproc. + binning frontier vertices + cache blocking M            2.5x      Yes
Lock-free parallelization                             M            2.0x      No
Low-degree vertex filtering                           Low          1.3x      No
Software prefetching                                  M            1.10x     Yes
Aligning adjacencies, streaming stores                M            1.15x     No
Fast atomic intrinsics                                H            2.2x      No
* Optimization speedup (performance on 4 cores) w.r.t. the baseline parallel approach, on a synthetic R-MAT graph (n = 2^23, m = 2^26)
Cache locality improvement
Performance count: # of non-contiguous memory accesses (assuming a cache line size of 16 words)
(Figure: fraction of the theoretical peak (a count of m+3n non-contiguous memory accesses) achieved by the baseline, adjacency-sorted, and permute+cache-blocked variants on the Orkut, LiveJournal, Flickr, and Youtube data sets.)
Parallel performance (Orkut graph)
(Figure: BFS performance rate in millions of traversed edges per second vs. number of threads (1, 2, 4, 8), for cache-blocked BFS and the baseline.)
• Execution time: 0.28 seconds (8 threads)
• Parallel speedup: 4.9
• Speedup over baseline: 2.9
Graph: 3.07 million vertices, 220 million edges
Single socket of Intel Xeon 5560 (Core i7)
Lecture Outline
• Applications
• Review of key results
• Case studies: Graph traversal-based problems, parallel algorithms
  – Breadth-First Search
  – Single-source Shortest paths
  – Betweenness Centrality
  – Community Identification
Parallel Single-source Shortest Paths
(SSSP) algorithms
• Edge weights: concurrency primary challenge!
• No known PRAM algorithm that runs in sub-linear time and O(m + n log n) work
• Parallel priority queues: relaxed heaps [DGST88], [BTZ98]
• Ullman-Yannakakis randomized approach [UY90]
• Meyer and Sanders, ∆ - stepping algorithm [MS03]
• Distributed memory implementations based on graph
partitioning
• Heuristics for load balancing and termination detection
K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of A Parallel
Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm
Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.
∆ - stepping algorithm [MS03]
• Label-correcting algorithm: Can relax edges
from unsettled vertices also
• ∆ - stepping: “approximate bucket
implementation of Dijkstra’s algorithm”
• ∆: bucket width
• Vertices are ordered using buckets
representing priority range of size ∆
• Each bucket may be processed in parallel
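Not from the slides: a sequential Python sketch of ∆-stepping over CSR-style idx/adj/weight arrays (hypothetical names, with weight[e] the weight of the edge stored at adj[e]). The request construction and light-edge relaxations inside each bucket are the operations a parallel implementation performs concurrently; the illustration on the following slides walks through the same steps on a small example.

    import collections

    def delta_stepping(idx, adj, weight, n, source, delta):
        INF = float('inf')
        d = [INF] * n                              # tentative distances
        buckets = collections.defaultdict(set)     # bucket i holds d values in [i*delta, (i+1)*delta)

        def relax(v, new_dist):
            if new_dist < d[v]:
                if d[v] < INF:                     # move v out of its old bucket
                    old = int(d[v] / delta)
                    buckets[old].discard(v)
                    if not buckets[old]:
                        del buckets[old]
                d[v] = new_dist
                buckets[int(new_dist / delta)].add(v)

        relax(source, 0.0)
        i = 0
        while buckets:
            while i not in buckets:                # advance to the next non-empty bucket
                i += 1
            settled = set()                        # S: vertices deleted from bucket i
            while buckets.get(i):                  # light-edge phases, with reinsertions
                frontier = buckets.pop(i)          # clear the current bucket
                settled |= frontier
                requests = []                      # R: (target, candidate distance) pairs
                for u in frontier:
                    for e in range(idx[u], idx[u + 1]):
                        if weight[e] <= delta:     # light edges only
                            requests.append((adj[e], d[u] + weight[e]))
                for v, dist in requests:
                    relax(v, dist)
            for u in settled:                      # relax heavy edges once; no reinsertion into bucket i
                for e in range(idx[u], idx[u + 1]):
                    if weight[e] > delta:
                        relax(adj[e], d[u] + weight[e])
            i += 1
        return d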
∆ - stepping algorithm: illustration
∆ = 0.1 (say)
(Figure: a seven-vertex example graph with edge weights between 0.01 and 0.56; the following slides step through the algorithm, showing the tentative-distance array d, the buckets, the request set R, and the set S of deleted vertices after each operation.)
One parallel phase:
  while (bucket is non-empty)
    i) Inspect light edges
    ii) Construct a set of “requests” (R)
    iii) Clear the current bucket
    iv) Remember deleted vertices (S)
    v) Relax request pairs in R
  Relax heavy request pairs (from S)
  Go on to the next bucket
Initialization: insert s into bucket 0, set d(s) = 0.
(Walking through the example: relaxing the light edge out of the source sets d(2) = 0.01 and inserts vertex 2 into bucket 0; relaxing vertex 2’s light edges sets d(1) = 0.03 and d(3) = 0.06, which also land in bucket 0; once bucket 0 is empty, the heavy edges from S = {0, 2, 1, 3} are relaxed, giving d(4) = 0.16, d(5) = 0.29, and d(6) = 0.62 in buckets 1, 2, and 6.)
Classify edges as “heavy” and “light”.
Relax light edges (phase); repeat until B[i] is empty.
Relax heavy edges; no reinsertions in this step.
No. of phases (machine-independent performance count)
(Figure: number of ∆-stepping phases, on a log scale from 10 to 1,000,000, for the graph families Rnd-rnd, Rnd-logU, Scale-free, LGrid-rnd, LGrid-logU, SqGrid, USAd NE, USAt NE; low-diameter families need few phases, high-diameter families need many.)
Average shortest path weight for various graph families
~2^20 vertices, 2^22 edges, directed graph, edge weights normalized to [0,1]
(Figure: average shortest path weight, on a log scale from 0.01 to 100,000, for the graph families Rnd-rnd, Rnd-logU, Scale-free, LGrid-rnd, LGrid-logU, SqGrid, USAd NE, USAt NE.)
Last non-empty bucket (machine-independent performance count)
(Figure: index of the last non-empty bucket, on a log scale from 1 to 1,000,000, for the same graph families; fewer buckets means more parallelism.)
Number of bucket insertions (machine-independent performance count)
(Figure: total number of bucket insertions, on a scale from 0 to 12,000,000, for the same graph families.)
Lecture Outline
• Applications
• Review of key results
• Case studies: Graph traversal-based problems, parallel algorithms
  – Breadth-First Search
  – Single-source Shortest paths
  – Betweenness Centrality
  – Community Identification
Betweenness Centrality
• Centrality: Quantitative measure to capture the
importance of a vertex/edge in a graph
– degree, closeness, eigenvalue, betweenness
• Betweenness Centrality:
  BC(v) = Σ_{s ≠ v ≠ t} σ_st(v) / σ_st
  (σ_st: no. of shortest paths between s and t; σ_st(v): no. of those paths passing through v)
• Applied to several real-world networks
  – Social interactions
  – WWW
  – Epidemiology
  – Systems biology
Algorithms for Computing Betweenness
• All-pairs shortest path approach: compute the length and
number of shortest paths between all s-t pairs (O(n3) time),
sum up the fractional dependency values (O(n2) space).
• Brandes’ algorithm (2003): Augment a single-source shortest
path computation to count paths; uses the Bellman criterion;
O(mn) work and O(m+n) space.
Our New Parallel Algorithms
• Madduri, Bader (2006): parallel algorithms for computing
exact and approximate betweenness centrality
– low-diameter sparse graphs (diameter D = O(log n), m = O(n log n))
– Exact algorithm: O(mn) work, O(m+n) space, O(nD+nm/p) time.
• Madduri et al. (2009): New parallel algorithm with lower
synchronization overhead and fewer non-contiguous
memory references
– In practice, 2-3X faster than previous algorithm
– Lock-free => better scalability on large parallel systems
Parallel BC Algorithm
• Consider an undirected, unweighted graph
• High-level idea: Level-synchronous parallel
Breadth-First Search augmented to compute
centrality scores
• Two steps
– traversal and path counting
– dependency accumulation:
  δ(v) = Σ_{w : v ∈ P(w)} (σ(v)/σ(w)) · (1 + δ(w))
Parallel BC Algorithm Illustration
1. Traversal step: visit adjacent vertices, update distance and path counts.
(Figure: the ten-vertex example graph with source vertex 0; successive slides fill in the stack S, the distance array D, and the predecessor multisets P level by level.)
Level-synchronous approach: the adjacencies of all vertices in the current frontier can be visited in parallel.
1. Traversal step: at the end, we have all reachable vertices, their corresponding predecessor multisets, and D values.
Graph traversal step analysis
• Exploit concurrency in visiting adjacencies, as we assume that
the graph diameter is small: O(log n)
• Upper bound on size of each predecessor multiset: In-degree
• Potential performance bottlenecks: atomic updates to
predecessor multisets, atomic increments of path counts
• New algorithm: Based on observation that we don’t need to
store “predecessor” vertices. Instead, we store successor
edges along shortest paths.
– simplifies the accumulation step
– reduces an atomic operation in traversal step
– cache-friendly!
Graph Traversal Step locality analysis
for all vertices u at level d in parallel do        // frontier vertices are in a contiguous block (stack)
    for all adjacencies v of u in parallel do       // adjacencies of a vertex are stored compactly (graph rep.)
        dv = D[v];                                  // non-contiguous memory access
        if (dv < 0)
            vis = fetch_and_add(&Visited[v], 1);    // non-contiguous memory access
            if (vis == 0)
                D[v] = d+1;
                pS[count++] = v;
            fetch_and_add(&sigma[v], sigma[u]);     // non-contiguous memory access
            fetch_and_add(&Scount[u], 1);           // store to S[u]
        if (dv == d + 1)
            fetch_and_add(&sigma[v], sigma[u]);
            fetch_and_add(&Scount[u], 1);
Better cache utilization is likely if D[v], Visited[v], and sigma[v] are stored contiguously.
Parallel BC Algorithm Illustration
2. Accumulation step: pop vertices from the stack, update dependence scores:
   δ(v) = Σ_{w : v ∈ P(w)} (σ(v)/σ(w)) · (1 + δ(w))
(Figure: the stack S, the per-vertex delta values, and the predecessor multisets P for the example graph.)
2. Accumulation step: can also be done in a level-synchronous manner.
Accumulation step locality analysis
for level d = GraphDiameter-2 to 1 do
    for all vertices v at level d in parallel do            // vertices at a level are in a contiguous block (stack)
        for all w in S[v] in parallel do reduction(delta)   // each S[v] is a contiguous block
            delta_sum_v = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];   // only floating point operation in the code
        BC[v] = delta[v] = delta_sum_v;
Centrality Analysis applied to Protein Interaction Networks
Human genome core protein interactions: degree vs. betweenness centrality.
(Figure: scatter plot of betweenness centrality (1e-7 to 1e+0, log scale) versus degree (1 to 100, log scale), highlighting one protein with 43 interactions: Kelch-like protein 8, Ensembl ID ENSG00000145332.2.)
Lecture Outline
• Applications
• Review of key results
• Case studies: Graph traversal-based problems, parallel algorithms
  – Breadth-First Search
  – Single-source Shortest paths
  – Betweenness Centrality
  – Community Identification
Community Identification
• Implicit communities in large-scale
networks are of interest in many
cases.
– WWW
– Social networks
– Biological networks
• Formulated as a graph clustering
problem.
– Informally, identify/extract “dense”
sub-graphs.
• Several different objective
functions exist.
– Metrics based on intra-cluster vs. inter-cluster edges, community sizes, number of communities, overlap, …
• Highly studied research problem
– 100s of papers yearly in CS, Social Sciences,
Physics, Comp. Biology, Applied Math
journals and conferences.
Agglomerative Clustering, Parallelization
• Bottom-up approach: Start with n singleton communities,
iteratively merge pairs to form larger communities.
– What measure to minimize/maximize? A measure known as modularity
– How do we order merges? Use a priority queue
• Parallelization: perform multiple “independent” merges
simultaneously.
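Not from the slides: a small Python sketch of the modularity score that these agglomerative merges try to maximize, for an undirected graph and a given community assignment; candidate merges are scored by the change they would cause in this quantity.

    def modularity(n, edges, community):
        # Newman-Girvan modularity Q = sum over communities of
        # (intra-community edges / m) - (community degree / 2m)^2.
        m = len(edges)
        if m == 0:
            return 0.0
        deg = [0] * n
        intra = {}                             # edge count inside each community
        for u, v in edges:
            deg[u] += 1
            deg[v] += 1
            if community[u] == community[v]:
                intra[community[u]] = intra.get(community[u], 0) + 1
        deg_sum = {}                           # total degree of each community
        for v in range(n):
            deg_sum[community[v]] = deg_sum.get(community[v], 0) + deg[v]
        q = 0.0
        for c, dsum in deg_sum.items():
            q += intra.get(c, 0) / m - (dsum / (2.0 * m)) ** 2
        return q

    # Example: modularity(4, [(0, 1), (2, 3)], [0, 0, 1, 1]) returns 0.5.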
Graph Analysis with cache-based multicore systems
• SNAP: Small-world Network Analysis and Partitioning.
• 10-100x faster than competing graph analysis software.
– Parallelism, heuristics exploiting graph topology.
• Can process graphs with billions of vertices and edges.
• Open-source: http://snap-graph.sf.net
• Optimizations for multicore systems: improving cache locality, minimizing synchronization overhead with atomics.
D.A. Bader and K. Madduri, IPDPS 2008, Parallel Computing 2008.
Designing fast parallel graph algorithms
System requirements: high (on-chip memory, DRAM, network, I/O) bandwidth.
Solution: efficiently utilize the available memory bandwidth.
(Figure: problem complexity/locality (from “RandomAccess”-like to “Stream”-like) plotted against data size (n = # of vertices/edges, from 10^4 up to 10^12 and peta-scale and beyond); the levers are algorithmic innovation to avoid corner cases, improving locality, data reduction/compression, and faster methods, which reduce the number of passes over the data from ~n to log n to constant.)
Review of lecture
• Applications: Internet and WWW, Scientific
computing, Data analysis, Surveillance
• Earlier work on parallel graph algorithms
– PRAM algorithms, list ranking, connected components
– graph representations
• Parallel algorithm case studies
– BFS: locality, level-synchronous approach, multicore tuning
– Shortest paths: exploiting concurrency, parallel priority
queue
– Betweenness centrality: importance of locality, atomics
Future Research Challenges
• New methods/analytics: modeling network dynamics; persistent monitoring of dynamically changing properties.
• Software: portable, high-performance, extensible routines.
• Emerging systems: how do we best utilize GPUs, flash disks?
(Figure: “BIG DATA” at the center, feeding analytics, summarization, and visualization on one side and modeling & simulation on the other.)
Thank you!
• Questions?