Parallel Graph Algorithms
Sathish Vadhiyar
Graph Traversal
Graph search plays an important role in
analyzing large data sets
Relationship between data objects
represented in the form of graphs
Breadth-first search used for finding shortest paths or sets of paths
Level-synchronized algorithm
Proceeds level-by-level starting with the
source vertex
Level of a vertex – its graph distance
from the source
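Before turning to the decomposition question, the level-synchronized algorithm can be sketched sequentially; a minimal Python version (function and variable names are my own):

```python
def level_synchronized_bfs(adj, source):
    """Level-synchronized BFS: returns a dict mapping each reachable
    vertex to its level, i.e. its graph distance from the source."""
    level = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        next_frontier = set()
        for v in frontier:            # expand every frontier vertex ...
            for w in adj[v]:
                if w not in level:    # ... keeping only unvisited neighbors
                    level[w] = depth
                    next_frontier.add(w)
        frontier = next_frontier      # advance exactly one level per round
    return level

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(level_synchronized_bfs(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

The parallel variants below distribute the frontier and the edge lists across processors while keeping this level-by-level outer loop.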
How to decompose the graph (vertices,
edges and adjacency matrix) among
processors?
Distributed BFS with 1D
Partitioning
Each vertex and edges emanating from it
are owned by one processor
1-D partitioning of the adjacency matrix
Edges emanating from vertex v form its edge list – the list of vertex indices in row v of the adjacency matrix A
1-D Partitioning
At each level, each processor owns a set F –
set of frontier vertices owned by the
processor
Edge lists of vertices in F are merged to
form a set of neighboring vertices, N
Some vertices of N owned by the same
processor, while others owned by other
processors
Messages are sent to those processors to
add these vertices to their frontier set for
the next level
Lvs(v) – level of v, i.e., its graph distance from the source vs
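The per-level merge-and-route step above can be simulated in one Python process; a sketch, where the cyclic ownership rule `v % p` is my own illustrative choice (the slides only require that every vertex has a single owner):

```python
def bfs_1d(adj, source, p):
    """Simulate distributed BFS with 1D partitioning on p processors.
    Vertex v is owned by processor v % p (an illustrative mapping)."""
    owner = lambda v: v % p
    level = {source: 0}
    F = [set() for _ in range(p)]        # F[i]: frontier vertices owned by proc i
    F[owner(source)].add(source)
    depth = 0
    while any(F):
        depth += 1
        msgs = [set() for _ in range(p)]  # msgs[i]: vertices destined for proc i
        for i in range(p):
            N = set()                     # merge edge lists of the local frontier
            for v in F[i]:
                N.update(adj[v])
            for w in N:                   # route each neighbor to its owner
                msgs[owner(w)].add(w)
        F = [set() for _ in range(p)]
        for i in range(p):                # owners keep only unvisited vertices
            for w in msgs[i]:
                if w not in level:
                    level[w] = depth
                    F[i].add(w)
    return level

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2], 4: []}  # vertex 4 unreachable
print(bfs_1d(adj, 0, 2))  # {0: 0, 1: 1, 2: 1, 3: 2}
```

In a real distributed run the `msgs` sets become point-to-point messages exchanged at the end of each level.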
2D Partitioning
P = R×C processor mesh
Adjacency matrix divided into R·C block rows and C block columns
A(i,j)(*) denotes a block owned by processor (i,j); each processor owns C blocks
2D Partitioning
Processor (i,j) owns vertices belonging to
block row (j-1).R+i
Thus a processor stores some edges incident on its own vertices, and some edges incident on vertices owned by other processors
2D Partitioning
Assume that the edge list for a given vertex is the
column of the adjacency matrix
Each block in the 2D partitioning contains partial
edge lists
Each processor has a frontier set of vertices, F,
owned by the processor
2D Partitioning
Expand Operation
Consider v in F
The owner of v sends messages to the other processors in its processor column announcing that v is in the frontier, since any of these processors may hold a partial edge list of v
2D Partitioning
Fold Operation
Partial edge lists on each processor
merged to form N – potential vertices in
the next frontier
Vertices in N sent to their owners to
form new frontier set F on those
processors
These owner processors are in the same
processor row
This communication step is referred to as the fold operation
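The expand and fold steps can be simulated in one Python process; in this sketch the 0-indexed block mapping and the assumption that R·C divides n (and C divides n) are illustrative choices of mine, not requirements of the slides:

```python
def bfs_2d(A, source, R, C):
    """Sketch of BFS with 2D partitioning over an R x C processor mesh.
    A is the n x n adjacency matrix (lists of 0/1); the edge list of
    vertex v is column v of A. Processor (i,j) owns the vertices of
    block row j*R+i (0-indexed) and holds the blocks of block rows
    {m*R+i : m = 0..C-1} within block column j."""
    n = len(A)
    P = R * C
    rpb = n // P                        # rows per block row
    cpb = n // C                        # columns per block column

    def owner(v):                       # processor owning vertex v
        b = v // rpb                    # block row containing v
        return (b % R, b // R)

    level = {source: 0}
    frontier = {owner(source): {source}}
    depth = 0
    while frontier:
        depth += 1
        # Expand: the owner of v notifies the R processors in v's block
        # column, since each of them holds a piece of column v
        notified = {}
        for vs in frontier.values():
            for v in vs:
                jc = v // cpb
                for i in range(R):
                    notified.setdefault((i, jc), set()).add(v)
        # Fold: each notified processor merges its partial edge lists
        # into N and routes every neighbor to that neighbor's owner
        incoming = {}
        for (i, j), vs in notified.items():
            N = set()
            for m in range(C):          # the C block rows this processor holds
                lo = (m * R + i) * rpb
                for v in vs:
                    N.update(r for r in range(lo, lo + rpb) if A[r][v])
            for r in N:
                incoming.setdefault(owner(r), set()).add(r)
        # Owners keep only unvisited vertices as the next frontier
        frontier = {}
        for proc, vs in incoming.items():
            new = {v for v in vs if v not in level}
            for v in new:
                level[v] = depth
            if new:
                frontier[proc] = new
    return level

n = 8
A = [[1 if abs(r - c) == 1 else 0 for c in range(n)] for r in range(n)]
print(bfs_2d(A, 0, 2, 2))  # on a path graph, vertex v lands at level v
```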
Analysis
Advantage of 2D over 1D – processor-column and processor-row communications involve only R and C processors, respectively
BFS on GPUs
One GPU thread for a vertex
In each iteration, each vertex looks at
its entry in the frontier array
If true, it explores its neighbors and adds unvisited ones to the next frontier
Severe load imbalance among the threads
Scope for improvement
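A CPU-side sketch of the one-thread-per-vertex scheme just described; the sequential loop stands in for the GPU threads, and the early `continue` marks where idle threads cause the load imbalance:

```python
def bfs_gpu_style(adj, n, source):
    """Simulation of a one-thread-per-vertex BFS iteration: the 'thread'
    for vertex v fires only if v's entry in the frontier array is set."""
    INF = float('inf')
    cost = [INF] * n
    cost[source] = 0
    frontier = [False] * n
    frontier[source] = True
    while any(frontier):
        next_frontier = [False] * n
        for v in range(n):          # each loop body = one GPU thread
            if not frontier[v]:
                continue            # threads with nothing to do still occupy
                                    # the GPU: the source of load imbalance
            for w in adj[v]:
                if cost[w] == INF:  # unvisited: set cost, mark next frontier
                    cost[w] = cost[v] + 1
                    next_frontier[w] = True
        frontier = next_frontier
    return cost

adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
print(bfs_gpu_style(adj, 5, 0))  # [0, 1, 1, 2, 3]
```

On a real GPU two threads may write `cost[w]` concurrently, but for unweighted BFS both write the same value, so the race is benign.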
Parallel Depth First Search
Easy to parallelize
Left subtree can be searched in parallel
with the right subtree
Statically assign a node to a processor –
the whole subtree rooted at that node
can be searched independently.
Can lead to load imbalance; Load
imbalance increases with the number of
processors
Dynamic Load Balancing (DLB)
Difficult to estimate the size of the
search space beforehand
Need to balance the search space among
processors dynamically
In DLB, when a processor runs out of
work, it gets work from another
processor
Maintaining Search Space
Each processor searches the space
depth-first
Unexplored states are saved on a stack; each processor maintains its own local stack
Initially, the entire search space
assigned to one processor
Work Splitting
When a processor receives a work request, it splits its search space
Half-split: Stack space divided into two equal
pieces – may result in load imbalance
Giving stack space near the bottom of the stack
can lead to giving bigger trees
Nodes near the top of the stack tend to root small trees
To avoid sending very small amounts of work –
nodes beyond a specified stack depth are not given
away – cutoff depth
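One way to realize the splitting rules above is a sketch like the following (assumptions of mine: the stack holds `(node, depth)` pairs, and "half the nodes" is taken as alternating picks so the requester receives a mix of large and small subtrees):

```python
def split_stack(stack, cutoff_depth):
    """Split a DFS stack on a work request. Nodes at or beyond
    cutoff_depth are never given away (they root very small trees);
    of the remaining nodes, every other one is donated."""
    splittable = [s for s in stack if s[1] < cutoff_depth]
    donated = splittable[::2]          # alternate eligible nodes -> requester
    donated_set = set(donated)
    kept = [s for s in stack if s not in donated_set]
    return kept, donated

stack = [("a", 0), ("b", 1), ("c", 2), ("d", 3), ("e", 4)]  # (node, depth)
kept, donated = split_stack(stack, cutoff_depth=3)
print(kept, donated)  # d and e stay put: they are beyond the cutoff depth
```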
Strategies
1. Send nodes near the bottom of the
stack
2. Send nodes near the cutoff depth
3. Send half the nodes between the
bottom of the stack and the cutoff
depth
Example: Figures 11.5(a) and 11.9
Load Balancing Strategies
Asynchronous round-robin: Each processor maintains its own target processor to request work from; the target is incremented modulo p after each request
Global round-robin: One single target
processor variable is maintained for all
processors
Random polling: randomly select a donor
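The three donor-selection strategies can be sketched as follows; the strategy names and the dict standing in for the single shared counter are conventions of this sketch, not of the slides:

```python
import random

GLOBAL_TARGET = {"t": 0}   # stands in for the one shared target variable

def make_target_selector(strategy, p, rank):
    """Return a zero-argument function yielding the next donor to poll."""
    if strategy == "async-rr":        # private target, incremented modulo p
        state = {"t": (rank + 1) % p}
        def nxt():
            t = state["t"]
            state["t"] = (t + 1) % p
            return t
    elif strategy == "global-rr":     # all processors advance one shared target
        def nxt():
            t = GLOBAL_TARGET["t"]
            GLOBAL_TARGET["t"] = (t + 1) % p
            return t
    else:                             # random polling: pick a donor uniformly
        def nxt():
            return random.randrange(p)
    return nxt

sel = make_target_selector("async-rr", p=4, rank=0)
print([sel() for _ in range(5)])  # [1, 2, 3, 0, 1]
```

Global round-robin distributes requests evenly but the shared counter is a contention point; random polling needs no shared state at all.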
Termination Detection
Dijkstra’s Token Termination Detection Algorithm
Based on passing a token around a logical ring; P0 initiates a token when idle; a processor holds the token until it has completed its work, and then passes it to the next processor; when P0 receives the token again, all processors have completed
However, a processor may get more work
after becoming idle
Algorithm Continued….
Taken care of by using white and black
tokens
Initially, the token is white; a processor j becomes black if it sends work to a processor i < j
When a black processor j completes its work and holds the token, it colors the token black before passing it to the next processor; after passing the token, j becomes white again
When P0 receives a black token, reinitiates
the ring
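A minimal sketch of the per-processor rules (class and method names are mine): a processor turns black on sending work backwards in the ring, and a black processor blackens the token before passing it on:

```python
WHITE, BLACK = "white", "black"

class Proc:
    """Per-processor state for token-based termination detection."""
    def __init__(self, rank):
        self.rank = rank
        self.color = WHITE
        self.token = None              # token currently held, if any

    def send_work(self, dest_rank):
        # Sending work to a lower-ranked processor may revive a processor
        # the token has already passed, so the sender turns black.
        if dest_rank < self.rank:
            self.color = BLACK

    def pass_token(self):
        # Called when idle and holding the token: a black processor
        # blackens the token, then whitens itself after passing it.
        out = BLACK if self.color == BLACK else self.token
        self.color = WHITE
        self.token = None
        return out

procs = [Proc(r) for r in range(3)]
procs[1].send_work(0)                  # P1 gave work to P0: P1 is now black
tok = WHITE                            # P0 initiates a white token when idle
for p in procs:
    p.token = tok
    tok = p.pass_token()
print(tok)  # 'black': P0 cannot declare termination and reinitiates the ring
```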
Tree-Based Termination Detection
Uses weights
Initially processor 0 has weight 1
When a processor transfers work to another processor, it halves its weight and sends the other half along with the work
When a processor finishes, it returns its weight to processor 0
Termination is detected when processor 0 gets back a total weight of 1
Goes with the DFS algorithm; No separate
communication steps
Figure 11.10
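The weight bookkeeping can be sketched as below; the class name is mine, and exact fractions are used so that repeated halving sums back to exactly 1 (with floats, underflow would eventually break the test):

```python
from fractions import Fraction

class WeightTermination:
    """Sketch of weight-based termination detection."""
    def __init__(self, p):
        self.w = [Fraction(0)] * p
        self.w[0] = Fraction(1)        # processor 0 starts with all the weight
        self.returned = Fraction(0)    # weight returned to processor 0

    def transfer(self, src, dst):
        # Work moves src -> dst along with half the sender's weight;
        # this piggybacks on the work message, needing no extra step.
        half = self.w[src] / 2
        self.w[src] = half
        self.w[dst] += half

    def finish(self, rank):
        # A finished processor returns its weight to processor 0.
        self.returned += self.w[rank]
        self.w[rank] = Fraction(0)

    def terminated(self):
        return self.returned == 1

wt = WeightTermination(3)
wt.transfer(0, 1)
wt.transfer(1, 2)
wt.finish(1)
wt.finish(2)
wt.finish(0)
print(wt.terminated())  # True: the halved weights summed back to exactly 1
```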
Minimal Spanning Tree, Single-Source
and All-pairs Shortest Paths
Minimal Spanning Tree – Prim’s
Algorithm
Spanning tree of a graph G(V,E) – a tree containing all vertices of G
MST – the spanning tree with minimum sum of edge weights
Vertices are added to a set Vt that holds the vertices of the MST; initially it contains an arbitrary vertex r as the root vertex
Minimal Spanning Tree – Prim’s
Algorithm
An array d such that d[v], for v in (V–Vt), holds the weight of the least-weight edge between v and any vertex in Vt; initially d[v] = w[r,v]
Find the vertex u in (V–Vt) with minimum d[u] and add it to Vt
Update d
Time complexity – O(n²)
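A sequential sketch of the algorithm just described, using the same d array; the convention that `None` marks a missing edge is mine:

```python
def prim_mst(w, r=0):
    """Prim's algorithm on an n x n weight matrix w (None = no edge).
    Returns the total MST weight. d[v] tracks the lightest edge
    between v and the growing tree Vt; each of the n-1 rounds scans
    d twice, giving the O(n^2) bound from the slides."""
    n = len(w)
    INF = float('inf')
    in_tree = [False] * n
    d = [w[r][v] if w[r][v] is not None else INF for v in range(n)]
    in_tree[r] = True
    d[r] = 0
    total = 0
    for _ in range(n - 1):
        # pick the minimum-d vertex not yet in the tree ...
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: d[v])
        in_tree[u] = True
        total += d[u]
        # ... then update d for the remaining vertices
        for v in range(n):
            if not in_tree[v] and w[u][v] is not None and w[u][v] < d[v]:
                d[v] = w[u][v]
    return total

w = [[None, 1, 3],
     [1, None, 1],
     [3, 1, None]]
print(prim_mst(w))  # 2: edges (0,1) and (1,2)
```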
Parallelization
The vertex set V and the d array are partitioned across p processors
Each processor finds the local minimum in its part of d
Then the global minimum is found by a reduction onto one processor
That processor finds the next vertex u and broadcasts it to all processors
Parallelization
All processors update d; the owning processor of u marks u as belonging to Vt
The processor responsible for v must know w[u,v] to update d[v]; hence a 1-D block mapping of the adjacency matrix
Complexity – O(n²/p) computation + O(n log p) communication
Single Source Shortest Path – Dijkstra’s Algorithm
Finds shortest paths from the source vertex to all vertices
Follows a structure similar to Prim’s
Instead of the d array, an array l that maintains the shortest path lengths is maintained
Follow similar parallelization scheme
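A sequential sketch showing the similarity to Prim's that the slide notes: only the selection key and the relaxation step change (again `None` marks a missing edge, a convention of mine):

```python
def dijkstra(w, src=0):
    """Dijkstra's SSSP on an n x n weight matrix w (None = no edge).
    l[v] holds the currently known shortest distance from src; the
    minimum-l vertex is finalized each round, as in Prim's O(n^2) loop."""
    n = len(w)
    INF = float('inf')
    done = [False] * n
    l = [INF] * n
    l[src] = 0
    for _ in range(n):
        u = min((v for v in range(n) if not done[v]), key=lambda v: l[v])
        done[u] = True
        # relax all edges out of u
        for v in range(n):
            if not done[v] and w[u][v] is not None and l[u] + w[u][v] < l[v]:
                l[v] = l[u] + w[u][v]
    return l

w = [[None, 1, 4],
     [1, None, 2],
     [4, 2, None]]
print(dijkstra(w, 0))  # [0, 1, 3]: the path 0-1-2 beats the direct edge 0-2
```

The same parallelization as Prim's applies: partition l, take a local minimum per processor, then a reduction and a broadcast.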
Single Source Shortest Path on
GPUs
SSSP on GPUs
A single kernel is not enough since the cost array Ca cannot be updated while it is being accessed
Hence costs are updated in a temporary array Ua
All-Pairs Shortest Paths
To find shortest paths between all pairs
of vertices
Dijkstra’s algorithm for single-source shortest paths can be run from every vertex
Two approaches
All-Pairs Shortest Paths
Source-partitioned formulation: Partition the
vertices across processors
Works well if p<=n; No communication
Can at best use only n processors
Time complexity?
Source-parallel formulation: Parallelize SSSP for a
vertex across a subset of processors
Do for all vertices with different subsets of
processors
Hierarchical formulation
Exploits more parallelism
Time complexity?
All-Pairs Shortest Paths
Floyd’s Algorithm
Consider a subset S = {v1,v2,…,vk} of
vertices for some k <= n
Consider finding shortest path between
vi and vj
Consider all paths from vi to vj whose
intermediate vertices belong to the set
S; Let pi,j(k) be the minimum-weight path
among them with weight di,j(k)
All-Pairs Shortest Paths
Floyd’s Algorithm
If vk is not in the shortest path, then
pi,j(k) = pi,j(k-1)
If vk is in the shortest path, then the
path is broken into two parts – from vi to
vk, and from vk to vj
So di,j(k) = min{di,j(k-1) , di,k(k-1) + dk,j(k-1) }
The length of the shortest path from vi
to vj is given by di,j(n).
In general, the solution is the matrix D(n)
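The recurrence above translates directly into code; a sketch in which d is updated in place, which is safe because row k and column k do not change during iteration k (for j = k, d[i][k] + d[k][k] = d[i][k]):

```python
def floyd(w):
    """Floyd's all-pairs shortest paths on an n x n weight matrix w
    (None = no edge, a convention of this sketch). After iteration k,
    d[i][j] equals the slides' d_ij^(k): the shortest i-to-j path using
    only intermediate vertices from {v1, ..., vk}."""
    n = len(w)
    INF = float('inf')
    d = [[0 if i == j else (w[i][j] if w[i][j] is not None else INF)
          for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # either vk is on the shortest path, or it is not
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

w = [[None, 1, 5],
     [1, None, 2],
     [5, 2, None]]
print(floyd(w)[0][2])  # 3: via vertex 1 instead of the weight-5 direct edge
```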
Parallel Formulation
2-D Block Mapping
Processors laid in a 2D mesh
During kth iteration, each process Pi,j
needs certain segments of the kth row
and kth column of the D(k-1) matrix
To compute dl,r(k), the following are needed:
dl,k(k-1) (from a process along the same
process row)
dk,r(k-1) (from a process along the same
process column)
Figure 10.8
Parallel Formulation
2D Block Mapping
During the kth iteration, each of the √p processes containing part of the kth row sends it to the √p–1 other processes in the same column; similarly, the kth column is sent along process rows
Figure 10.8
Time complexity?
APSP on GPUs
Space complexity of Floyd’s algorithm is O(V²) – impossible to go beyond a few vertices on GPUs
Uses V² threads
A single O(V) operation looping over O(V²) threads can exhibit slowdown due to high context-switching overhead between threads
Use Dijkstra’s – run the SSSP algorithm from every vertex in the graph
Requires only the final output size to be O(V²)
Intermediate outputs on the GPU can be O(V) and can be copied to CPU memory
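The scheme just described can be sketched on the CPU (function name is mine): one SSSP run per vertex, each needing only O(V) intermediate state, with each completed row shipped off before the next run starts:

```python
def apsp_by_sssp(w):
    """APSP by repeated SSSP on an n x n weight matrix w (None = no edge).
    Each source's run keeps only the O(V) arrays l and done; in the GPU
    scheme the finished row would be copied to CPU memory at this point,
    so only the final result is O(V^2)."""
    n = len(w)
    INF = float('inf')
    result = []
    for src in range(n):
        done = [False] * n
        l = [INF] * n
        l[src] = 0
        for _ in range(n):
            u = min((v for v in range(n) if not done[v]), key=lambda v: l[v])
            done[u] = True
            for v in range(n):
                if not done[v] and w[u][v] is not None and l[u] + w[u][v] < l[v]:
                    l[v] = l[u] + w[u][v]
        result.append(l)     # one O(V) row of the O(V^2) answer is complete
    return result

w = [[None, 1, 4],
     [1, None, 2],
     [4, 2, None]]
print(apsp_by_sssp(w))  # [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
```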
Sources/References
Paper: A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. Yoo et al. SC 2005.
Paper: Accelerating Large Graph Algorithms on the GPU Using CUDA. Harish and Narayanan. HiPC 2007.
Speedup Anomalies in DFS
The overall work (space searched) in
parallel DFS can be smaller or larger
than in sequential DFS
Can cause superlinear or sublinear
speedups
Figures 11.18, 11.19
Parallel Formulation
Pipelining
In the 2D formulation, the kth iteration in all processes starts only after the (k-1)th iteration completes in all the processes
A process can start working on the kth iteration as soon as it has computed the (k-1)th iteration and has the relevant parts of the D(k-1) matrix
Example: Figure 10.9
Time complexity