
An Effective GPU Implementation of Breadth-First Search
Lijuan Luo, Martin Wong and Wen-mei Hwu
Department of Electrical and Computer Engineering, UIUC
From DAC 2010
Outline

- Introduction
- Previous works
  - IIIT-BFS and a matrix-vector based BFS
- The proposed GPU solution
  - Why BFS and why GPU?
  - Architecture of GPU, CUDA
  - Hierarchical queue and kernel
  - Synchronization
  - Examples
- Experimental results and conclusions
Introduction

- The graphics processing unit (GPU) has become popular in parallel processing for its cost-effectiveness.
- Due to the GPU architecture, directly implementing the fastest CPU algorithm on it may cause huge overhead.
- Hence, the resulting speed-up is not always meaningful.
Introduction (cont.)

- Breadth-First Search has been widely used in EDA.
  - Maze routing, circuit simulation, STA, etc.
- Previous works are slower than the fastest CPU program on certain types of graphs.
Breadth-First Search (BFS)

- Given a graph G=(V,E) and a distinguished source vertex s.
- BFS explores the edges of G to discover every vertex that is reachable from s.
- It produces a breadth-first tree with root s that contains all reachable vertices.
Example
BFS (cont.)

- Traditional BFS algorithms use a queue to store the frontier.
- The complexity is O(V+E).
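For reference, a minimal sketch of the traditional queue-based BFS (host code; the adjacency-list representation is an assumption, not taken from the slides):

```cuda
// Queue-based BFS runs in O(V + E): every vertex enters the queue at most
// once and every edge is scanned at most once.
#include <queue>
#include <vector>

std::vector<int> bfs_levels(const std::vector<std::vector<int>> &adj, int s) {
    std::vector<int> level(adj.size(), -1);   // -1 marks "not yet discovered"
    std::queue<int> frontier;
    level[s] = 0;
    frontier.push(s);
    while (!frontier.empty()) {
        int u = frontier.front();
        frontier.pop();
        for (int v : adj[u])
            if (level[v] == -1) {             // first time v is reached
                level[v] = level[u] + 1;      // v joins the next frontier
                frontier.push(v);             // u becomes v's BFS-tree parent
            }
    }
    return level;
}
```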
Previous works

- IIIT-BFS [2] was the first work to implement a BFS algorithm on the GPU.
- [3] represents and performs BFS through matrix-vector multiplication.
IIIT-BFS

- Points out that maintaining the frontier queue can cause a huge overhead on the GPU.
- For each level, IIIT-BFS exhaustively checks every vertex to see whether it belongs to the current frontier.
- O(VL+E), where L is the total number of levels.
- In a sparse graph, E=O(V) and L can be O(V), hence O(VL+E)=O(V²).
BFS in Matrix-vector Multiplication

- Accelerates a matrix-based BFS algorithm for sparse graphs.
- Each frontier propagation can be transformed into a matrix-vector multiplication.
- O(V+EL), where L is the number of levels.
- In a sparse graph, E=O(V) and L can be O(V), hence O(V+EL)=O(V²).
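As a sketch of that transformation (the notation is mine, not from the slides): with adjacency matrix A, a 0/1 indicator vector x_k of the level-k frontier, and a 0/1 indicator v of visited vertices, one propagation step over the Boolean semiring is

```latex
% x_{k+1} marks exactly the unvisited neighbors of the current frontier.
x_{k+1} = \bigl(A^{\mathsf{T}} x_k\bigr) \wedge \neg v,
\qquad v \leftarrow v \vee x_{k+1}
```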
The proposed GPU solution

- Propagate from all the frontier vertices in parallel.
  - Since many EDA problems are formulated as sparse graphs, the number of neighbors of each frontier vertex is small.
- Hierarchical queue and kernel.
- Same complexity as the traditional CPU implementation.
Architecture of Nvidia GTX280

- A collection of 30 multiprocessors, with 8 streaming processors each.
- The 30 multiprocessors share one off-chip global memory.
  - Access time: about 300 clock cycles
- Each multiprocessor has an on-chip memory shared by its 8 streaming processors.
  - Access time: 2 clock cycles
Architecture diagram
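A hypothetical kernel (mine, not from the slides; it assumes 256-thread blocks and buffers sized to the grid) illustrating the two memory spaces, with data staged from slow off-chip global memory into a multiprocessor's fast on-chip shared memory:

```cuda
// Global memory: ~300-cycle access, visible to all multiprocessors.
// Shared memory: ~2-cycle access, visible only to the threads of one block.
__global__ void stage_in_shared(const int *g_in, int *g_out) {
    __shared__ int tile[256];                 // on-chip, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = g_in[i];              // one slow global-memory read
    __syncthreads();                          // wait until the tile is full
    // ...repeated work on tile[] here would hit only fast shared memory...
    g_out[i] = tile[threadIdx.x];             // one slow global-memory write
}
```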
Memory coalescing

- Several memory transactions can be coalesced into one transaction when consecutive threads access consecutive memory locations.
- Because the access time of global memory is relatively long, achieving coalescing is important.
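A hypothetical pair of kernels (mine, not from the slides) contrasting an access pattern the hardware can coalesce with one it cannot:

```cuda
// Coalesced: thread i reads in[i], so consecutive threads touch consecutive
// addresses and the hardware merges them into a few wide transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: a large stride scatters the threads' addresses, so each
// access may become its own ~300-cycle transaction.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];
}
```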
CUDA programming

- Compute Unified Device Architecture
- The CPU code does the sequential part.
- The highly parallel part is usually implemented in the GPU code, called a kernel.
- Calling a GPU function from CPU code is called a kernel launch.
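A minimal, self-contained example of this division of labor (the kernel body is a placeholder of my own, not from the slides):

```cuda
__global__ void add_one(int *data, int n) {      // GPU code: the kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;                     // runs once per thread
}

int main() {                                     // CPU code: sequential part
    const int n = 1024;
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemset(d, 0, n * sizeof(int));
    add_one<<<(n + 255) / 256, 256>>>(d, n);     // the kernel launch
    cudaDeviceSynchronize();                     // wait for the GPU to finish
    cudaFree(d);
    return 0;
}
```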
Hierarchical Queue Management

[Figure: hierarchical frontier structure]

Hierarchical Queue Management (cont.)

- G-Frontier: the frontier vertices shared by all the threads of a grid.
- B-Frontier: the frontier vertices common to a whole block.
- W-Frontier: the frontier vertices accessed only by certain threads of a warp.
Collision

- A collision means that more than one thread accesses the same queue at the same time.
  - Suppose there is only one queue and each SP has a thread returning new frontier vertices.
  - Then 8 threads access the same queue at once, and a collision happens.
Hierarchical Queue Management (cont.)

- Each W-Frontier maintains 8 queues so that no collision will happen within a W-Frontier (see the sketch below).
- The scheduling unit is the warp.
  - Each warp contains 32 threads,
  - executed as four 8-thread groups.
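A hypothetical sketch of the idea (queue layout, names, and sizes are mine, not the paper's code): each thread pushes into the W-Frontier queue selected by its lane within the 8-thread group, so the 8 SPs never contend on one queue:

```cuda
#define NUM_W_QUEUES 8
#define W_QUEUE_CAP  64

// Simplified: one set of 8 W-queues per block; the paper keeps a set per warp.
__global__ void w_frontier_sketch(const int *new_vertices, int n) {
    __shared__ int w_queue[NUM_W_QUEUES][W_QUEUE_CAP];
    __shared__ int w_tail[NUM_W_QUEUES];           // one tail counter per queue
    if (threadIdx.x < NUM_W_QUEUES) w_tail[threadIdx.x] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int q = threadIdx.x % NUM_W_QUEUES;        // lane in the 8-thread group
        int slot = atomicAdd(&w_tail[q], 1);       // counter private to queue q
        if (slot < W_QUEUE_CAP)
            w_queue[q][slot] = new_vertices[i];    // group lanes never collide
    }
    __syncthreads();
    // ...the W-queues would then be merged into the B-Frontier / G-Frontier...
}
```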
Synchronization issues

- Correct results require thread synchronization at the end of each queue level.
- General solution: launch one kernel for each queue level and implement a global barrier between two launched kernels.
- If we do that, the kernel-launch overhead will be huge.
- CUDA only provides a barrier function to synchronize the threads within a block.
Hierarchical Kernel Arrangement

- Only the highest layer uses the expensive synchronization method (the global barrier); the other layers use more efficient GPU synchronization.
- Intra-block synchronization (provided by CUDA) synchronizes the threads within a block.
- Inter-block synchronization [10] synchronizes threads across different blocks.
  - These two are GPU synchronization techniques that do not require the kernel to terminate; see the sketch below.
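A hypothetical sketch of such an inter-block barrier in the spirit of [10] (variable names are mine, not the authors' code); it is only correct when all blocks of the kernel are resident on the GPU at the same time:

```cuda
__device__ volatile int g_mutex = 0;          // global arrival counter

__device__ void inter_block_barrier(int goal_val) {
    __syncthreads();                          // intra-block barrier first
    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_mutex, 1);        // announce this block's arrival
        while (g_mutex != goal_val) ;         // spin until all blocks arrive
    }
    __syncthreads();                          // release the rest of the block
}
// Usage: the k-th barrier in a kernel of gridDim.x blocks passes
// goal_val = k * gridDim.x, so the counter never needs resetting.
```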
Example

[Figure: the hierarchical synchronization layers: global barrier (kernel sync.), inter-block sync., and intra-block sync.]
Another example

Assume that there are 100 vertices in the queue.
[Figure: a queue holding vertices 1 through 100]
First, launch a kernel. It creates a grid with a single block of 512 threads (only 100 of them are non-empty).
[Figure: a grid containing one block; threads for vertices 1 ~ 100, the remaining threads empty]
Another example (cont.)

The threads in that block synchronize by intra-block synchronization.

[Figure: threads filling 8 W-Frontiers and one B-Frontier, separated by intra-block sync.]
Another example (cont.)

Assume that after the computation we get 1000 new frontier vertices. Since 1000 > 512, the G-Frontier queue is used to hold all of the vertices, and the computation continues.
Threads in different blocks synchronize by inter-block synchronization.

[Figure: several blocks of threads joined by inter-block sync., sharing the G-Frontier in global memory]
Another example (cont.)

Once the number of new frontier vertices exceeds 15360 (30 multiprocessors × 512 threads, the most that can be resident at once), the kernel terminates and a kernel with 15360 threads is re-launched, repeating until this BFS level is finished.
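A hypothetical host-side sketch of this hierarchical kernel arrangement (kernel names, interfaces, and placeholder bodies are mine, not the paper's code):

```cuda
const int MAX_RESIDENT = 30 * 512;        // GTX280: all blocks resident at once

__device__ int d_next_size;               // size of the next frontier

__global__ void bfs_resident_kernel(int frontier_size) {
    // Placeholder body: would expand the frontier through the hierarchical
    // queues, using intra-/inter-block barriers between BFS levels.
    if (blockIdx.x == 0 && threadIdx.x == 0) d_next_size = 0;
}

__global__ void bfs_one_level_kernel(int frontier_size) {
    // Placeholder body: 15360 threads sweep one oversized BFS level.
    if (blockIdx.x == 0 && threadIdx.x == 0) d_next_size = 0;
}

void bfs_host_loop(int frontier_size) {
    while (frontier_size > 0) {
        if (frontier_size <= MAX_RESIDENT) {
            int blocks = (frontier_size + 511) / 512;  // 512-thread blocks
            bfs_resident_kernel<<<blocks, 512>>>(frontier_size);
        } else {
            // Global barrier: kernel termination + relaunch, once per level.
            bfs_one_level_kernel<<<30, 512>>>(frontier_size);
        }
        // Read back the next frontier size computed on the device.
        cudaMemcpyFromSymbol(&frontier_size, d_next_size, sizeof(int));
    }
}
```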
Experimental results

- Environment
  - A dual-socket, dual-core 2.4 GHz Opteron processor
  - 8 GB of memory
  - A single NVIDIA GeForce GTX280 GPU
Experimental results (cont.)

- The results on degree-6 regular graphs
  - similar to grid-based graphs
Experimental results (cont.)

- The results on real-world graphs
  - Average deg(V)=2, maximum deg(V)=8 or 9
Experimental results (cont.)

- The results on scale-free graphs
  - 0.1% of the vertices have degree 1000
  - The other vertices have an average degree of 6 and a maximum degree of 7
Conclusions

- The ideas proposed here had not been used on any other architecture before.
- The approach is most suitable for the sparse, near-regular graphs that are common in EDA.
- Key techniques: hierarchical queue management and hierarchical kernel arrangement.