Transcript SIAM PP10 Presentation
Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters
SimuTools, Malaga, Spain
March 16, 2010
Kalyan S. Perumalla, Ph.D.
Senior R&D Manager
Oak Ridge National Laboratory
Adjunct Professor
Georgia Institute of Technology
In a Nutshell
[Figure: hardware hierarchy: Thread, Block, GPU, Node (Multi-GPU), Multi-Node]
Hierarchical Hardware
• Multi-GPU
• Multi-core
• Network
Agent-based Model Execution
• Large scale
• Fine-grained
Challenges
• Latency spectrum
• Unified recursive solution
Dramatic improvements in speed
2 Managed by UT-Battelle for the U.S. Department of Energy
[Figure: B2R Algorithm. A 3×3 grid of blocks, Block i,j mapped to processor P i,j; labels R and B+2R mark a block extended by R layers on each side]
SimuTools10 Presentation – Perumalla (ORNL)
Outline
• ABMS
  – Definition, Examples, Larger sizes, Demo, Time stepped, Parallel style
• Computational Hierarchy
  – Multi-GPU, Multi-CPU, MPI, CUDA, Access times, Latency problem
• B2R Algorithm
  – Basic idea, Hierarchical framework, Analysis equations, Cubic nature, Implementation
• Performance Study
  – CUDA, Pthreads, MPI, Lens cluster, Game of Life, Leadership, R vs. Improvement
• Future Work
  – Multi-GPU per node, OpenCL, More benchmarks, Unstructured inter-agent graphs
ABMS: Motivating Demonstrations
Agent Based Modeling and Simulation (ABMS)
• Game of Life (GOL)
• Afghan Leadership (LDR)
GPU-based ABMS References
Examples:
K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," in Agent Directed Simulation Symposium, 2008
R. D'Souza, M. Lysenko, and K. Rehmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," in AGENT Conference on Complex Interaction and Social Emergence, 2007
Hierarchical GPU System Hardware
[Figure: nested hardware levels: Multi-Node, Node (Multi-GPU), GPU, Block, Thread]
Computation Kernels on each GPU
E.g., CUDA Threads
• Host initiates “launch” of many SIMD threads
• Threads get “scheduled” in batches on GPU hardware
• CUDA claims extremely efficient thread-launch implementation
  – Millions of CUDA threads at once
GPU Memory Types (CUDA)
GPU memory comes in several flavors
Registers
Local Memory
Shared Memory
Constant Memory
Global Memory
Texture Memory
An important challenge is organizing the application to make the most effective use of this hierarchy
GPU Communication Latencies (CUDA)
Memory Type       Speed                       Scope    Lifetime   Size
Registers         Fastest (4 cycles)          Thread   Kernel
Shared Memory     Very fast (4-? cycles)      Block    Thread
Global Memory     100x slower (400- cycles)   Device   Process
Local Memory      150x slower (600 cycles)    Block    Thread
Texture Memory    Fast (10s of cycles)        Device   Process
Constant Memory   Fairly fast (read-only)     Device   Process
CUDA + MPI
• An economical cluster solution
  – Affordable GPUs, each providing one-node CUDA
  – MPI on gigabit Ethernet for inter-node comm.
• Memory speed-constrained system
  – Inter-memory transfers can dominate runtime
  – Runtime overhead can be severe
• Need a way to tie CUDA and MPI
  – Algorithmic solution needed
  – Need to overcome latency challenge
Analogous Networked Multi-core System
[Figure: nested hardware levels: Multi-Node, Multi-Socket, Multi-Core, Thread]
Parallel Execution: Conventional Method
[Figure: 3×3 grid of blocks, Block i,j mapped to processor P i,j; each block has side B]
Latency Challenge: Conventional Method
• High latency between GPU and CPU memories
  – CUDA inter-memory data transfer primitives
• Very high latency across CPU memories
  – MPI communication for data transfers
• Naïve method gives very poor computation-to-communication ratio
  – Slow-downs instead of speedups
• Need latency-resilient method …
Our Solution: B2R Method
[Figure: 3×3 grid of blocks, Block i,j mapped to processor P i,j; each block of side B is surrounded by a border of R layers]
B2R Algorithm
Let T_e be the total number of iterations in the simulation
1 For all blocks Block_ij in the given agent grid G
  1.1 Let (tl_i, tl_j) be the top-left index of Block_ij
  1.2 Let (br_i, br_j) be the bottom-right index of Block_ij
  1.3 For t = 0 to T_e/R
    1.4 For r = R-1 down to 0
      1.5 Update(tl_i - r, tl_j - r, br_i + r, br_j + r)
    1.6 Communicate(tl_i, tl_j, br_i, br_j, r)
    1.7 Barrier()
[Figure: a B×B sub-block mapped to processing element p, surrounded by R layers of lagging cells; arrows show the direction of error propagation in R iterations]
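The loop structure above can be illustrated with a minimal Python sketch using Game of Life on a small wrap-around grid. The function names (`step_region`, `reference_run`, `b2r_run`), the toroidal boundary, and gathering the halo at the start of each round rather than after the updates are all assumptions made for illustration; this is not the authors' CUDA/MPI implementation. The sketch shows that R local updates on a region shrinking from side B+2(R-1) down to B reproduce plain time-stepped execution while exchanging data only once every R steps.

```python
def life_rule(alive, neighbors):
    """Conway's Game of Life rule for a single cell."""
    return 1 if neighbors == 3 or (alive and neighbors == 2) else 0


def step_region(cells, lo, hi):
    """Return a copy of `cells` with rows/cols lo..hi-1 advanced one step.

    Cells outside that square are copied unchanged, so they lag behind.
    Requires 1 <= lo and hi <= len(cells) - 1 so neighbor reads stay in range.
    """
    new = [row[:] for row in cells]
    for i in range(lo, hi):
        for j in range(lo, hi):
            n = sum(cells[i + di][j + dj]
                    for di in (-1, 0, 1) for dj in (-1, 0, 1)
                    if (di, dj) != (0, 0))
            new[i][j] = life_rule(cells[i][j], n)
    return new


def reference_run(grid, steps):
    """Plain time-stepped execution on a toroidal grid, no decomposition."""
    n = len(grid)
    for _ in range(steps):
        # Pad by one wrap-around layer, step the interior, strip the pad.
        padded = [[grid[(i - 1) % n][(j - 1) % n] for j in range(n + 2)]
                  for i in range(n + 2)]
        padded = step_region(padded, 1, n + 1)
        grid = [row[1:n + 1] for row in padded[1:n + 1]]
    return grid


def b2r_run(grid, steps, B, R):
    """Block-decomposed execution that exchanges data once every R steps."""
    n = len(grid)
    assert n % B == 0 and steps % R == 0
    for _ in range(steps // R):
        new_grid = [[0] * n for _ in range(n)]
        for bi in range(0, n, B):      # one B x B block per processing element
            for bj in range(0, n, B):
                # "Communicate": gather the block plus R halo layers,
                # giving a local array of side B + 2R.
                local = [[grid[(bi - R + i) % n][(bj - R + j) % n]
                          for j in range(B + 2 * R)]
                         for i in range(B + 2 * R)]
                # R local updates on a shrinking region, r = R-1 down to 0.
                for r in range(R - 1, -1, -1):
                    local = step_region(local, R - r, R + B + r)
                # Only the B x B core is guaranteed correct after R steps.
                for i in range(B):
                    new_grid[bi + i][bj:bj + B] = local[R + i][R:R + B]
        grid = new_grid                # all blocks finish a round: "Barrier"
    return grid


if __name__ == "__main__":
    demo = [[0] * 12 for _ in range(12)]
    for i, j in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:  # a glider
        demo[i][j] = 1
    print(b2r_run(demo, 8, B=4, R=4) == reference_run(demo, 8))
```

With R = 1 the sketch reduces to the conventional method: one halo exchange per iteration and no redundant updates.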
Total Runtime Cost: Analytical Form
At any level in the hierarchy, total runtime F is given by:
$F_{CPU} = a\left[\tfrac{4}{3}R^3 + (2B+2)R^2 + \left(B^2 + 2B + \tfrac{2}{3}\right)R + B^2\right] + c$
$F_{GPU} = a\left[\tfrac{4}{3}R^3 + (2B+2)R^2 + \left(B^2 + 2B + \tfrac{2}{3}\right)R + B^2\right] + b\left[(B+2R)^2\right] + c$
Most interesting aspect: cubic in R!
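A small Python sketch can make the cost expressions on this slide concrete. It rests on several assumptions that are not stated in the slides: that the bracketed cubic is the closed form of the sum of updated areas (B+2r)^2 for r = 0..R (the operator signs are inferred from that sum), that a, b and c are per-cell compute, per-cell memory-copy and per-exchange communication costs, and that dividing F by R is a reasonable way to amortize the cost over iterations. The constants used below are hypothetical.

```python
from fractions import Fraction


def updates_closed_form(B, R):
    """Bracketed cubic in R (signs assumed, see the note above)."""
    return (Fraction(4, 3) * R**3 + (2 * B + 2) * R**2
            + (B * B + 2 * B + Fraction(2, 3)) * R + B * B)


def updates_by_summation(B, R):
    """Assumed origin of the cubic: areas (B + 2r)^2 summed over r = 0..R."""
    return sum((B + 2 * r) ** 2 for r in range(R + 1))


def f_cpu(B, R, a, c):
    """F_CPU: compute cost a per updated cell plus communication cost c."""
    return a * updates_closed_form(B, R) + c


def f_gpu(B, R, a, b, c):
    """F_GPU: F_CPU plus cost b per cell of the (B + 2R)^2 block."""
    return a * updates_closed_form(B, R) + b * (B + 2 * R) ** 2 + c


def best_r(B, a, c, r_max=32):
    """R in 1..r_max that minimizes F_CPU / R (illustrative normalization)."""
    return min(range(1, r_max + 1), key=lambda R: f_cpu(B, R, a, c) / R)


if __name__ == "__main__":
    # Hypothetical constants: cheap per-cell updates, expensive exchanges.
    for c in (100, 10000):
        print("c =", c, "-> best R =", best_r(16, a=1, c=c))
```

Under these assumptions, a larger communication cost c pushes the best R upward, while the cubic growth in redundant updates eventually makes very large R unprofitable.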
Implications of being Cubic in R
• Benefits with B2R not immediately seen for small R
  – In fact, degradation for small R!
• Dramatic improvement possible after small R
  – Our experiments confirm this trend!
• Too large an R is bad too
  – Can’t profit indefinitely!
Sub-division Across Levels E.g., MPI to Blocks to Threads
[Figure: hardware hierarchy annotated with a separate R per level: R_m at the MPI (multi-node) level, R_b at the block level, R_t at the thread level]
Hierarchy and Recursive Use of B & R
• B2R can be applied at all levels!
• A different R can be chosen at every level, e.g.
  – R_b for block-level R
  – R_t for thread-level R
• Simple constraints exist for possible values of R
  – Between R and B
  – Between R’s at different levels
  – Details in our paper
E.g., CUDA Hierarchy
[Figure: CUDA hierarchy: Multi-Node, Node (Multi-GPU), GPU, Block, Thread]
B2R Implementation within CUDA
[Figure: B2R within CUDA. Labels: global memory, split into b×b logical blocks; shared memory per block; B+2R; R state updates]
Performance
[Figure: "Multi-Node GPU GOL - 16 mil Agents": speedup (axis 0 to 150) vs. MPI Level R (Rm = 1, 2, 4, 8); series Rt=1, Rt=2, Rt=4]
[Figure: "Multi-Node GPU LDR - 16 mil Agents": speedup (axis 0 to 40) vs. MPI Level R (Rm = 2, 4, 8); series Rt=2, Rt=4, Rt=8]
• Over 100× speedup with MPI+CUDA
• Speedup relative to naïve method with no latency-hiding
Multi-GPU MPI+CUDA – Game of Life
[Figure: "Multi-Node GPU LDR - 16 mil Agents": improvement (axis 0% to 3500%) vs. MPI Level R (Rm = 2, 4, 8); one series per thread-level Rt]
Multi-core MPI+pthreads– Game of Life
[Figure: "Multi-Node CPU GOL - 1 bil Agents": improvement (axis 0% to 250%) vs. MPI Level R (Rm); series Rt=1, Rt=2, Rt=4]
Multi-core MPI+pthreads – Leadership
[Figure: "Multi-Node CPU LDR - 1 bil Agents": improvement (axis 0% to 250%) vs. MPI Level R (Rm = 2, 4, 8); one series per thread-level Rt]
Summary
• B2R algorithm applies across heterogeneous, hierarchical platforms
  – Deep GPU hierarchies
  – Deep CPU multi-core systems
• Cubic nature of runtime dependence on R is a remarkable aspect
  – A maximum and minimum exist
  – Optimal (minimum) can be dramatically low
• Results show clear performance improvement
  – Up to 150x in the best case (fine grained)
Future Work
• Generate cross-platform code
  – E.g., implement in OpenCL
• Add to CUDA-MPI levels
  – Multi-GPU per node
• Implement and test with more benchmarks
  – E.g., from existing ABMS suites NetLogo & Repast
• Generalize to unstructured inter-agent graphs
  – E.g., social networks
• Potential to apply to other domains
  – E.g., stencil computations
Thank you!
Questions?
Additional material at our webpage: Discrete Computing Systems