SIAM PP10 Presentation

Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters

SimuTools, Malaga, Spain
March 16, 2010

Kalyan S. Perumalla, Ph.D.

Senior R&D Manager

Oak Ridge National Laboratory

Adjunct Professor

Georgia Institute of Technology

In a Nutshell

[Figure: hardware hierarchy, from Multi-Node to Node (Multi-GPU) to GPU to Block to Thread]

Hierarchical Hardware
• Multi-GPU
• Multi-core
• Network

Agent-based Model Execution
• Large scale
• Fine-grained

Challenges
• Latency spectrum
• Unified recursive solution

Dramatic improvements in speed

2 Managed by UT-Battelle for the U.S. Department of Energy

B2R Algorithm

[Figure: 3×3 grid of blocks, Block i,j mapped to processor P i,j (i, j = 0..2); one block annotated with B+2R]

SimuTools10 Presentation – Perumalla (ORNL)

Outline

ABMS
• Definition, Examples, Larger sizes, Demo, Time stepped, Parallel style

Computational Hierarchy
• Multi-GPU, Multi-CPU, MPI, CUDA, Access times, Latency problem

B2R Algorithm
• Basic idea, Hierarchical framework, Analysis equations, Cubic nature, Implementation

Performance Study
• CUDA, Pthreads, MPI, Lens cluster, Game of Life, Leadership, R vs. Improvement

Future Work
• Multi-GPU per node, OpenCL, More benchmarks, Unstructured inter-agent graphs

ABMS: Motivating Demonstrations

Agent Based Modeling and Simulation (ABMS)

• Game of Life (GOL)
• Afghan Leadership (LDR)
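As a reminder of what a time-stepped, fine-grained agent update looks like, here is a minimal Python sketch of one Game of Life step. It is an illustration only, not the code used in the talk; the function name `life_step` and the choice to treat cells outside the grid as dead are assumptions made for this example.

```python
def life_step(grid):
    """Advance a Game of Life grid by one time step.

    Assumption for this sketch: cells outside the grid count as dead.
    """
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Count live cells among the (up to) eight neighbours.
            n = sum(
                grid[a][b]
                for a in range(max(0, i - 1), min(rows, i + 2))
                for b in range(max(0, j - 1), min(cols, j + 2))
                if (a, b) != (i, j)
            )
            # Standard rules: birth on 3, survival on 2 or 3.
            new[i][j] = 1 if n == 3 or (grid[i][j] == 1 and n == 2) else 0
    return new


# A "blinker" oscillates between a horizontal and a vertical bar.
blinker = [[0, 0, 0],
           [1, 1, 1],
           [0, 0, 0]]
after = life_step(blinker)
```

Every cell reads only its immediate neighbours, which is why the grid can be cut into blocks and why neighbouring blocks must exchange boundary cells.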

GPU-based ABMS References



Examples:

• K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," in Agent Directed Simulation Symposium, 2008

• R. D'Souza, M. Lysenko, and K. Rehmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," in AGENT Conference on Complex Interaction and Social Emergence, 2007

Hierarchical GPU System Hardware

[Figure: hardware hierarchy, from Multi-Node to Node (Multi-GPU) to GPU to Block to Thread]

Computation Kernels on each GPU

E.g., CUDA Threads

• Host initiates “launch” of many SIMD threads
• Threads get “scheduled” in batches on GPU hardware
• CUDA claims extremely efficient thread-launch implementation
  – Millions of CUDA threads at once
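The launch model can be pictured with a small sequential emulation in Python. This is a sketch under stated assumptions: `launch` and `increment_kernel` are hypothetical names, and the loops run one thread at a time, whereas real GPU hardware schedules threads in parallel batches. The one detail taken from CUDA itself is the usual 1-D global index, block index times block size plus thread index.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a 1-D launch: call the kernel once per (block, thread) pair."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def increment_kernel(block_idx, block_dim, thread_idx, cells):
    """Each emulated thread handles exactly one cell."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(cells):                      # guard threads past the end
        cells[i] += 1


cells = [0] * 10
block_dim = 4
# Round up so that every cell is covered by some thread.
grid_dim = (len(cells) + block_dim - 1) // block_dim
launch(increment_kernel, grid_dim, block_dim, cells)
```

With ten cells and four threads per block, three blocks are launched and the last two threads do nothing, which is the reason for the bounds check.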

GPU Memory Types (CUDA)



• GPU memory comes in several flavors
  – Registers
  – Local Memory
  – Shared Memory
  – Constant Memory
  – Global Memory
  – Texture Memory
• An important challenge is organizing the application to make the most effective use of the hierarchy

GPU Communication Latencies (CUDA)

Memory Type      Speed                        Scope    Lifetime
Registers        Fastest (4 cycles)           Thread   Kernel
Shared Memory    Very fast (4-? cycles)       Block    Thread
Global Memory    100x slower (400- cycles)    Device   Process
Local Memory     150x slower (600 cycles)     Block    Thread
Texture Memory   Fast (10s of cycles)         Device   Process
Constant Memory  Fairly fast (read-only)      Device   Process

CUDA + MPI

• An economical cluster solution
  – Affordable GPUs, each providing one-node CUDA
  – MPI on gigabit Ethernet for inter-node communication
• Memory speed-constrained system
  – Inter-memory transfers can dominate runtime
  – Runtime overhead can be severe
• Need a way to tie CUDA and MPI
  – Algorithmic solution needed
  – Need to overcome latency challenge

Analogous Networked Multi-core System

[Figure: analogous hierarchy, from Multi-Node to Multi-Socket to Multi-Core to Thread]

Parallel Execution: Conventional Method

[Figure: 3×3 grid of blocks, Block i,j mapped to processor P i,j (i, j = 0..2)]

Latency Challenge: Conventional Method

• High latency between GPU and CPU memories
  – CUDA inter-memory data transfer primitives
• Very high latency across CPU memories
  – MPI communication for data transfers
• Naïve method gives very poor computation to communication ratio
  – Slow-downs instead of speedups
• Need latency resilient method …

Our Solution: B2R Method

[Figure: 3×3 grid of blocks, Block i,j mapped to processor P i,j (i, j = 0..2)]

B2R Algorithm

Let T_e be the total number of iterations in the simulation

1   For all blocks Block_ij in the given agent grid G
1.1   Let (tl_i, tl_j) be the top left index of Block_ij
1.2   Let (br_i, br_j) be the bottom right index of Block_ij
1.3   For t=0 to T_e/R
1.4     For r=R-1 down to 0
1.5       Update( tl_i-r, tl_j-r, br_i+r, br_j+r )
1.6     Communicate( tl_i, tl_j, br_i, br_j, R )
1.7     Barrier()

[Figure: B×B sub-block mapped to a processing element, surrounded by R layers of lagging cells; arrows show the direction of error propagation in R iterations]
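The idea can be checked on a small Game of Life grid. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: the grid wraps around at its edges, blocks are processed one after another instead of in parallel, "Communicate" is modelled as copying a block plus R halo layers out of a shared grid, and all function names are hypothetical. Each block updates a region that shrinks by one layer per iteration (r = R-1 down to 0), so after R local iterations its own B×B cells are exact even though the outer halo layers have gone stale.

```python
import random


def life_rule(alive, n):
    """Game of Life rule for one cell with n live neighbours."""
    return 1 if n == 3 or (alive and n == 2) else 0


def step_torus(g):
    """Reference: one conventional time step on an N×N wrap-around grid."""
    n = len(g)
    return [[life_rule(g[i][j], sum(g[(i + a) % n][(j + b) % n]
                                    for a in (-1, 0, 1) for b in (-1, 0, 1)
                                    if (a, b) != (0, 0)))
             for j in range(n)] for i in range(n)]


def load_local(g, tli, tlj, B, R):
    """'Communicate': copy the block plus R halo layers from the grid."""
    n, w = len(g), B + 2 * R
    return [[g[(tli - R + x) % n][(tlj - R + y) % n] for y in range(w)]
            for x in range(w)]


def update(local, lo, hi):
    """Update local cells lo..hi (inclusive, both dimensions) by one step."""
    new = [row[:] for row in local]
    for x in range(lo, hi + 1):
        for y in range(lo, hi + 1):
            cnt = sum(local[x + a][y + b]
                      for a in (-1, 0, 1) for b in (-1, 0, 1)
                      if (a, b) != (0, 0))
            new[x][y] = life_rule(local[x][y], cnt)
    return new


def b2r(g, B, R, rounds):
    """Run rounds*R time steps, exchanging data only once per round."""
    n = len(g)
    for _ in range(rounds):
        out = [[0] * n for _ in range(n)]
        for tli in range(0, n, B):
            for tlj in range(0, n, B):
                local = load_local(g, tli, tlj, B, R)
                for r in range(R - 1, -1, -1):
                    # Region is the block extended by r layers on each side.
                    local = update(local, R - r, R + B - 1 + r)
                for x in range(B):          # publish only the exact B×B core
                    for y in range(B):
                        out[tli + x][tlj + y] = local[R + x][R + y]
        g = out                             # plays the role of the barrier
    return g


random.seed(1)
N, B, R = 12, 4, 3
grid = [[random.randint(0, 1) for _ in range(N)] for _ in range(N)]
ref = grid
for _ in range(2 * R):
    ref = step_torus(ref)
result = b2r(grid, B, R, rounds=2)
print(result == ref)
```

The comparison at the end checks that two rounds with R = 3 give the same grid as six conventional steps. The price of exchanging data once per R iterations is the redundant work on the halo cells.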

Total Runtime Cost: Analytical Form

At any level in the hierarchy, total runtime F is given by:

[Equations: F_CPU and F_GPU as polynomials in R with a (4/3)R^3 leading term; the remaining symbols are illegible in this transcript]

Most interesting aspect: cubic in R!
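One way to see where a cubic term can come from is to count the cells a single block updates in one round of the B2R loop, where iteration r touches a (B+2r)×(B+2r) region. The Python sketch below covers only this update work and is an assumption-laden illustration: it is not the slide's full F_CPU or F_GPU expression, which also carries communication terms, and the function names are hypothetical.

```python
from fractions import Fraction


def cells_updated(B, R):
    """Cells updated by one block in one round (r = R-1 down to 0)."""
    return sum((B + 2 * r) ** 2 for r in range(R))


def closed_form(B, R):
    """Closed form of the same sum, written as a polynomial in R."""
    return (Fraction(4, 3) * R**3
            + (2 * B - 2) * R**2
            + (B * B - 2 * B + Fraction(2, 3)) * R)


# Exact comparison over a range of block sizes and R values.
agree = all(cells_updated(B, R) == closed_form(B, R)
            for B in range(1, 20) for R in range(1, 20))
```

The sum expands to (4/3)R^3 + (2B - 2)R^2 + (B^2 - 2B + 2/3)R, so the update work per round grows cubically with R while the number of exchanges per iteration falls as 1/R.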

Implications of being Cubic in R

• Benefits with B2R not immediately seen for small R
  – In fact, degradation for small R!
• Dramatic improvement possible after small R
  – Our experiments confirm this trend!
• Too large an R is bad too
  – Can’t profit indefinitely!
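The "can't profit indefinitely" point can be illustrated with a toy cost model in Python. Everything here is hypothetical: the model charges `c` per cell update plus one fixed `latency` per round, and the values B = 16, c = 1 and latency = 10000 are made up for the example. It is not the cost formula from the talk and does not reproduce the small-R degradation reported there; it only shows that the redundant halo work eventually outweighs the saved exchanges.

```python
def per_iteration_cost(B, R, c, latency):
    """Toy model: update work for one round plus one exchange, per iteration."""
    work = sum((B + 2 * r) ** 2 for r in range(R))  # cells updated in a round
    return (c * work + latency) / R


# Hypothetical parameters, chosen only to make the trade-off visible.
B, c, latency = 16, 1.0, 10000.0
costs = {R: per_iteration_cost(B, R, c, latency) for R in range(1, 65)}
best_R = min(costs, key=costs.get)
print(best_R, costs[1] / costs[best_R])
```

Under these assumed numbers the cheapest R lies strictly between the smallest and largest values tried, and the cost rises again beyond it.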

Sub-division Across Levels E.g., MPI to Blocks to Threads

[Figure: hardware hierarchy with a separate R at each level: Rm at the multi-node (MPI) level, Rb at the block level, Rt at the thread level]

Hierarchy and Recursive Use of B & R

• B2R can be applied at all levels!
• A different R can be chosen at every level, e.g.,
  – Rb for block-level R
  – Rt for thread-level R
• Simple constraints exist for possible values of R
  – Between R and B
  – Between R’s at different levels
  – Details in our paper

[Figure: E.g., CUDA hierarchy, from Multi-Node to Node (Multi-GPU) to GPU to Block to Thread]

B2R Implementation within CUDA

[Figure: global memory split into b×b logical blocks; shared memory per block holds a B+2R region; R state updates per block]

Performance

[Figure: Multi-Node GPU GOL - 16 mil Agents; x-axis: MPI Level R (Rm) = 1, 2, 4, 8; y-axis: 0 to 150; series: Rt=1, Rt=2, Rt=4]

[Figure: Multi-Node GPU LDR - 16 mil Agents; x-axis: MPI Level R (Rm) = 2, 4, 8; y-axis: 0 to 40; series: Rt=2, Rt=4, Rt=8]

Over 100× speedup with MPI+CUDA

Speedup relative to naïve method with no latency-hiding

Multi-GPU MPI+CUDA – Game of Life

[Figure: Multi-Node GPU LDR - 16 mil Agents; x-axis: MPI Level R (Rm) = 2, 4, 8; y-axis: 0% to 3500%; series: Rt=1, Rt=2, Rt=4, Rt=8]

Multi-core MPI+Pthreads – Game of Life

[Figure: Multi-Node CPU GOL - 1 bil Agents; x-axis: MPI Level R (Rm) = 1, 2, 3, 4; y-axis: 0% to 250%; series: Rt=1, Rt=2, Rt=4]

Multi-core MPI+Pthreads – Game of Life

[Figure: Multi-Node CPU GOL - 1 Billion Agents; x-axis: MPI Level R (Rm); y-axis: 0% to 250%]

Multi-core MPI+pthreads – Leadership

[Figure: Multi-Node CPU LDR - 1 bil Agents; x-axis: MPI Level R (Rm) = 2, 4, 8; y-axis: 0% to 250%; series: Rt=1, Rt=2, Rt=4]

Summary

• B2R Algorithm applies across heterogeneous, hierarchical platforms
  – Deep GPU hierarchies
  – Deep CPU multi-core systems
• Cubic nature of runtime dependence on R is a remarkable aspect
  – A maximum and minimum exist
  – Optimal (minimum) can be dramatically low
• Results show clear performance improvement
  – Up to 150x in the best case (fine grained)

Future Work

• Generate cross-platform code
  – E.g., implement in OpenCL
• Add to CUDA-MPI levels
  – Multi-GPU per node
• Implement and test with more benchmarks
  – E.g., from existing ABMS suites NetLogo & Repast
• Generalize to unstructured inter-agent graphs
  – E.g., social networks
• Potential to apply to other domains
  – E.g., stencil computations

Thank you!

Questions?

Additional material at our webpage: Discrete Computing Systems

www.ornl.gov/~2ip
