SIAM PP10 Presentation


Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters

SimuTools, Malaga, Spain

March 16, 2010

Kalyan S. Perumalla, Ph.D.

Senior R&D Manager

Oak Ridge National Laboratory

Adjunct Professor

Georgia Institute of Technology

In a Nutshell

[Figure: hardware hierarchy with levels Multi-Node, Node (Multi-GPU), GPU, Block, Thread]

Hierarchical Hardware
• Multi-GPU
• Multi-core
• Network

Agent-based Model Execution
• Large scale
• Fine-grained

Challenges
• Latency spectrum
• Unified recursive solution

Dramatic improvements in speed

2 Managed by UT-Battelle for the U.S. Department of Energy

B2R Algorithm

[Figure: 3×3 grid of blocks, Block i,j mapped to processor P i,j for i, j = 0..2; one block is shown extended by R on each side to a B+2R region]

SimuTools10 Presentation – Perumalla (ORNL)

Outline

ABMS
• Definition, Examples, Larger sizes, Demo, Time stepped, Parallel style

Computational Hierarchy
• Multi-GPU, Multi-CPU, MPI, CUDA, Access times, Latency problem

B2R Algorithm
• Basic idea, Hierarchical framework, Analysis equations, Cubic nature, Implementation

Performance Study
• CUDA, Pthreads, MPI, Lens cluster, Game of Life, Leadership, R vs. Improvement

Future Work
• Multi-GPU per node, OpenCL, More benchmarks, Unstructured inter-agent graphs

ABMS: Motivating Demonstrations

Agent Based Modeling and Simulation (ABMS)

Game of Life (GOL)

Afghan Leadership (LDR)

GPU-based ABMS References

Examples:

K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," in Agent Directed Simulation Symposium, 2008

R. D'Souza, M. Lysenko, and K. Rehmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," in AGENT Conference on Complex Interaction and Social Emergence, 2007


Hierarchical GPU System Hardware

[Figure: hardware hierarchy with levels Multi-Node, Node (Multi-GPU), GPU, Block, Thread]

Computation Kernels on each GPU

E.g., CUDA Threads

Host initiates “launch” of many SIMD threads

Threads get “scheduled” in batches on GPU hardware

CUDA claims extremely efficient thread-launch implementation

– Millions of CUDA threads at once

GPU Memory Types (CUDA)

GPU memory comes in several flavors

Registers

Local Memory

Shared Memory

Constant Memory

Global Memory

Texture Memory

An important challenge is organizing the application to make the most effective use of this memory hierarchy

GPU Communication Latencies (CUDA)

Memory Type       Speed                        Scope    Lifetime   Size
Registers         Fastest (4 cycles)           Thread   Kernel
Shared Memory     Very fast (4 -? cycles)      Block    Thread
Global Memory     100x slower (400- cycles)    Device   Process
Local Memory      150x slower (600 cycles)     Block    Thread
Texture Memory    Fast (10s of cycles)         Device   Process
Constant Memory   Fairly fast (read-only)      Device   Process

CUDA + MPI

An economical cluster solution
– Affordable GPUs, each providing one-node CUDA
– MPI on gigabit Ethernet for inter-node communication

Memory speed-constrained system
– Inter-memory transfers can dominate runtime
– Runtime overhead can be severe

Need a way to tie CUDA and MPI
– Algorithmic solution needed
– Need to overcome latency challenge

Analogous Networked Multi-core System

[Figure: analogous multi-core hierarchy with levels Multi-Node, Multi-Socket, Multi-Core, Thread]

Parallel Execution: Conventional Method

[Figure: 3×3 grid of blocks, Block i,j mapped to processor P i,j for i, j = 0..2; each block has side B]

Latency Challenge: Conventional Method

High latency between GPU and CPU memories
– CUDA inter-memory data transfer primitives

Very high latency across CPU memories
– MPI communication for data transfers

Naïve method gives a very poor computation-to-communication ratio
– Slowdowns instead of speedups

Need a latency-resilient method …

Our Solution: B2R Method

[Figure: 3×3 grid of blocks, Block i,j mapped to processor P i,j for i, j = 0..2; a block of side B is shown with R extra layers on each side]

B2R Algorithm

Let T_e be the total number of iterations in the simulation

1   For all blocks Block_ij in the given agent grid G
1.1   Let (tl_i, tl_j) be the top-left index of Block_ij
1.2   Let (br_i, br_j) be the bottom-right index of Block_ij
1.3   For t = 0 to T_e/R
1.4     For r = R-1 down to 0
1.5       Update(tl_i - r, tl_j - r, br_i + r, br_j + r)
1.6     Communicate(tl_i, tl_j, br_i, br_j, R)
1.7     Barrier()

[Figure: a B×B sub-block mapped to processing element p, surrounded by R layers of lagging cells; arrows show the direction of error propagation over R iterations]
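The steps above can be sketched on a single process for the Game of Life benchmark. This is a minimal illustration, not the presented implementation: it assumes a square grid with wraparound (torus) edges, replaces Communicate with a plain copy of the block plus R halo layers, and uses hypothetical names (`step_torus`, `shrink_step`, `b2r_round`). Each local update shrinks the valid region by one layer per side, so after R updates exactly the B×B block remains.

```python
import random


def life_rule(alive, neighbours):
    """Conway's rule: birth on 3 neighbours, survival on 2 or 3."""
    return 1 if neighbours == 3 or (alive and neighbours == 2) else 0


def step_torus(grid):
    """Reference: one global step on an n x n torus (exchange every step)."""
    n = len(grid)
    return [[life_rule(grid[i][j],
                       sum(grid[(i + di) % n][(j + dj) % n]
                           for di in (-1, 0, 1) for dj in (-1, 0, 1)) - grid[i][j])
             for j in range(n)] for i in range(n)]


def shrink_step(tile):
    """One step on a square tile; returns its interior (one layer smaller per side)."""
    m = len(tile)
    return [[life_rule(tile[i][j],
                       sum(tile[i + di][j + dj]
                           for di in (-1, 0, 1) for dj in (-1, 0, 1)) - tile[i][j])
             for j in range(1, m - 1)] for i in range(1, m - 1)]


def b2r_round(grid, B, R):
    """Advance every B x B block by R steps using a single halo fetch of width R.

    Returns the new grid and the total number of cell updates performed.
    """
    n = len(grid)
    assert n % B == 0, "sketch assumes the grid side is a multiple of B"
    new = [[0] * n for _ in range(n)]
    updates = 0
    for bi in range(0, n, B):
        for bj in range(0, n, B):
            # Stand-in for Communicate: copy the block plus R halo layers (side B + 2R).
            tile = [[grid[(bi + i) % n][(bj + j) % n]
                     for j in range(-R, B + R)] for i in range(-R, B + R)]
            # Update for r = R-1 down to 0: the updated region has side B + 2r.
            for r in range(R - 1, -1, -1):
                tile = shrink_step(tile)
                updates += len(tile) ** 2
            for i in range(B):
                new[bi + i][bj:bj + B] = tile[i]
    return new, updates


if __name__ == "__main__":
    random.seed(0)
    n, B, R = 12, 4, 3
    grid = [[random.randint(0, 1) for _ in range(n)] for _ in range(n)]
    ref = grid
    for _ in range(R):
        ref = step_torus(ref)
    out, updates = b2r_round(grid, B, R)
    print(out == ref, updates)
```

One B2R round should produce the same grid as R conventional steps, at the price of redundant updates in the halo layers.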

Total Runtime Cost: Analytical Form

At any level in the hierarchy, total runtime F is given by:

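$$F_{\mathrm{CPU}} = a\left[\tfrac{4}{3}R^{3} + (2B-2)R^{2} + \left(B^{2}-2B+\tfrac{2}{3}\right)R \;\square\; B^{2}\right] + \cdots + c$$

$$F_{\mathrm{GPU}} = a\left[\tfrac{4}{3}R^{3} + (2B-2)R^{2} + \left(B^{2}-2B+\tfrac{2}{3}\right)R \;\square\; B^{2}\right] + b\left[B + 2R^{2}\right] + c$$

(□ and ⋯ mark positions where an operator or term is not legible in the slide text.)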

Most interesting aspect: cubic in R!
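The cubic terms in the bracket multiplied by a can be cross-checked against the B2R loop. Reading the lost signs as (4/3)R³ + (2B − 2)R² + (B² − 2B + 2/3)R, the polynomial equals the number of cell updates one block performs per round, since the Update call at offset r covers a square of side B + 2r for r = 0 .. R−1. The sketch below checks that identity with exact rational arithmetic; the function names are hypothetical, and the remaining terms of F (the trailing B², b and c) are not modeled.

```python
from fractions import Fraction


def updates_by_sum(B, R):
    """Cell updates per block per round: sum of (B + 2r)^2 for r = 0 .. R-1."""
    return sum((B + 2 * r) ** 2 for r in range(R))


def updates_closed_form(B, R):
    """Closed form of the same sum, using exact fractions for 4/3 and 2/3."""
    return (Fraction(4, 3) * R ** 3
            + (2 * B - 2) * R ** 2
            + (B * B - 2 * B + Fraction(2, 3)) * R)


if __name__ == "__main__":
    for B, R in [(4, 3), (16, 8), (32, 1)]:
        print(B, R, updates_by_sum(B, R), updates_closed_form(B, R))
```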

Implications of being Cubic in R

Benefits with B2R not immediately seen for small R
– In fact, degradation for small R!

Dramatic improvement possible after small R
– Our experiments confirm this trend!

Too large an R is bad too
– Can’t profit indefinitely!
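The last point can be illustrated with a toy model. It assumes, hypothetically, that each cell update costs a and each exchange costs a fixed latency c, and it ignores every other term of F. Under those assumptions, amortizing c over R iterations helps at first, but the redundant halo updates grow cubically and eventually dominate. The constants below are illustrative, not measured, and this simplified model does not capture the small-R degradation noted above.

```python
def cell_updates_per_round(B, R):
    """Cell updates one block performs in a round of R local iterations."""
    return sum((B + 2 * r) ** 2 for r in range(R))


def cost_per_iteration(B, R, a, c):
    """Toy cost per simulated iteration: compute plus one exchange, spread over R."""
    return (a * cell_updates_per_round(B, R) + c) / R


def best_R(B, a, c, R_max):
    """R in 1..R_max that minimizes the toy per-iteration cost."""
    return min(range(1, R_max + 1),
               key=lambda R: cost_per_iteration(B, R, a, c))


if __name__ == "__main__":
    B, a, c = 32, 1.0, 10000.0  # hypothetical block side and cost constants
    for R in (1, 2, 4, 8, 16, 32):
        print(R, cost_per_iteration(B, R, a, c))
    print("best R:", best_R(B, a, c, 32))
```

With a large latency c the minimum falls strictly inside the range of R; with c = 0 the model prefers R = 1, because the extra halo updates then buy nothing.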

Sub-division Across Levels

E.g., MPI to Blocks to Threads

[Figure: hardware hierarchy (Multi-Node, Node (Multi-GPU), GPU, Block, Thread) annotated with a separate R per level: R_m at the multi-node (MPI) level, R_b at the block level, R_t at the thread level]

Hierarchy and Recursive Use of B & R

B2R can be applied at all levels!

A different R can be chosen at every level, e.g.,
– R_b for block-level R
– R_t for thread-level R

Simple constraints exist for possible values of R
– Between R and B
– Between R’s at different levels
– Details in our paper

E.g., CUDA Hierarchy

[Figure: hardware hierarchy with levels Multi-Node, Node (Multi-GPU), GPU, Block, Thread]

B2R Implementation within CUDA

[Figure labels: global memory; shared memory per block; B+2R region split into b×b logical blocks; R state updates]

Performance

[Figure: Multi-Node GPU GOL, 16 mil agents. x-axis: MPI Level R (Rm); y-axis: speedup (0 to 150); one series per Rt]

[Figure: Multi-Node GPU LDR, 16 mil agents. x-axis: MPI Level R (Rm); y-axis: speedup (0 to 40); one series per Rt]

Over 100× speedup with MPI+CUDA

Speedup relative to naïve method with no latency-hiding

Multi-GPU MPI+CUDA – Game of Life

[Figure: Multi-Node GPU LDR, 16 mil agents. x-axis: MPI Level R (Rm); y-axis in percent (0% to 3500%); one series per Rt]

Multi-core MPI+Pthreads – Game of Life

[Figure: Multi-Node CPU GOL, 1 bil agents. x-axis: MPI Level R (Rm); y-axis in percent (0% to 250%); series: Rt = 1, 2, 4]

Multi-core MPI+Pthreads – Game of Life

[Figure: Multi-Node CPU GOL, 1 billion agents. x-axis: MPI Level R (Rm); y-axis in percent (0% to 250%)]

Multi-core MPI+Pthreads – Leadership

[Figure: Multi-Node CPU LDR, 1 bil agents. x-axis: MPI Level R (Rm); y-axis in percent (0% to 250%); one series per Rt]

Summary

B2R Algorithm applies across heterogeneous, hierarchical platforms
– Deep GPU hierarchies
– Deep CPU multi-core systems

Cubic nature of runtime dependence on R is a remarkable aspect
– A maximum and minimum exist
– Optimal (minimum) can be dramatically low

Results show clear performance improvement
– Up to 150x in the best case (fine grained)

Future Work

Generate cross-platform code
– E.g., Implement in OpenCL

Add to CUDA-MPI levels
– Multi-GPU per node

Implement and test with more benchmarks
– E.g., From existing ABMS suites NetLogo & Repast

Generalize to unstructured inter-agent graphs
– E.g., Social networks

Potential to apply to other domains
– E.g., Stencil computations

Thank you!

Questions?

Additional material at our webpage: Discrete Computing Systems

www.ornl.gov/~2ip