Parallel Programming with Nvidia’s CUDA Framework

Parallel Programming With Nvidia’s CUDA Framework
Alan Andrews
CS498, Spring 2009

In The Beginning
• CPU speed increased exponentially since the 1960s
• Developers depended on ever-faster CPUs
• The limit was reached in 2005

New Direction For CPU
• Less speed, more cores
• Use of parallelism
• As of 2009, 4-core CPUs affordable

Enter The GPU
• Already designed for parallelism
• ATI (now AMD) and Nvidia have been driving more cores for the past 10 years
• All gaming PCs have advanced GPU video cards

Scientists Use GPU
• Used the graphics API
• Treated data as textures
• Required knowledge of the graphics pipeline
• Difficult to use

Nvidia Brings Us CUDA
• Introduced in November 2006
• Turns the GPU into a general-purpose processor
• Required hardware changes
  – Only available on the G80 or later GPUs
    • GeForce 8000 series or newer
• Implemented as an extension to C/C++
  – Results in a lower learning curve

GeForce 8800 Specs
• 16 Streaming Multiprocessors (SMs)
  – Each one has 8 Streaming Processors (SPs)
  – Each SM can execute 32 threads simultaneously
  – 512 threads execute per cycle
  – SPs hide instruction latencies
• 768 MB DRAM
  – 86.4 GB/s memory bandwidth to the GPU cores
  – 4 GB/s memory bandwidth to system memory

G80 GPU Device Layout

[Figure: G80 device layout – the host feeds an input assembler and thread execution manager, which dispatch work to an array of streaming multiprocessors, each with a parallel data cache and texture unit; load/store units connect them to global memory.]

CUDA Execution Model
• Starts with a kernel
• A kernel is a function called on the host that executes on the GPU
• Thread resources are abstracted into 3 levels (see the sketch below)
  – Grid – highest level
  – Block – collection of threads
  – Thread – execution unit
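As a minimal sketch of how this hierarchy shows up in code (the kernel, its arguments, and the launch dimensions below are illustrative, not taken from the talk), each thread combines its block index and its index within the block to find the element it owns:

    // Illustrative kernel: each thread handles one element of a vector sum.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        // blockIdx  = which block within the grid
        // threadIdx = which thread within the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // Host-side launch: a grid of blocks, each block a collection of threads.
    //   dim3 block(256);                 // 256 threads per block
    //   dim3 grid((n + 255) / 256);      // enough blocks to cover n elements
    //   vecAdd<<<grid, block>>>(devA, devB, devC, n);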

CUDA Execution Model

[Figure: the host launches Kernel 1 on Grid 1, a 2×2 arrangement of blocks, and Kernel 2 on Grid 2; Block (1, 1) is expanded to show its individual threads, indexed in three dimensions from Thread (0,0,0) to Thread (3,1,0).]

Figure 3.2. An Example of CUDA Thread Organization. (Courtesy: NVIDIA)

CUDA Memory Model
• 768 MB global memory
  – Accessible to all threads globally
  – 86.4 GB/s throughput
• 16 KB shared memory per SM
  – Accessible to all threads within a block
  – 384 GB/s throughput
• 32 KB register file per SM
  – Allocated to threads at runtime (local variables)
  – 384 GB/s throughput
  – Threads can only see their own registers
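A minimal kernel sketch (the kernel name and the block size are illustrative, not from the slides) showing where each of these memory spaces appears in CUDA code:

    #define BLOCK 64   // assumed block size for this sketch

    // Pointer arguments refer to global memory; __shared__ arrays live in the
    // per-block shared memory; ordinary local variables occupy registers.
    __global__ void sumBlocks(const float *in, float *out)
    {
        __shared__ float tile[BLOCK];          // shared memory, visible to the whole block

        float x = in[blockIdx.x * BLOCK + threadIdx.x];   // register, private to this thread
        tile[threadIdx.x] = x;
        __syncthreads();                       // wait until every thread has stored its value

        if (threadIdx.x == 0) {                // one thread per block sums the tile
            float sum = 0.0f;                  // register
            for (int i = 0; i < BLOCK; ++i)
                sum += tile[i];
            out[blockIdx.x] = sum;             // write the result back to global memory
        }
    }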

CUDA Memory Model

[Figure: a grid containing Block (0, 0) and Block (1, 0); each block has its own shared memory and per-thread registers for Thread (0, 0) and Thread (1, 0); both blocks and the host access global memory.]

How Do You Execute a CUDA Kernel?

(From a C/C++ function)
• Allocate memory on the CUDA device
• Copy data to the CUDA device
• Configure thread resources
  – Grid layout (max 65536 × 65536)
  – Block layout (3-dimensional, max of 512 threads)
• Execute the kernel with those thread resources
• Copy data out of the CUDA device
• Free memory on the CUDA device
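A host-side sketch of those steps, assuming a hypothetical kernel myKernel that doubles every element of an n-element float array (the function and kernel names are made up for illustration):

    #include <cuda_runtime.h>

    // Hypothetical kernel, present only so the launch step has something to call.
    __global__ void myKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    void runOnDevice(float *hostData, int n)
    {
        float *devData;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&devData, bytes);                          // 1. allocate memory on the device
        cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice);  // 2. copy data to the device

        dim3 block(256);                                               // 3. configure thread resources
        dim3 grid((n + block.x - 1) / block.x);
        myKernel<<<grid, block>>>(devData, n);                         // 4. execute the kernel

        cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost);  // 5. copy data out of the device
        cudaFree(devData);                                             // 6. free device memory
    }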

CUDA In Action: Matrix Multiplication
• Multiply matrices M and N to form result R
• General algorithm
  – For each row i in matrix R
    • For each column j in matrix R
      – Cell (i, j) = dot product of row i of M and column j of N
• The algorithm runs in O(length³)
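For reference, a plain C/C++ sketch of that serial algorithm (square width × width matrices in row-major order are assumed; the function name is illustrative):

    // Serial matrix multiplication: R = M x N. The three nested loops are what
    // give the O(length^3) running time.
    void matMulSerial(const float *M, const float *N, float *R, int width)
    {
        for (int i = 0; i < width; ++i)             // each row of R
            for (int j = 0; j < width; ++j) {       // each column of R
                float sum = 0.0f;
                for (int k = 0; k < width; ++k)     // dot product of row i of M and column j of N
                    sum += M[i * width + k] * N[k * width + j];
                R[i * width + j] = sum;
            }
    }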

Matrix Multiplication On CUDA
• Each thread represents one cell (i, j)
• Each thread calculates the value for its cell (i, j)
• Use a single block
• Should run in O(length)
  – Much better than O(length³)

Matrix Multiplication On CUDA

[Figure: WIDTH × WIDTH matrices M and N multiplied to produce the result matrix P.]

Matrix Multiplication On CUDA Code
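The code from this slide is not reproduced in the transcript. A sketch of a first-attempt kernel along the lines described (one thread per result cell, a single block; names are illustrative) might look like:

    // First attempt (sketch): one block; each thread computes one cell (i, j) of R.
    // threadIdx.y selects the row and threadIdx.x the column.
    __global__ void matMulSingleBlock(const float *M, const float *N, float *R, int width)
    {
        int row = threadIdx.y;
        int col = threadIdx.x;

        float sum = 0.0f;
        for (int k = 0; k < width; ++k)             // dot product: O(length) work per thread
            sum += M[row * width + k] * N[k * width + col];
        R[row * width + col] = sum;
    }

    // Launched with a single block covering the whole matrix:
    //   dim3 block(width, width);                  // at most 512 threads, so width <= 22
    //   matMulSingleBlock<<<1, block>>>(devM, devN, devR, width);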

Limitations With First Attempt
• Max threads allowed per block is 512
• Only supports a max matrix size of 22 × 22
  – 484 threads needed (22 × 22 = 484 ≤ 512)

CUDA Blocks
• Split the result matrix into smaller blocks
• Utilizes more SMs than the single-block approach
• Better speed-up

Blocks Diagram

[Figure: tiled matrix multiplication – block indices (bx, by) and thread indices (tx, ty), the latter ranging over 0 … TILE_WIDTH−1, select the TILE_WIDTH × TILE_WIDTH sub-block Pdsub of the WIDTH × WIDTH result matrix Pd, computed from strips of Md and Nd.]

Matrix Multiplication Using Blocks
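The blocked kernel is likewise not captured in the transcript. A sketch of the approach described, with the result matrix tiled across many blocks so that every SM gets work (a TILE_WIDTH of 16 is an assumption), could be:

    #define TILE_WIDTH 16   // assumed tile size; each block owns one TILE_WIDTH x TILE_WIDTH tile of P

    // Blocked version (sketch): blockIdx picks the tile of the result matrix P,
    // threadIdx picks the cell within that tile. All reads still go to global memory.
    __global__ void matMulBlocked(const float *M, const float *N, float *P, int width)
    {
        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

        float sum = 0.0f;
        for (int k = 0; k < width; ++k)
            sum += M[row * width + k] * N[k * width + col];
        P[row * width + col] = sum;
    }

    // Launch (assuming width is a multiple of TILE_WIDTH):
    //   dim3 block(TILE_WIDTH, TILE_WIDTH);
    //   dim3 grid(width / TILE_WIDTH, width / TILE_WIDTH);
    //   matMulBlocked<<<grid, block>>>(devM, devN, devP, width);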

Matrix Multiplication Speed Analysis
• Runs 10 times as fast as the serial approach
• The solution runs at 21.4 GFLOPS
  – The GPU is capable of 384 GFLOPS
  – What gives?

How The GPU Executes Code
• Each block is assigned to an SP
  – 8 SPs per SM
• The SM executes a single SP at a time
• The SM switches SPs when a long-latency operation is found
  – Works similarly to Intel’s Hyper-Threading
• The SM executes a batch of 32 threads at a time
  – A batch of 32 threads is called a warp

GPU Constraints – Memory Speed
• Global memory bandwidth is 86.4 GB/s
• Shared memory bandwidth is 384 GB/s
• Register file bandwidth is 384 GB/s
• The key is to use shared memory and registers whenever possible

GPU Constraints – Memory Size
• Each SM has 16 KB of shared memory
• Each SM has a 32 KB register file
• Local variables in a function take up registers
• The register file must support all threads on the SM
  – If there are not enough registers, fewer blocks are scheduled
  – The program still executes, but with less parallelism

GPU Constraints – Thread Count
• An SM can only handle 768 threads
• An SM can handle 8 blocks, 1 block for each SP
• Each block can therefore have up to 96 threads (768 ÷ 8 = 96)
  – Max out SM resources

Matrix Multiplication: Second Attempt
• Copy a portion of each matrix to shared memory
  – Maximize data reuse
• Ensure the thread block size is small enough to maximize parallelism

Matrix Multiplication: Second Attempt Code
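Again, the slide’s code is not in the transcript. A sketch of a shared-memory tiled kernel in the spirit of the description (each block stages TILE_WIDTH × TILE_WIDTH tiles of M and N in shared memory and reuses them; TILE_WIDTH = 16 and an evenly divisible width are assumptions) might look like:

    #define TILE_WIDTH 16   // assumed tile size

    // Second attempt (sketch): each block loads one tile of M and one tile of N
    // into shared memory, accumulates a partial dot product from them, then moves
    // on to the next pair of tiles. Each value is fetched from global memory once
    // per tile instead of once per thread, so data reuse happens in fast memory.
    __global__ void matMulTiled(const float *M, const float *N, float *P, int width)
    {
        __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

        float sum = 0.0f;
        for (int t = 0; t < width / TILE_WIDTH; ++t) {
            // Cooperatively load the current tiles of M and N.
            Ms[threadIdx.y][threadIdx.x] = M[row * width + t * TILE_WIDTH + threadIdx.x];
            Ns[threadIdx.y][threadIdx.x] = N[(t * TILE_WIDTH + threadIdx.y) * width + col];
            __syncthreads();                         // tiles fully loaded before use

            for (int k = 0; k < TILE_WIDTH; ++k)     // partial dot product from shared memory
                sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
            __syncthreads();                         // finish reading before the tiles are overwritten
        }
        P[row * width + col] = sum;
    }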

Matrix Multiplication: Second Attempt

Second Attempt Results
• Speed-up of 6× over the previous implementation
• 60 times faster than the serial implementation
• Utilizes 120 GFLOPS

CUDA In The Field
• MRI (currently requires a small cluster of PCs)
• Adobe Photoshop
• Stanford’s Folding@Home

Use CUDA Today
• The power of a cluster inside a single PC
• GeForce GTX 285 specs
  – 30 Streaming Multiprocessors
    • 960 threads executing simultaneously
    • 7680 threads in flight, Hyper-Threading style
  – 1 GB DRAM
    • 159 GB/s bandwidth