High performance computing on
the GPU: NVIDIA G80 and CUDA
Won-Ki Jeong, Ross Whitaker
SCI Institute
University of Utah
GPGPU
• General Purpose computation on the GPU
– Started in computer graphics community
– Mapping computation problems to graphics
rendering pipeline
Courtesy Jens Krueger and Aaron Lefohn
Why GPU for Computing?
• GPU is fast
– Massively parallel
• CPU : ~4 cores @ 3.0 GHz (Intel Quad Core)
• GPU : ~128 cores @ 1.35 GHz (NVIDIA GeForce 8800 GTX)
– High memory bandwidth
• CPU : 21 GB/s
• GPU : 86 GB/s
– Simple architecture optimized for compute-intensive tasks
• Programmable
– Shaders, NVIDIA CUDA, ATI CTM
• High precision floating point support
– 32-bit floating point (IEEE 754)
– 64-bit floating point will be available in early 2008
Why GPU for computing?
• Inexpensive supercomputer
– Two NVIDIA Tesla D870 : 1 TFLOPS
• GPU hardware performance increases faster than CPU
– Trend : simple, scalable architecture, interaction of clock speed,
cache, memory (bandwidth)
[Chart: GFLOPS over time for successive NVIDIA GPUs.
Legend: G80GL = Quadro FX 5600, G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX,
G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra,
NV30 = GeForce FX 5800. Courtesy NVIDIA]
GPU is for Parallel Computing
• CPU
– Large cache and sophisticated flow control minimize
latency of arbitrary memory accesses for serial processing
• GPU
– Simple flow control and limited cache, more
transistors for computing in parallel
– High arithmetic intensity hides memory latency
[Diagram: the CPU devotes transistors to control logic and cache with a few ALUs,
while the GPU devotes most transistors to many ALUs with minimal control and cache;
both are backed by DRAM. Courtesy NVIDIA]
GPU-friendly Problems
• High arithmetic intensity
– Computation must offset memory latency
• Coherent data access (e.g. structured
grids)
– Maximize memory bandwidth
• Data-parallel processing
– Same computation over large datasets (SIMD)
• E.g. convolution using a fixed kernel, PDEs
• Jacobi updates (isolate data stream read and write)
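As a concrete illustration of data-parallel processing, here is a minimal CUDA sketch (the names and the 1D setup are illustrative, not from the talk) of a Jacobi-style update that isolates reads from writes by using separate input and output arrays:

// Illustrative only: the same stencil computation applied to every point.
__global__ void jacobi1D(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.5f * (in[i - 1] + in[i + 1]);   // read old values, write new ones
}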
Traditional GPGPU Model
• GPU as a streaming processor (SIMD)
– Memory
• Textures
– Computation kernel
• Vertex / fragment shaders
– Programming
• Graphics API (OpenGL, DirectX), Cg, HLSL
• Example
– Render a screen-sized quad with a texture
mapping using a fragment shader
Graphics Pipeline
[Diagram: Vertex Processor → Rasterizer → Fragment Processor → Framebuffer,
with texture memory feeding the fragment stage]
Problems of Traditional GPGPU Model
• Software limitation
– High learning curve
– Graphics API overhead
– Inconsistency in API
– Debugging is difficult
• Hardware limitation
– No general memory access (no scatter)
• B = A[i] : gather (supported)
• A[i] = B : scatter (not supported)
– No integer/bitwise operations
– Memory access can be bottleneck
• Need coherent memory access for cache performance
NVIDIA G80 and CUDA
• New HW/SW architecture for computing on the
GPU
– GPU as massively parallel multithreaded machine :
one step further from streaming model
– New hardware features
• Unified shaders (ALUs)
• Flexible memory access
• Fast user-controllable on-chip memory
• Integer, bitwise operations
– New software features
• Extended C programming language and compiler
• Support debugging option (through emulation)
GPU : Highly Parallel Coprocessor
• GPU as a coprocessor that
– Has its own DRAM memory
– Communicates with the host (CPU) through a bus
(PCI Express)
– Runs many threads in parallel
• GPU threads
– GPU threads are extremely lightweight
(almost no cost for creation/context switch)
– GPU needs at least several thousand threads
for full efficiency
Programming Model: SPMD + SIMD
• Hierarchy
– Device = Grids
– Grid = Blocks
– Block = Warps
– Warp = Threads
• Single kernel runs on multiple blocks (SPMD)
• Single instruction executed on multiple threads (SIMD)
– Warp size determines SIMD granularity (G80 : 32 threads)
• Synchronization within a block using shared memory
[Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device;
each grid is a 2D arrangement of blocks, e.g. Block (0, 0) … Block (2, 1), and each
block, e.g. Block (1, 1), is a 2D arrangement of threads, e.g. Thread (0, 0) … Thread (4, 2).
Courtesy NVIDIA]
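To make the hierarchy concrete, below is a minimal, hypothetical launch sketch (the kernel name, data, and sizes are illustrative, not from the talk): each thread computes its global index from blockIdx, blockDim, and threadIdx, and the host launches a grid of 16x16 thread blocks.

// Illustrative only: one thread per element of a width x height array.
__global__ void scaleKernel(float* data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
    if (x < width && y < height)
        data[y * width + x] *= 2.0f;                 // same instruction on many threads (SIMD)
}

// Host side: a grid of blocks (SPMD), each block 16x16 threads.
dim3 block(16, 16);
dim3 grid((width + 15) / 16, (height + 15) / 16);
scaleKernel<<<grid, block>>>(d_data, width, height);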
Hardware Implementation :
a set of SIMD Processors
• Device
– a set of multiprocessors
• Multiprocessor
– a set of 32-bit SIMD
processors
[Diagram: the device contains Multiprocessor 1 … Multiprocessor N; each multiprocessor
contains Processor 1 … Processor M sharing a single Instruction Unit. Courtesy NVIDIA]
Memory Model
• Each thread can:
– Read/write per-thread registers
– Read/write per-thread local memory
– Read/write per-block shared memory
– Read/write per-grid global memory
– Read only per-grid constant memory
– Read only per-grid texture memory
• The host can read/write global,
constant, and texture memory
[Diagram: each thread has its own registers and local memory; each block has shared
memory; all blocks in a grid share global, constant, and texture memory, which the
host can also access. Courtesy NVIDIA]
Hardware Implementation :
Memory Architecture
• Device memory (DRAM)
– Slow (2~300 cycles)
– Local, global, constant,
and texture memory
• On-chip memory
– Fast (1 cycle)
– Registers, shared memory,
constant/texture cache
[Diagram: each multiprocessor has per-processor registers, shared memory, and
constant/texture caches; all multiprocessors access off-chip device memory.
Courtesy NVIDIA]
Memory Access Strategy
Copy data from global to shared memory
Synchronization
Computation (iteration)
Synchronization
Copy data from shared to global memory
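A minimal kernel sketch of this strategy (the tile size and the computation are placeholders, not the actual code from the talk; it assumes a launch with TILE threads per block):

#define TILE 256

// Illustrative only: stage data in shared memory, compute, write back.
__global__ void iterateTile(const float* g_in, float* g_out)
{
    __shared__ float s[TILE];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    s[threadIdx.x] = g_in[gid];                        // copy global -> shared
    __syncthreads();                                   // synchronization

    s[threadIdx.x] = 0.5f * (s[threadIdx.x] + 1.0f);   // computation (placeholder iteration)
    __syncthreads();                                   // synchronization

    g_out[gid] = s[threadIdx.x];                       // copy shared -> global
}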
Execution Model
• Each thread block is executed by a single
multiprocessor
– Synchronized using shared memory
• Many thread blocks are assigned to a single
multiprocessor
– Executed concurrently in a time-sharing fashion
– Keep GPU as busy as possible
• Running many threads in parallel can hide DRAM
memory latency
– Global memory access : 2~300 cycles
CUDA
• C-extension programming language
– No graphics API
• Flattens learning curve
• Better performance
– Support debugging tools
• Extensions / API
– Function type : __global__, __device__, __host__
– Variable type : __shared__, __constant__
– cudaMalloc(), cudaFree(), cudaMemcpy(),…
– __syncthreads(), atomicAdd(),…
• Program types
– Device program (kernel) : run on the GPU
– Host program : run on the CPU to call device programs
Example: Vector Addition Kernel
// Pair-wise addition of vector elements
// One thread per addition
__global__ void
vectorAdd(float* iA, float* iB, float* oC)
{
int idx = threadIdx.x + blockDim.x * blockIdx.x;
oC[idx] = iA[idx] + iB[idx];
}
Courtesy NVIDIA
Example: Vector Addition Host Code
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
// … initialize h_A and h_B

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float) );
cudaMalloc( (void**) &d_B, N * sizeof(float) );
cudaMalloc( (void**) &d_C, N * sizeof(float) );
// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float),
cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float),
cudaMemcpyHostToDevice );
// execute the kernel on N/256 blocks of 256 threads each
vectorAdd<<< N/256, 256>>>( d_A, d_B, d_C);
Courtesy NVIDIA
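For completeness, a typical follow-up to this host code (not shown on the slide, and hypothetical here) copies the result back to the host and releases device memory:

// copy the result back to the host and free device memory (h_C is assumed allocated here)
float* h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
free(h_A); free(h_B); free(h_C);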
Compiling CUDA
• nvcc
– Compiler driver
– Invokes cudacc, g++, cl
• PTX
– Parallel Thread eXecution
– Example:
ld.global.v4.f32  {$f1,$f3,$f5,$f7}, [$r9+0];
mad.f32           $f1, $f5, $f3, $f1;
[Diagram: a C/C++ CUDA application is compiled by NVCC into CPU code plus PTX code;
a PTX-to-target compiler then produces target code for G80 and other GPUs.
Courtesy NVIDIA]
Debugging
• Emulation mode
– CUDA code can be compiled and run in emulation
mode (nvcc -deviceemu)
– No device or driver needed
– Each device thread is emulated with a host thread
– Can call host function from device code (e.g., printf)
– Supports host debug functions (breakpoints, inspection,
etc.)
• Hardware debug will be available late 2007
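As an illustration of what emulation mode allows (a hypothetical snippet, not from the talk): a kernel can call a host function such as printf, since every device thread is emulated by a host thread.

// Compile with: nvcc -deviceemu example.cu
#include <stdio.h>

__global__ void debugKernel(const float* data)
{
    // Legal only in emulation mode: device code may call host functions like printf.
    printf("thread %d reads %f\n", threadIdx.x, data[threadIdx.x]);
}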
Optimization Tips
• Avoid shared memory bank conflict
– Shared memory space is split into 16 banks
• Each bank is 4 bytes (32 bits) wide
• Banks are assigned in round-robin fashion
– Any non-overlapping parallel bank access can
be done in a single memory operation
• Coalesced global memory access
– Contiguous memory address is fast
• a = b[thread_id]; // coalesced
• a = b[2*thread_id]; // non-coalesced
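A common trick for the first tip is to pad a shared memory array; the sketch below (a hypothetical 16x16 tile transpose, assuming for simplicity a single 16x16 thread block) keeps both global accesses coalesced and makes the column read from shared memory conflict-free:

__global__ void transposeTile(const float* in, float* out)
{
    // Padded to 17 columns: with 16 banks of 4 bytes, a column read
    // tile[t][y] for t = 0..15 touches 16 different banks.
    __shared__ float tile[16][17];

    // Coalesced read: consecutive threadIdx.x reads consecutive addresses.
    tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * 16 + threadIdx.x];
    __syncthreads();

    // Coalesced write; the shared memory read walks a column, which is
    // bank-conflict-free only because of the +1 padding (stride 17, not 16).
    out[threadIdx.y * 16 + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}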
CUDA Enabled GPUs / OS
• Supported OS
– MS Windows, Linux
• Supported HW
– NVIDIA GeForce 8800 series
– NVIDIA Quadro 5600/4600
– NVIDIA Tesla series
Courtesy NVIDIA
ATI CTM (Close To Metal)
• Similar to CUDA PTX
– A set of native device instructions
• No compiler support
• Limited programming environment
Example: Fast Iterative Method
• CUDA implementation
– Tile size : 4x4x4
– Update active tile
• Neighbor access
– Manage active list
• Parallel reduction
Coalesced Global Memory Access
• Reordering
– Each tile is stored in global memory in
contiguous memory space
[Diagram: non-coalesced vs. coalesced layouts of tile data in global memory]
Update Active Tile
• Compute new solution
– Copy a tile and its neighbor pixels to shared
memory
– Avoid bank conflict
[Diagram: the tile (yellow) and its left, right, top, and bottom neighbor pixels
are copied into shared memory]
Manage Active List
• Active list
– Simple 1D integer array of active tile indices
– Need to know which tile is NOT converged
[Example: tiles numbered 0-15; the tiles that contain active points form the
active list, e.g. active tiles = {1,2,5,7,8,9,11,13,14}]
Parallel Reduction of Convergence
in Tiles
[Diagram: a tile shown in 1D as per-point convergence flags (T = converged,
F = not converged); a parallel reduction over the flags determines whether the
entire tile has converged]
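A minimal sketch of such a reduction (names and the 64-thread tile size are illustrative; it assumes blockDim.x is a power of two): each thread writes its convergence flag to shared memory, and a tree reduction computes a logical AND for the whole tile.

// Illustrative only: reduce per-thread flags (1 = converged) to one flag per tile.
__global__ void tileConverged(const int* flags, int* tileFlag)
{
    __shared__ int s[64];                  // one flag per thread in a 4x4x4 tile
    int tid = threadIdx.x;

    s[tid] = flags[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Tree reduction: logical AND of all flags in the tile.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] = s[tid] && s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        tileFlag[blockIdx.x] = s[0];       // 1 only if every point in the tile converged
}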
Wrap up
• GPU computing is promising
– Many scientific computing problems are parallelizable
– More consistency/stability in HW/SW
• Streaming architectures are here to stay (and more so)
• Industry trend is multi/many core processor
– Better support/tools (easier to learn, maintain)
• Issues
– No industry-wide standard
– Market driven by gaming industry
– Not every problem is suitable for GPUs
– Re-engineer algorithms/software
– Future performance growth?
• Impact on the data-analysis/interpretation workflow
Questions?