
Exploiting Parallelism on GPUs
Matt Mukerjee
David Naylor
Parallelism on GPUs
• $100 NVIDIA video card → 192 cores
– (Build Blacklight for ~$2000 ???)
• Incredibly low power
• Ubiquitous
• Question: Use for general computation?
– General Purpose GPU (GPGPU)
GPU Hardware
• Very specific constraints
– Designed to be SIMD (e.g. shaders)
– Zero-overhead thread scheduling
– Little caching (compared to CPUs)
• Constantly stalled on memory access
• MASSIVE # of threads / core
• Much finer-grained threads (launched as “kernels”)
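The bullets above can be made concrete with a minimal CUDA sketch (the kernel name, sizes, and scaling operation here are illustrative, not from the slides): far more threads are launched than there are cores, and the zero-overhead scheduler swaps in ready warps whenever others stall on memory.

```cuda
#include <cstdio>

// Each thread does one tiny unit of work: a single array element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;                  // 1M elements
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    // Launch ~4096 blocks of 256 threads -- vastly more threads than
    // cores, which is how the GPU hides memory latency.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```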
CUDA Architecture
Thread Blocks
• GPUs are SIMD
• How does multithreading work?
• Threads that diverge at a branch are masked off, then run serially
• Single Instruction Multiple….?
CUDA is an SIMT architecture
• Single Instruction Multiple Thread
• Threads in a block execute the same instruction
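A small hypothetical kernel shows what SIMT execution means at a branch: because the threads share one instruction stream, the hardware runs one side of the branch with the other side's lanes masked off, then swaps.

```cuda
__global__ void diverge(int *out) {
    // All threads share one instruction stream (SIMT). Half take each
    // branch, so the two paths execute one after the other, with the
    // non-participating lanes masked off each time.
    if (threadIdx.x % 2 == 0)
        out[threadIdx.x] = threadIdx.x * 2;  // even lanes run; odd lanes idle
    else
        out[threadIdx.x] = threadIdx.x + 1;  // odd lanes run; even lanes idle
}
```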
[Figure: streaming multiprocessor with a multi-threaded instruction unit]
Observation
Fitting the data structures needed by the
threads in one multiprocessor requires
application-specific tuning.
Example: MapReduce on CUDA
– Intermediate data structures are too big for the cache on one SM!
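The tuning problem the slide describes comes from the small per-SM on-chip memory. A sketch (not the slides' actual MapReduce code; the tile size and reduction are illustrative) of sizing a block's working set to fit in shared memory:

```cuda
// Per-block scratch space lives in the SM's small on-chip shared
// memory, so the tile size must be tuned per application to fit.
#define TILE 256   // chosen so one tile per block fits in shared memory

__global__ void sum_tiles(const float *in, float *out, int n) {
    __shared__ float tile[TILE];            // the block's working set
    int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    // Tree reduction entirely within on-chip memory.
    for (int s = TILE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];          // one partial sum per block
}
```

If the working set were larger than shared memory (as in the MapReduce example), it would spill to off-chip memory and lose the benefit, which is why the tuning is application-specific.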
Problem
Only one code branch within a block
executes at a time
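One common way around this serialization is rewriting divergent branches into branch-free (predicated) form, so every lane executes the same instructions. A minimal hypothetical example:

```cuda
// Divergent version: the two branch paths execute one after the other.
__global__ void clamp_divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f) x[i] = 0.0f;   // path A
        // path B: leave x[i] unchanged
    }
}

// Branch-free version: every lane runs the same instruction sequence.
__global__ void clamp_uniform(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = fmaxf(x[i], 0.0f);       // no divergence within the block
}
```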
Enhancing SIMT
Problem
If two multiprocessors share a cache line,
there are more memory accesses than
necessary.
Data Reordering
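One form data reordering can take is switching from array-of-structures to structure-of-arrays layout, so neighboring threads touch adjacent addresses instead of strided ones (the `Particle` type and kernels below are illustrative, not from the slides):

```cuda
// Array-of-structures: thread i reads record i, so neighboring threads
// hit addresses 16 bytes apart -- poorly coalesced loads, and separate
// multiprocessors can end up touching the same cache line.
struct Particle { float x, y, z, w; };
__global__ void aos_read(const Particle *p, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p[i].x;     // strided access
}

// Structure-of-arrays: neighboring threads read adjacent floats, so a
// warp's loads coalesce into a few wide memory transactions.
__global__ void soa_read(const float *x, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];       // unit-stride access
}
```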