Exploiting Parallelism on GPUs
• A ~$100 NVIDIA video card has 192 cores
– (Build Blacklight for ~$2000 ???)
• Incredibly low power
• Question: Use for general computation?
– General Purpose GPU (GPGPU)
• Very specific constraints
– Designed to be SIMD (e.g. shaders)
– Zero-overhead thread scheduling
– Little caching (compared to CPUs)
• Constantly stalled on memory access
• MASSIVE # of threads / core
• Much finer-grained threads (work expressed as “kernel” functions)
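The bullets above can be made concrete with a minimal CUDA sketch: one very fine-grained thread per array element, launched in the tens of thousands so that stalled warps can be swapped for ready ones at zero scheduling cost. The names (`vecAdd`, `n`) are illustrative, not from the slides.

```cuda
// Sketch: each thread does one element's worth of work
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Launch with MASSIVE thread counts to hide memory latency:
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```

Note how each thread's "program" is tiny; latency is hidden by having far more threads in flight than cores, not by large caches.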
• GPUs are SIMD
• How does multithreading work?
• Threads that take a divergent branch are halted, then run after the others
• Single Instruction Multiple… Thread?
CUDA is an SIMT architecture
• Single Instruction Multiple Thread
• Threads in a warp execute the same instruction in lockstep
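The SIMT model above can be seen in any straight-line kernel: every thread in the warp issues the same instruction, and only the data (selected by the thread index) differs. A hedged sketch, with illustrative names:

```cuda
// SIMT: one instruction stream, many threads, per-thread data
__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = x[i] * s;  // every thread in the warp runs this same
                          // multiply, each on its own element x[i]
}
```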
Example: MapReduce on CUDA
• The data structures needed by the threads in one multiprocessor are too big for the cache on one SM!
• Only one code branch within a warp executes at a time; divergent paths are serialized
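The serialization above is easy to trigger. In this sketch (illustrative, not from the slides), branching on the thread index splits every warp, so both paths run one after the other with half the lanes idle each time:

```cuda
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        x[i] = x[i] * 2.0f;  // even lanes run while odd lanes stall...
    else
        x[i] = x[i] + 1.0f;  // ...then odd lanes run while even lanes stall
}
```

Branching on something uniform across a warp (e.g. `blockIdx.x % 2`) keeps all lanes on one path and avoids the serialization.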
• If two multiprocessors share a cache line, there are more memory accesses than necessary
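Memory-access patterns matter for the same reason. A hedged sketch contrasting a coalesced access (adjacent threads touch adjacent words, so a warp's loads combine into few transactions) with a strided one (adjacent threads touch different cache lines, so each load is a separate transaction); all names are illustrative:

```cuda
// Coalesced: thread i reads word i -> one transaction per warp
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: thread i reads word i*stride -> up to one transaction per thread
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];
}
```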