Transcript ppt - The University of Akron
Prepared 6/23/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Options for running CUDA:
- Your own PCs running G80 emulators: better debugging environment; sufficient for the first couple of weeks
- Your own PCs with a CUDA-enabled GPU
- NVIDIA boards in department:
  - GeForce family of processors for high-performance gaming
  - Tesla C2070 for high-performance computing – no graphics output (?) and more memory
CUDA at the University of Akron – Slide 2
Description                     Card Models                                       Where Available
Low Power                       Ion                                               Netbooks in CAS 241
Consumer Graphics Processors    GeForce 8500GT, GeForce 9500GT, GeForce 9600GT    Add-in cards in Dell Optiplex 745s in department
2nd Generation GPUs             GeForce GTX275                                    In Dell Precision T3500s in department
Fermi GPUs                      GeForce GTX480                                    In select Dell Precision T3500s in department
Fermi GPUs                      Tesla C2070                                       In Dell Precision T7500 Linux server (tesla.cs.uakron.edu)
CUDA at the University of Akron – Slide 3
Basic building block is a “streaming multiprocessor”; different chips have different numbers of these SMs:

Product          SMs   Compute Capability
GeForce 8500GT   2     v. 1.1
GeForce 9500GT   4     v. 1.1
GeForce 9600GT   8     v. 1.1

CUDA at the University of Akron – Slide 4
Basic building block is a “streaming multiprocessor” with:
- 8 cores, each with 2048 registers
- up to 128 threads per core
- 16KB of shared memory
- 8KB cache for constants held in device memory
Different chips have different numbers of these SMs:

Product   SMs   Bandwidth   Memory   Compute Capability
GTX275    30    127 GB/s    1-2 GB   v. 1.3

CUDA at the University of Akron – Slide 5
Each streaming multiprocessor has:
- 32 cores, each with 1024 registers
- up to 48 threads per core
- 64KB of shared memory / L1 cache
- 8KB cache for constants held in device memory
There is also a unified 384KB L2 cache. Different chips again have different numbers of SMs:

Product       SMs   Bandwidth   Memory     Compute Capability
GTX480        15    180 GB/s    1.5 GB     v. 2.0
Tesla C2070   14    140 GB/s    6 GB ECC   v. 2.1

CUDA at the University of Akron – Slide 6
Feature                                                               v. 1.1   v. 1.3, 2.x
Integer atomic functions operating on 64-bit words in global memory   no       yes
Integer atomic functions operating on 32-bit words in shared memory   no       yes
Warp vote functions                                                   no       yes
Double-precision floating-point operations                            no       yes
CUDA at the University of Akron – Slide 7
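To make the shared-memory atomic feature concrete, a per-block histogram is the classic use. The following is an illustrative sketch, not code from the slides: the kernel name and bin count are assumptions, and the shared-memory atomicAdd requires a card from the v. 1.3, 2.x column of the table above.

```cuda
// Sketch: per-block histogram using 32-bit integer atomics in shared memory.
__global__ void histogram256(const unsigned char *data, int n, unsigned int *hist)
{
    __shared__ unsigned int bins[256];

    // Cooperatively zero the shared-memory bins.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        bins[i] = 0;
    __syncthreads();

    // Each thread counts its elements with a shared-memory atomic
    // (listed under v. 1.3, 2.x in the table above).
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&bins[data[i]], 1u);
    __syncthreads();

    // Merge this block's bins into the global histogram using
    // global-memory atomics.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&hist[i], bins[i]);
}
```

The shared-memory stage keeps most of the atomic traffic on-chip; only 256 global atomics per block remain.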
Feature                                                                               v. 1.1, 1.3   v. 2.x
3D grid of thread blocks                                                              no            yes
Floating-point atomic addition operating on 32-bit words in global and shared memory  no            yes
__ballot()                                                                            no            yes
__threadfence_system()                                                                no            yes
__syncthreads_count(), __syncthreads_and(), __syncthreads_or()                        no            yes
Surface functions                                                                     no            yes
CUDA at the University of Akron – Slide 8
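As an illustrative sketch (not from the slides) of two of these v. 2.x intrinsics, the kernel below counts, per block, how many inputs satisfy a predicate; the kernel name and predicate are assumptions.

```cuda
// Sketch: compute-2.x-only intrinsics from the table above.
__global__ void countPositive(const float *x, int n, int *blockCounts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n && x[i] > 0.0f);

    // __ballot() packs one predicate bit per thread of the warp (v. 2.x).
    unsigned int warpMask = __ballot(pred);

    // __syncthreads_count() barriers the block and returns how many
    // threads passed a non-zero predicate (v. 2.x).
    int total = __syncthreads_count(pred);

    if (threadIdx.x == 0)
        blockCounts[blockIdx.x] = total;

    (void)warpMask;  // warpMask could drive warp-level compaction, etc.
}
```

On v. 1.x hardware the same counting would need a shared-memory reduction; these intrinsics collapse it into one instruction plus a barrier.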
Spec                                                                                               Value (all compute capabilities)
Maximum x- or y-dimension of a grid of thread blocks                                               65536
Maximum dimensionality of a thread block                                                           3
Maximum z-dimension of a block                                                                     64
Warp size                                                                                          32
Maximum number of resident blocks per multiprocessor                                               8
Constant memory size                                                                               64 K
Cache working set per multiprocessor for constant memory                                           8 K
Maximum width for a 1D texture reference bound to linear memory                                    2^27
Maximum width, height and depth for a 3D texture reference bound to linear memory or a CUDA array  2048 x 2048 x 2048
Maximum number of textures that can be bound to a kernel                                           128
Maximum number of instructions per kernel                                                          2 million
CUDA at the University of Akron – Slide 9
Spec                                                    v. 1.1   v. 1.3   v. 2.x
Maximum number of resident warps per multiprocessor     24       32       48
Maximum number of resident threads per multiprocessor   768      1024     1536
Number of 32-bit registers per multiprocessor           8 K      16 K     32 K
CUDA at the University of Akron – Slide 10
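As a quick consistency check (not on the original slide), the first two rows agree with each other: the resident-thread limit is the resident-warp limit times the warp size of 32.

```latex
% resident threads = resident warps \times warp size
24 \times 32 = 768, \qquad 32 \times 32 = 1024, \qquad 48 \times 32 = 1536
```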
Spec                                                             v. 1.1, 1.3   v. 2.x
Maximum dimensionality of a grid of thread blocks                2             3
Maximum x- or y-dimension of a block                             512           1024
Maximum number of threads per block                              512           1024
Maximum amount of shared memory per multiprocessor               16 K          48 K
Number of shared memory banks                                    16            32
Amount of local memory per thread                                16 K          512 K
Maximum width for a 1D texture reference bound to a CUDA array   8192          32768
CUDA at the University of Akron – Slide 11
Spec                                                                                         v. 1.1, 1.3         v. 2.x
Maximum width and number of layers for a 1D layered texture reference                        8192 x 512          16384 x 2048
Maximum width and height for a 2D texture reference bound to linear memory or a CUDA array   65536 x 32768       65536 x 65536
Maximum width, height, and number of layers for a 2D layered texture reference               8192 x 8192 x 512   16384 x 16384 x 2048
Maximum width for a 1D surface reference bound to a CUDA array                               Not supported       8192
Maximum width and height for a 2D surface reference bound to a CUDA array                    Not supported       8192 x 8192
Maximum number of surfaces that can be bound to a kernel                                     Not supported       8
CUDA at the University of Akron – Slide 12
CUDA (Compute Unified Device Architecture) is NVIDIA’s program development environment:
- based on C with some extensions
- C++ support increasing steadily
- FORTRAN support provided by the PGI compiler
- lots of example code and good documentation – a 2-4 week learning curve for those with experience of OpenMP and MPI programming
- large user community on NVIDIA forums
CUDA at the University of Akron – Slide 13
When installing CUDA on a system, there are three components:
- driver: low-level software that controls the graphics card; usually installed by a sys-admin
- toolkit: the nvcc CUDA compiler, some profiling and debugging tools, and various libraries; usually installed by a sys-admin in /usr/local/cuda
CUDA at the University of Akron – Slide 14
- SDK: lots of demonstration examples, a convenient Makefile for building applications, and some error-checking utilities; not supported by NVIDIA, with almost no documentation; often installed by the user in their own directory
CUDA at the University of Akron – Slide 15
Remotely access the front end:

    ssh tesla.cs.uakron.edu

ssh sends your commands over an encrypted stream so your passwords, etc., can’t be sniffed over the network.
CUDA at the University of Akron – Slide 16
The first time you do this: after login, run

    /root/gpucomputingsdk_3.2.16_linux.run

and just take the default answers to get your own personal copy of the SDK. Then:

    cd ~/NVIDIA_GPU_Computing_SDK/C
    make -j12 -k

will build all that can be built.
CUDA at the University of Akron – Slide 17
The first time you do this: binaries end up in ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release. In particular header file
Two choices:
- use nvcc within a standard Makefile
- use the special Makefile template provided in the SDK
The SDK Makefile provides some useful options:
- make emu=1 uses an emulation library for debugging on a CPU
- make dbg=1 activates run-time error checking
In general just use a standard Makefile.
CUDA at the University of Akron – Slide 19
GENCODE_ARCH := -gencode=arch=compute_10,code=\"sm_10,compute_10\" \
                -gencode=arch=compute_13,code=\"sm_13,compute_13\" \
                -gencode=arch=compute_20,code=\"sm_20,compute_20\"
INCLOCS := -I$(HOME)/NVIDIA_GPU_Computing_SDK/shared/inc \
           -I$(HOME)/NVIDIA_GPU_Computing_SDK/C/common/inc
LIBLOCS := -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib \
           -L$(HOME)/NVIDIA_GPU_Computing_SDK/C/lib
LIBS = -lcutil_x86_64

<progName>: <progName>.cu <progName>.cuh
	nvcc $(GENCODE_ARCH) $(INCLOCS) <progName>.cu $(LIBLOCS) $(LIBS) -o <progName>

CUDA at the University of Akron – Slide 20
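For reference, a minimal <progName>.cu that such a Makefile could build might look like the sketch below. This is illustrative, not from the slides: cutil_inline.h and cutilSafeCall come from the SDK’s cutil library linked via -lcutil_x86_64, and everything else (file contents, sizes) is an assumption.

```cuda
// Illustrative <progName>.cu: doubles an array on the GPU.
#include <cstdio>
#include <cutil_inline.h>   // SDK header; provides cutilSafeCall error checking

__global__ void doubleElements(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1024;
    float h[n], *d;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    cutilSafeCall(cudaMalloc((void **)&d, n * sizeof(float)));
    cutilSafeCall(cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice));
    doubleElements<<<(n + 255) / 256, 256>>>(d, n);
    cutilSafeCall(cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost));
    cutilSafeCall(cudaFree(d));

    printf("h[10] = %f\n", h[10]);   // element 10 should now be 20.0
    return 0;
}
```

The GENCODE_ARCH line above embeds sm_10, sm_13 and sm_20 code in one binary, so this same executable runs on every card listed earlier.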
Parallel Thread Execution (PTX):
- virtual machine and ISA
- programming model
- execution resources and state
CUDA Tools and Threads – Slide 2
Any source file containing CUDA extensions must be compiled with nvcc. nvcc is a compiler driver: it works by invoking all the necessary tools and compilers like cudacc, g++, cl, etc. nvcc outputs:
- host CPU code in C, which must then be compiled with the rest of the application using another tool
- PTX: either object code directly, or PTX source interpreted at runtime
CUDA Tools and Threads – Slide 22
Any executable with CUDA code requires two dynamic libraries:
- the CUDA runtime library (cudart)
- the CUDA core library (cuda)
CUDA Tools and Threads – Slide 23
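A minimal host-only program that exercises just the runtime library (cudart) might look like the following sketch; the file name and output format are assumptions, but the runtime calls (cudaGetDeviceCount, cudaGetDeviceProperties) are standard.

```cuda
// Minimal host program linked against cudart: lists the CUDA devices.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

On tesla.cs.uakron.edu this kind of query is a quick way to confirm which compute capability (and hence which feature columns above) applies.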
An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime:
- no need for any device or CUDA driver
- each device thread is emulated with a host thread
CUDA Tools and Threads – Slide 24
Running in device emulation mode, one can:
- use host native debug support (breakpoints, inspection, etc.)
- access any device-specific data from host code and vice versa
- call any host function from device code (e.g. printf) and vice versa
- detect deadlock situations caused by improper usage of __syncthreads
CUDA Tools and Threads – Slide 25
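The __syncthreads misuse mentioned above typically looks like the following sketch (the kernel is illustrative, not from the slides): the barrier sits in a divergent branch, so some threads wait at a barrier the others never reach.

```cuda
// Sketch of a __syncthreads deadlock that device emulation mode can expose:
// only the even-numbered threads reach the barrier.
__global__ void badBarrier(float *data)
{
    if (threadIdx.x % 2 == 0) {
        data[threadIdx.x] *= 2.0f;
        __syncthreads();   // WRONG: not executed by all threads of the block
    }
    // Correct form: hoist __syncthreads() out of the branch so that every
    // thread in the block executes it unconditionally.
}
```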
Emulated device threads execute sequentially, so simultaneous access of the same memory location by multiple threads could produce different results Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode CUDA Tools and Threads – Slide 26
Results of floating-point computations will differ slightly because of:
- different compiler outputs and instruction sets
- use of extended precision for intermediate results
There are various options to force strict single precision on the host.
CUDA Tools and Threads – Slide 27
New Visual Studio-based GPU integrated development environment: http://developer.nvidia.com/object/nexus.html
Available in beta (as of October 2009).
CUDA Tools and Threads – Slide 28
Based on original material from http://en.wikipedia.com/wiki/CUDA, accessed 6/22/2011.
With thanks to:
- The University of Akron: Charles Van Tilburg
- The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
- Oxford University: Mike Giles
- Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 6/23/2011.
CUDA at the University of Akron – Slide 29