Transcript PPT
CS179: GPU Programming
Lecture 8: More CUDA Runtime
Today
CUDA arrays for textures
CUDA runtime
Helpful CUDA functions
CUDA Arrays
Recall texture memory
Used to store large data
Stored on GPU
Accessible to all blocks and threads
CUDA Arrays
We used texture memory for buffers in lab 3
Allows vertex data to remain on GPU
How else can we access texture memory?
CUDA arrays
CUDA Arrays
Why CUDA arrays over normal arrays?
Better caching, 2D caching
Spatial locality
Supports wrapping/clamping
Supports filtering
CUDA Linear Textures
“Textures” but in global memory
Usage:
Step 1: Create texture reference
texture<TYPE> tex
TYPE = float, float2, float4, int, etc. (1-, 2-, or 4-component types)
Step 2: Bind memory to texture reference
cudaBindTexture(offset, tex, devPtr, size);
Step 3: Get data on device via tex1Dfetch
tex1Dfetch(tex, x);
x is the element index (not the byte offset) where we want to read!
Step 4: Clean up after finished
cudaUnbindTexture(tex)
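The four steps above can be sketched together as follows. This uses the legacy texture reference API (removed in CUDA 12); devPtr, devOut, and N are placeholder names for a device buffer, an output buffer, and the element count:

```cuda
// Step 1: declare the texture reference at file scope
texture<float, 1, cudaReadModeElementType> tex;

__global__ void copyViaTexture(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex, i);   // Step 3: fetch by element index
}

void run(float *devPtr, float *devOut, int N) {
    size_t offset = 0;
    // Step 2: bind N floats of global memory to the texture reference
    cudaBindTexture(&offset, tex, devPtr, N * sizeof(float));
    copyViaTexture<<<(N + 255) / 256, 256>>>(devOut, N);
    // Step 4: clean up when finished
    cudaUnbindTexture(tex);
}
```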
CUDA Linear Textures
Texture reference properties:
texture<type, dim, mode> texRef
type = float, int, float2, etc.
dim = # of dimensions (1, 2, or 3)
mode =
cudaReadModeElementType: standard read
cudaReadModeNormalizedFloat: maps integer data to floats in [0.0, 1.0] (e.g. 0 -> 0.0, 255 -> 1.0 for unsigned chars)
CUDA Linear Textures
Important warning:
Textures live in the global memory space
Threads can read through the texture while other threads write the underlying memory at the same time
This can cause synchronization problems!
Do not rely on thread running order, ever
CUDA Linear Textures
Other limitations:
Only 1D, can make indexing and caching a bit less convenient
Pitch may not be ideal for a 2D array
Not read-write
Solution: CUDA arrays
CUDA Arrays
Live in texture memory space
Access via texture fetches
CUDA Arrays
Step 1: Create channel description
Tells us texture attributes
cudaCreateChannelDesc(int x, int y, int z, int w, enum mode)
x, y, z, w are the number of bits per component
mode is cudaChannelFormatKindFloat, etc.
CUDA Arrays
Step 2: Allocate memory
Must be done dynamically
Use cudaMallocArray(cudaArray **array, const cudaChannelFormatDesc *desc, size_t width, size_t height)
Most global memory functions work with CUDA arrays too
cudaMemcpyToArray, etc.
CUDA Arrays
Step 3: Create texture reference
texture<TYPE, dim, mode> texRef -- just as before
Parameters must match channel description where applicable
Step 4: Edit texture settings
Settings are encoded as texRef struct members
CUDA Arrays
Step 5: Bind the texture reference to array
cudaBindTextureToArray(texRef, array)
Step 6: Access texture
Similar to before, now we have more options:
tex1D(texRef, x)
tex2D(texRef, x, y)
CUDA Arrays
Final Notes:
Coordinates can be normalized to [0, 1] if in float mode
Filter modes: nearest point or linear
Tells CUDA how to blend texture
Wrap vs. clamp:
Wrap: out of bounds accesses wrap around to other side
Ex.: (1.5, 0.5) -> (0.5, 0.5)
Clamp: out of bounds accesses set to border value
Ex.: (1.5, 0.5) -> (1.0, 0.5)
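A minimal sketch of steps 1-6 together with the settings above, for a 2D float texture (legacy texture reference API, removed in CUDA 12; W, H, and hostData are placeholders):

```cuda
// Step 3: texture reference -- type/dim must match the channel description
texture<float, 2, cudaReadModeElementType> texRef;

void setup(const float *hostData, int W, int H) {
    // Step 1: channel description -- 32 bits in x, 0 in y/z/w, float kind
    cudaChannelFormatDesc desc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

    // Step 2: allocate the CUDA array and copy data into it
    cudaArray *array;
    cudaMallocArray(&array, &desc, W, H);
    cudaMemcpyToArray(array, 0, 0, hostData, W * H * sizeof(float),
                      cudaMemcpyHostToDevice);

    // Step 4: texture settings as texRef struct members
    texRef.addressMode[0] = cudaAddressModeWrap;    // wrap in x
    texRef.addressMode[1] = cudaAddressModeClamp;   // clamp in y
    texRef.filterMode     = cudaFilterModeLinear;   // blend between texels
    texRef.normalized     = true;                   // coordinates in [0, 1]

    // Step 5: bind the texture reference to the array
    cudaBindTextureToArray(texRef, array);
}

// Step 6: inside a kernel, read with tex2D(texRef, u, v)
```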
CUDA Arrays
(Figure: point sampling vs. linear sampling)
CUDA Arrays
(Figure: wrap vs. clamp addressing)
CUDA Runtime
Nothing new: every cuda____ function is part of the runtime
Lots of other helpful functions
Many runtime functions based on making your program
robust
Check properties of card, set up multiple GPUs, etc.
Necessary for multi-platform development!
CUDA Runtime
Starting the runtime:
Simply call a cuda_____ function!
CUDA can waste a lot of resources
Stop CUDA with cudaThreadExit()
Called automatically on CPU exit, but you may want to call earlier
CUDA Runtime
Getting devices and properties:
cudaGetDeviceCount(int * n);
Returns # of CUDA-capable devices
Can use to check if machine is CUDA-capable!
cudaSetDevice(int n)
Sets device n to the currently used device
cudaGetDeviceProperties(struct cudaDeviceProp *prop, int n);
Loads data from device n into prop
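The three calls above can be combined into a small host-side capability check (a sketch; prints a summary of every device, then selects device 0):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);            // how many CUDA-capable devices?
    if (n == 0) {
        printf("No CUDA-capable device found\n");
        return 1;
    }
    for (int d = 0; d < n; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, %d multiprocessors, %zu MB global memory\n",
               d, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem >> 20);
    }
    cudaSetDevice(0);                  // make device 0 the current device
    return 0;
}
```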
Device Properties
char name[256]: ASCII identifier of GPU
size_t totalGlobalMem: Total global memory available
size_t sharedMemPerBlock: Shared memory available per block
int regsPerBlock: How many registers we have per block
int warpSize: size of our warps
size_t memPitch: maximum pitch allowed for array allocation
int maxThreadsPerBlock: maximum number of threads/block
int maxThreadsDim[3]: maximum sizes of a block
Device Properties
int maxGridSize[3]: maximum grid sizes
size_t totalConstMem: maximum available constant memory
int major, int minor: major and minor compute capability numbers
int clockRate: clock rate of device in kHz
size_t textureAlignment: memory alignment required for
textures
int deviceOverlap: Does this device allow for memory
copying while kernel is running? (0 = no, 1 = yes)
int multiProcessorCount: # of multiprocessors on device
Device Properties
Uses?
Actually get values for memory, instead of guessing
Program to be accessible for multiple systems
Can get the best device
Device Properties
Getting the best device:
Pick a metric (Ex.: most multiprocessors could be good)
int num_devices, device;
cudaGetDeviceCount(&num_devices);
if (num_devices > 1) {
int max_mp = 0, best_device = 0;
for (device = 0; device < num_devices; device++) {
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);
int mp_count = prop.multiProcessorCount;
if (mp_count > max_mp) {
max_mp = mp_count;
best_device = device;
}
}
cudaSetDevice(best_device);
}
Device Properties
We can also use this to launch multiple GPUs
Each GPU must have its own host thread
Multithread on CPU, each thread calls different device
Set device on thread using cudaSetDevice(n);
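One way to sketch this pattern with C++ host threads, one per GPU (workOnDevice is a placeholder for whatever allocations, copies, and kernel launches each device should run):

```cuda
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Placeholder for per-device work: every CUDA call made in this
// host thread after cudaSetDevice(dev) targets device dev.
void workOnDevice(int dev) {
    cudaSetDevice(dev);
    // ... cudaMalloc, kernel<<<...>>>, cudaMemcpy for this device ...
    cudaDeviceSynchronize();   // wait for this device to finish
}

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    std::vector<std::thread> threads;
    for (int d = 0; d < n; d++)
        threads.emplace_back(workOnDevice, d);   // one host thread per GPU
    for (auto &t : threads)
        t.join();
    return 0;
}
```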
CUDA Runtime
Synchronization Note:
Most calls to GPU/CUDA are asynchronous
Some are synchronous (usually things dealing with memory)
Can force synchronization:
cudaThreadSynchronize()
Blocks until the device has completed all preceding requested tasks
Good for error checking, timing, etc.
CUDA Events
Great for timing!
Can place event markers in CUDA to measure time
Example code:
cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start, 0);
// DO SOME GPU CODE HERE
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsed_time;
cudaEventElapsedTime(&elapsed_time, start, stop);
CUDA Streams
Streams manage concurrency and ordering
Ex.: call malloc, then kernel 1, then kernel 2, etc.
Calls in different streams are asynchronous!
No guarantee about where each stream is in the code relative to the others
Using Streams
Create stream
cudaStreamCreate(cudaStream_t *stream)
Copy memory using async calls:
cudaMemcpyAsync(…, cudaStream_t stream)
Call in kernel as another parameter:
kernel<<<gridDim, blockDim, sMem, stream>>>
Query if stream is done:
cudaStreamQuery(cudaStream_t stream)
returns cudaSuccess if stream is done, cudaErrorNotReady otherwise
Block process until a stream is done:
cudaStreamSynchronize(cudaStream_t stream)
Destroy stream & cleanup:
cudaStreamDestroy(cudaStream_t stream)
Using Streams
Example:
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
cudaStreamCreate(&stream[i]);
for (int i = 0; i < 2; ++i)
cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
myKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
size, cudaMemcpyDeviceToHost, stream[i]);
cudaThreadSynchronize();
Next Time
Lab 4 Recitation:
3D Textures
Pixel Buffer Objects (PBOs)
Fractals!