Capturing the Teraflop
Overview
> Describing the GPU as a CPU
> Fundamental principles in familiar terms
> Problem Set Definition
> In what cases will I get the Teraflop?
> How to DirectCompute
> Step by Step
> Managing I/O
> Most codes are I/O bound
[Diagram: CPU 0 to CPU 3 sharing an L2 cache]
> 4 cores
> 4-float-wide SIMD
> 3GHz
> 48-96GFlops
> 2x HyperThreaded
> 64kB L1 cache per core
> 20GB/s to memory
> $200
> 200W
[Diagram: a grid of SIMD cores sharing an L2 cache]
> 32 cores
> 32-float-wide SIMD
> 1GHz
> 1TeraFlop
> 32x “HyperThreaded”
> 64kB L1 cache per core
> 150GB/s to memory
> $200
> 200W
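The headline numbers on these two slides can be checked with simple arithmetic. The sketch below assumes that peak throughput is the product of core count, SIMD width, clock rate, and flops issued per lane per clock. The last factor is not stated on the slides: the 48-96GFlops range for the CPU is consistent with 1 or 2 flops per lane per clock, and that reading is an assumption. The function name `peak_gflops` is hypothetical.

```cpp
#include <cassert>

// Peak GFlops = cores x SIMD lanes per core x clock (GHz) x flops per lane per clock.
// flops_per_lane_per_clock is an assumption, not a figure from the slides:
// 1 for a single add or multiply, 2 if a multiply and an add can issue together.
double peak_gflops(int cores, int simd_width, double clock_ghz,
                   int flops_per_lane_per_clock) {
    assert(cores > 0 && simd_width > 0 && clock_ghz > 0.0 &&
           flops_per_lane_per_clock > 0);
    return cores * simd_width * clock_ghz * flops_per_lane_per_clock;
}
```

Under these assumptions, `peak_gflops(4, 4, 3.0, 1)` gives 48 and `peak_gflops(4, 4, 3.0, 2)` gives 96 for the CPU, while `peak_gflops(32, 32, 1.0, 1)` gives 1024 for the GPU, which is roughly the 1TeraFlop quoted.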
[Diagram: the CPU (CPU 0 to CPU 3 with an L2 cache) shown next to the GPU (a grid of SIMD cores with an L2 cache)]
CPU
> Low latency memory
> Random accesses
> 20GB/s bandwidth
> 0.1TFlop compute
> 1GFlops/watt
> Well known programming model
GPU
> High bandwidth memory
> Sequential accesses
> 100GB/s bandwidth
> 1TFlop compute
> 10GFlops/watt
> Niche programming model
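An Asymmetric Multi-Processor System
[Diagram: CPU and GPU, each attached to its own RAM, joined by links of different bandwidth]
> CPU: 50GFlops
> GPU: 1TFlop
> CPU RAM: 4-6GB
> GPU RAM: 1GB
> Link bandwidths shown: 1GB/s, 10GB/s, 100GB/s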
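One way to read this diagram is to ask how many flops a processor can execute in the time it takes to move one byte over a given link. The sketch below computes that ratio. Which bandwidth belongs to which link is an assumption here, since the flattened diagram does not say: 100GB/s is taken as GPU to GPU RAM, 10GB/s as CPU to CPU RAM, and 1GB/s as the CPU-to-GPU link. The function name `flops_per_byte` is hypothetical.

```cpp
#include <cassert>

// Flops a processor can execute while one byte crosses a link:
// peak compute rate (GFlops) divided by link bandwidth (GB/s).
double flops_per_byte(double peak_gflops, double link_gbytes_per_s) {
    assert(peak_gflops > 0.0 && link_gbytes_per_s > 0.0);
    return peak_gflops / link_gbytes_per_s;
}
```

With the assumed link assignment, the GPU needs about 10 flops of work per byte read from its own RAM (`flops_per_byte(1000, 100)`) and about 1000 flops per byte that has to come across from the CPU (`flops_per_byte(1000, 1)`), which is one way to see why most codes end up I/O bound.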
GPUs are Data-Parallel Processors
> GPU has 1000s of simultaneous ALUs
> Need 100s of 1000s of threads to hit peak
> Only data elements come in such numbers
GPUs Need Data-Parallel Algorithms
> Image processing
> Reduction, Histogram, FFT, Summed Area Table
> Video processing
> transcode, effects, analysis
> Audio
> Linear Algebra
> Simulation/Modeling:
> Technical, Finance, Academic
> Some Databases
video
Applications <> Algorithms
> Most important algorithms have known data-parallel versions
> The algorithm is replaced with its data-parallel version:
> Sorting: Quicksort is swapped for Bitonic sort
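To illustrate why Bitonic sort suits a GPU, here is a minimal CPU-side sketch in C++. It is not the code from the talk. Within each (k, j) pass every index i does an independent compare-exchange, so on a GPU each i could be one thread. The sketch assumes the input length is a power of two; the names `bitonic_step` and `bitonic_sort` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One compare-exchange for index i. Only the lower index of each pair acts,
// so no two iterations of the same pass touch the same elements.
inline void bitonic_step(std::vector<float>& a, std::size_t i,
                         std::size_t j, std::size_t k) {
    const std::size_t partner = i ^ j;
    if (partner > i) {
        const bool ascending = (i & k) == 0;
        if ((a[i] > a[partner]) == ascending) std::swap(a[i], a[partner]);
    }
}

// Sorts in ascending order. Assumes a.size() is a power of two.
void bitonic_sort(std::vector<float>& a) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    for (std::size_t k = 2; k <= n; k <<= 1)          // size of bitonic sequences
        for (std::size_t j = k >> 1; j > 0; j >>= 1)  // compare distance
            for (std::size_t i = 0; i < n; ++i)       // would be one GPU thread each
                bitonic_step(a, i, j, k);
}
```

The innermost loop is the part that maps to a Dispatch() call; the two outer loops stay sequential because each pass depends on the previous one.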
demo
The Teraflop Today
N-Body Demo App:
AMD Phenom II X4 940 3GHz + Radeon HD 5850
> CPU: 13.7GFlops, Multicore SSE, not cache-aware
> GPU: 537GFlops, DirectCompute
Intel Xeon E5410 2.33GHz + Radeon HD 5870
> CPU: 25.5GFlops, Multicore SSE, not cache-aware
> GPU: 722GFlops, DirectCompute
[Chart: GFlops versus Log2(size)]
[Diagram: software stack, top to bottom]
> Applications: Media playback or processing, media UI, recognition, etc.; Technical
> Domain Libraries: MKL, ACML, cuFFT, D3DX, etc.
> Domain Languages: Accelerator, Brook+, Rapidmind, Ct
> Compute Languages: DirectCompute, CUDA, CAL, OpenCL, LRB Native, etc.
> Processors: CPU, GPU, Larrabee (nVidia, Intel, AMD, S3, etc.)
DirectCompute Adds Client Scenarios
> Support for multiple vendors
> All DirectX11 chips will support DirectCompute
> Some DirectX10 chips already support it
> Tight integration with rendering
> Client scenarios involve interactive playback
> Support media data-types
> Hardware format conversion for pixel formats
> Server scenarios still supported
Code Walkthrough
DirectCompute Usage
> Initialize DirectCompute
> Create some GPU code in .hlsl
> Compile it using DirectX compiler
> Load the code onto the GPU
> Set up a GPU buffer for input data
> And set up a view into it for access
> Make that data view current
> Execute the code on the GPU
> Copy the data back to CPU memory
The HLSL Language
> HLSL is the most widely used language for
Data Parallel Programming
> Syntax is similar to ‘C/C++’
> Preprocessor defines (#define, #ifdef, etc)
> Basic types (float, int, uint, bool, etc)
> Operators, variables, functions
> Has some important differences
> No pointers 
> Built-in variables & types (float4, matrix, etc)
> Intrinsic functions (mul, normalize, etc)
HLSL Code
> Compiler (fxc or library) generates target-specific instructions (IL) from shader
> Different instruction sets for different generations of hardware
> Shader IL is highly optimized
[Diagram: HLSL Code -> FXC or D3D Compiler API -> Intermediate Language -> IHV Driver -> Hardware Native Code]
DirectX Resources
> Data Objects in memory
> Enable out-of-bounds memory checking
> Improves security, reliability of shipped code
> Returns 0 on reads
> Writes are No-Ops
> Facilitates interop with Direct3D for display
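The out-of-bounds behavior listed above can be modeled on the CPU. The class below is a hypothetical sketch, not part of any DirectX API: reads past the end return 0 and writes past the end are silently dropped, which is the behavior the slide describes for resources.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU-side model of a resource with out-of-bounds checking.
class CheckedBuffer {
public:
    explicit CheckedBuffer(std::size_t count) : data_(count, 0.0f) {}

    // Out-of-bounds reads return 0 instead of touching other memory.
    float read(std::size_t index) const {
        return index < data_.size() ? data_[index] : 0.0f;
    }

    // Out-of-bounds writes are no-ops.
    void write(std::size_t index, float value) {
        if (index < data_.size()) data_[index] = value;
    }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<float> data_;
};
```

A stray index therefore cannot corrupt neighboring data, which is the security and reliability benefit the slide points to.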
DirectX Resource Types
> Buffer
> Defines an arbitrary data struct for the records in this buffer object
> Includes structured, raw, and streaming buffers
> Texture*
> Storage for data that will be used in pixel tasks
> Includes 1-D, 2-D, 3-D, Cubes and arrays thereof
Buffer Resource Types
> Structured
> Defines records of a fixed size
> Pixel data format is not specified, so automatic type/format conversion is not provided
> Unstructured
> Can provide type/format conversion
> Both types support non-order-preserving I/O
> For use with Append()/Consume()
Image/Media Resource Types
> Texture1D, 2D, 3D, Cube, Array
> A 2-D array of Pixels in specified format
> R8G8B8A8, R32_UINT, R16G16_UINT
Resource Views
> Resource Views define the access mechanism
for data stored in Resources (buffers)
> Support cool features like:
> Hardware accelerated format conversion
> Hardware accelerated linear filtering/sampling
> Can create multiple views onto one resource
> Enable data polymorphism while providing
info to implementation for optimal layout
Unordered Access View (UAV)
> Enables two alternative usage patterns:
> Unordered/random/scattered I/O to the
buffer it is created into
> Indexed operations for I/O
> myBuffer[index] = x;
> For Texture2D Resource, index is uint2
> Or Non-Order-Preserving I/O
> Using Append()/Consume() intrinsics
> For fastest performance when ordering of records need not be preserved
> Or when the number of writes is unknown
Append( ResourceVar, val);
> Corresponding read operation provided for
completeness
Consume( ResourceVar, val);
> Requires buffer to have flag enabling this
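A rough CPU-side model may help show why non-order-preserving I/O can be cheap. The sketch below assumes the buffer keeps a hidden element counter that is bumped atomically, so each writer claims a slot without knowing how many other writes there will be. That mechanism is an assumption for illustration, and the class `AppendConsumeBuffer` is hypothetical, not a DirectX type. It does not handle appends and consumes running concurrently with each other.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical model of an append/consume buffer with a hidden counter.
class AppendConsumeBuffer {
public:
    explicit AppendConsumeBuffer(std::size_t capacity)
        : data_(capacity), count_(0) {}

    // Claims the next free slot with an atomic increment, then stores value.
    // Returns false (and undoes the claim) if the buffer is full.
    bool append(float value) {
        const std::size_t slot = count_.fetch_add(1);
        if (slot >= data_.size()) { count_.fetch_sub(1); return false; }
        data_[slot] = value;
        return true;
    }

    // Releases the most recently claimed slot and returns its value.
    // Returns false if the buffer is empty.
    bool consume(float& value) {
        std::size_t old = count_.load();
        while (old > 0 && !count_.compare_exchange_weak(old, old - 1)) {}
        if (old == 0) return false;
        value = data_[old - 1];
        return true;
    }

    std::size_t count() const { return count_.load(); }

private:
    std::vector<float> data_;
    std::atomic<std::size_t> count_;
};
```

With many writers the slot each value lands in depends on timing, which is why the order of records is not preserved.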
Shader Resource View (SRV)
> Enables hardware accelerated filtered sampling of the buffer
> This hardware is a significant fraction of chip area
> Excellent for pixel data (images/video)
> A single pixel format defined per View
> Read-Only operation
> Same resource cannot be bound to shader as SRV and as another view type at the same time
> Can also load w/o filtering
[Diagram: GPU Memory feeds the Texture Samplers (pixel format conversion, bi-linear filtering, gamma correction), which feed the ALUs (Shader Execution); results leave through the Output Mergers (gamma correction, pixel format conversion, framebuffer prefetch). Latencies shown: ~50 clocks and 250 clocks.]
> Not all threads in the call can/should share
registers with each other
> Compute threads are structured into
subsets or groups of threads
> Thread indices are available to the code:
> SV_DispatchThreadID: index of thread in call
> SV_GroupThreadID: index of thread in group
> SV_GroupID: index of group in call
pDev11->Dispatch(3, 2, 1);
[numthreads(4, 4, 1)]
void MyCS(…)
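The relationship between the three indices can be worked through on the CPU. The sketch below enumerates every thread that a call like Dispatch(3, 2, 1) with [numthreads(4, 4, 1)] would launch, using the relation SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID per axis. The struct and function names are hypothetical helpers for illustration.

```cpp
#include <vector>

struct UInt3 { unsigned x, y, z; };

struct ThreadIds {
    UInt3 group;         // SV_GroupID: index of group in call
    UInt3 group_thread;  // SV_GroupThreadID: index of thread in group
    UInt3 dispatch;      // SV_DispatchThreadID: index of thread in call
};

// Lists the indices of every thread launched by Dispatch(groups)
// for a shader declared with [numthreads(n.x, n.y, n.z)].
std::vector<ThreadIds> enumerate_threads(UInt3 groups, UInt3 n) {
    std::vector<ThreadIds> out;
    for (unsigned gz = 0; gz < groups.z; ++gz)
    for (unsigned gy = 0; gy < groups.y; ++gy)
    for (unsigned gx = 0; gx < groups.x; ++gx)
    for (unsigned tz = 0; tz < n.z; ++tz)
    for (unsigned ty = 0; ty < n.y; ++ty)
    for (unsigned tx = 0; tx < n.x; ++tx) {
        ThreadIds t;
        t.group = UInt3{gx, gy, gz};
        t.group_thread = UInt3{tx, ty, tz};
        t.dispatch = UInt3{gx * n.x + tx, gy * n.y + ty, gz * n.z + tz};
        out.push_back(t);
    }
    return out;
}
```

For the call on the slide this yields 3 x 2 x 1 = 6 groups of 4 x 4 x 1 = 16 threads, 96 threads in total, and the last thread (group (2, 1, 0), thread (3, 3, 0)) has dispatch index (11, 7, 0).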
aka General Purpose Registers
> Used for fast local variable storage
> Built as a block in each SIMD core
> 16k 32-bit registers per core
> Registers available per thread depends on
number of threads in the group (group size)
> E.g. 16k registers/1024 threads in group means
each thread gets 16 DWORDs
> Exceeding this limit has perf impacts:
> Registers may be spilled to memory, or
> Threads on core may be cut back (fewer ‘HyperThreads’)
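The register budget on this slide is a simple division, sketched below. The model assumes the whole per-core register file is split evenly among the threads of one group; real hardware allocation may differ, and both function names are hypothetical.

```cpp
#include <cassert>

// Registers each thread can use if the per-core register file is divided
// evenly among the threads of one group (a simplifying assumption).
unsigned registers_per_thread(unsigned registers_per_core,
                              unsigned threads_in_group) {
    assert(threads_in_group > 0);
    return registers_per_core / threads_in_group;
}

// Largest group size that still leaves each thread `needed` registers.
unsigned max_group_size(unsigned registers_per_core, unsigned needed) {
    assert(needed > 0);
    return registers_per_core / needed;
}
```

With the slide's figures, `registers_per_thread(16 * 1024, 1024)` gives 16 DWORDs per thread. Going the other way, a shader that needs 32 registers per thread fits at most `max_group_size(16 * 1024, 32)`, or 512, threads per group under this model before spilling or reduced occupancy sets in.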
Groupshared Memory
> New register type variable storage class
> groupshared float sfFoo;
> A whole group of threads can access the
same memory
> Enables uses like user-controlled cache
> Max 32kB can be shared in DirectX11
> 8k floats or 2k float4s
> Vs 64kB of temporary registers
> 16k floats or 4k float4s
> Using fewer is usually faster
GroupMemoryBarrier
DeviceMemoryBarrier
AllMemoryBarrier
> All I/O ops at the specified scope (group, device, or both) issued before this point must complete before any later I/O ops begin
GroupMemoryBarrierWithGroupSync
DeviceMemoryBarrierWithGroupSync
AllMemoryBarrierWithGroupSync
> All I/O ops at the specified scope (group, device, or both) issued before this point must complete before any later I/O ops begin
> AND all threads in the group must reach this point before any can continue
Barrier Example
Shader()
{
    groupshared float GS[GROUPSIZE];      // element type assumed to be float
    …compute the indices…
    GS[sid] = myBuffer[Tid];              // Load my data element
    GroupMemoryBarrierWithGroupSync();
    // process the data in groupshared memory
    …
    …
    GroupMemoryBarrierWithGroupSync();
    outBuffer[Tid] = GS[sid];             // write my data element
}
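The effect of the barrier can be imitated on the CPU by running the group's threads phase by phase. In the sketch below each inner loop over `sid` stands for all threads of one group executing the same stretch of code, and the boundary between the two loops plays the role of GroupMemoryBarrierWithGroupSync(). The neighbor-sum operation is a made-up stand-in for the elided processing step, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each output element is the sum of a thread's own input and that of its
// right-hand neighbor within the same group (wrapping at the group edge).
std::vector<float> neighbor_sum(const std::vector<float>& in,
                                std::size_t group_size) {
    assert(group_size > 0 && in.size() % group_size == 0);
    std::vector<float> out(in.size());
    std::vector<float> gs(group_size);  // stands in for groupshared memory

    for (std::size_t base = 0; base < in.size(); base += group_size) {
        // Phase 1: every thread loads its own element into shared storage.
        for (std::size_t sid = 0; sid < group_size; ++sid)
            gs[sid] = in[base + sid];

        // Barrier: no thread proceeds until all loads above have finished.

        // Phase 2: every thread reads its own slot and a neighbor's slot.
        for (std::size_t sid = 0; sid < group_size; ++sid)
            out[base + sid] = gs[sid] + gs[(sid + 1) % group_size];
    }
    return out;
}
```

Without the barrier, a real GPU thread could read its neighbor's slot before the neighbor had written it, which is the hazard the sync in the slide's example guards against.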
Implementation Secrets
> Thread Group corresponds to a SIMD core
> 1 of 16-32 on the die
> Groupshared memory corresponds to a
partition of that core’s L1 cache
> GroupMemoryBarrier() corresponds to a
flush of that core’s I/O
Data Parallel I/O
> I/O with 1600 active threads is not trivial
> Reads are broadcast, so should be fast, but:
> Writes by many threads to one destination
can result in serialization
> Less Obvious:
> Even writing to a sequential location results
in serialization on access to the address
counter
> This is why DirectCompute provides a rich
set of I/O operations and intrinsics
> DirectX11 Compute Shader runs on most current
DirectX10 and 10.1 (4.x) parts
> Explicit thread Dispatch()
> Random-access I/O via resource variables
> Private Write/Shared Read on groupshared data
> New DirectX11-class (5.x) hardware adds:
> Arbitrary accesses to groupshared data
> Atomic intrinsic operators
> Hardware format conversion on I/O
> More streaming I/O methods
Feature                       CS 4.x                        CS 5.0
Supported devices             DirectX10, DirectX11, Ref     DirectX11, Ref
Supported OSs                 Windows7, Vista, S2008        Windows7, Vista, S2008
Max number of threads/group   768                           1024
Restrictions on Zn            Zn = 1                        1 <= Zn <= 64
# 32-bit registers*           4k                            8k
Shared register access        Private Write / Shared Read   Full Indexed
Atomic operations             Not supported                 Supported
Max number of bound UAVs      1                             8
Double Precision              No                            Optional
DispatchIndirect( )           No                            Supported
> http://support.microsoft.com/kb/971644
> http://msdn.microsoft.com/directx
Call to Action
> Install the DirectX11 SDK
> Try out the DirectCompute samples
> Look for parts of your code that are data
parallel
> Swap in GPU code using DirectCompute
> Experience Teraflop computing today
channel9.msdn.com/learn
Built by Developers for Developers….
© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT
MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
