GCN Performance FTW, by Stephan Hodes

GCN PERFORMANCE „FTW“
AMD AND MICROSOFT DEVELOPER DAY, JUNE 2014, STOCKHOLM
STEPHAN HODES
DEVELOPER TECHNOLOGY ENGINEER, AMD
AGENDA
GCN architecture explained
Top 10: GCN Performance Advice
Questions
2 | GCN PERFORMANCE „FTW“ | AMD AND MICROSOFT GAME DEVELOPER DAY - JUNE 2 2014, STOCKHOLM
AMD GRAPHICS CORE NEXT
What is GCN?
‒ Non-VLIW architecture
‒ Less dependent on manual vectorization of shaders
‒ Susceptible to register pressure
‒ Architecture used in:
‒ AMD discrete GPUs since 2012 (HD7700 and better)
‒ Kabini and Kaveri APUs
‒ Future AMD hardware
‒ New consoles
GCN Hardware is required for Mantle
‒ DirectX 12 API support
PRODUCT SPECIFICATIONS
AMD RADEON™ R9 290 SERIES
                            R9 290X                            R9 290
Compute Units               44                                 40
Engine Clock                Up to 1 GHz                        Up to 950 MHz
Compute Performance         5.6 TFLOPS                         4.9 TFLOPS
Memory Configuration        4GB GDDR5 / 512-bit                4GB GDDR5 / 512-bit
Memory Speed                5.0 Gbps                           5.0 Gbps
AMD TrueAudio Technology    Yes                                Yes
API Support                 DirectX 11.2, OpenGL 4.3, Mantle   DirectX 11.2, OpenGL 4.3, Mantle
GCN COMPUTE UNIT – SPECIFICS
• Non-VLIW instruction set architecture
• 4 [16-lane] Vector ALUs (SIMD)
‒ One wavefront is 64 threads
‒ 1 SP (Single-Precision) op: 4 clocks
‒ 1 DP (Double-Precision) ADD: 8 clocks
‒ 1 DP MUL/FMA & Transcendental: 16 clocks
‒ 64KB Vector GPRs
• 1 fully programmable scalar ALU
‒ Shared by all threads of a wavefront
‒ Used for flow control, pointer arithmetic, etc.
‒ 8KB Scalar GPRs, scalar data cache, etc.
[Diagram: compute unit block layout. Branch & Message Unit, Scheduler, Vector Units (4x SIMD-16), Vector Registers (VGPRs, 4x 64KB), Scalar Unit, Scalar Registers (SGPRs, 8KB), Local Data Share (LDS, 64KB), Texture Filter Units (4), Texture Fetch Load / Store Units (16), L1 Cache (16KB)]
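The clock counts on this slide and the TFLOPS figures in the product table both follow from the SIMD width. A minimal Python sketch of that arithmetic, assuming a fused multiply-add counts as 2 FLOPs per lane per clock (the usual convention, not stated on the slides). The function names and the `passes` parameter are illustrative, not part of any AMD tool.

```python
import math

WAVEFRONT_SIZE = 64   # threads per wavefront (from the slide)
SIMD_LANES = 16       # lanes per vector ALU (from the slide)
SIMDS_PER_CU = 4      # vector ALUs per compute unit (from the slide)


def clocks_per_wavefront_op(passes=1):
    """Clocks one SIMD needs to push a whole wavefront through one instruction.

    `passes` is a hypothetical multiplier: 1 for single precision,
    2 for a DP add, 4 for DP mul/FMA and transcendentals.
    """
    return math.ceil(WAVEFRONT_SIZE / SIMD_LANES) * passes


def peak_sp_tflops(compute_units, engine_clock_ghz, flops_per_lane_per_clock=2):
    """Peak single-precision rate, assuming one FMA per lane per clock."""
    lanes = compute_units * SIMDS_PER_CU * SIMD_LANES
    return lanes * flops_per_lane_per_clock * engine_clock_ghz / 1000.0


if __name__ == "__main__":
    print(clocks_per_wavefront_op())            # single-precision op
    print(clocks_per_wavefront_op(passes=2))    # double-precision add
    print(round(peak_sp_tflops(44, 1.0), 1))    # R9 290X
    print(round(peak_sp_tflops(40, 0.95), 1))   # R9 290
```

With the table's compute unit counts and clocks, the rounded results agree with the 5.6 and 4.9 TFLOPS quoted for the two boards.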
GCN COMPUTE UNIT – SPECIFICS
• Distributed programmable scheduler (up to 2560 threads)
‒ Each compute unit can execute instructions from multiple kernels
‒ Separate decode/issue for:
‒ 1 Vector Arithmetic Logic Unit (ALU)
‒ 1 Scalar ALU or Scalar Memory Read or 1 Branch/Message
‒ 1 Vector memory access (Read/Write/Atomic)
‒ 1 Local Data Share operation (LDS)
‒ 1 Export or Global Data Share operation (GDS)
‒ Plus 1 Special/Internal – [no functional unit] (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio)
[Diagram: compute unit block layout, repeated]
GCN COMPUTE UNIT – SPECIFICS
• 64KB Local Data Share (LDS)
‒ 32 banks, with conflict resolution
‒ Bandwidth amplification
• 16KB read/write L1 vector data cache
• Texture Units (utilize L1)
‒ 16 Load/Store units
‒ 4 Filter units
• 1 Branch & Message Unit
‒ Executes branch instructions (as dispatched by Scalar Unit)
[Diagram: compute unit block layout, repeated]
GCN COMPUTE UNIT – LATENCY HIDING
• Up to 10 Wavefronts/SIMD
‒ Used to hide latency
‒ Round Robin scheduling
‒ Independent kernels
‒ Often limited by GPR or LDS usage
[Diagram: timeline (clocks) of Batch 1 to Batch 4, each alternating between Runnable and Stall until Done]
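The 10 wavefronts/SIMD limit ties back to the scheduler's "up to 2560 threads" figure, and it bounds how much stall time other wavefronts can cover. A rough Python sketch, assuming a deliberately simplified model that is not from the slides: while one wavefront stalls, every other resident wavefront contributes a fixed amount of ALU work. The latency and ALU clock values in the usage lines are made-up inputs.

```python
import math

SIMDS_PER_CU = 4          # from the slides
MAX_WAVES_PER_SIMD = 10   # from the slides
WAVEFRONT_SIZE = 64       # from the slides


def max_threads_per_cu():
    """Threads the scheduler can keep resident on one compute unit."""
    return SIMDS_PER_CU * MAX_WAVES_PER_SIMD * WAVEFRONT_SIZE


def waves_needed_to_hide(latency_clocks, alu_clocks_per_wave):
    """Estimate waves per SIMD needed to cover one wave's stall.

    Simplified assumption: each of the other waves supplies
    `alu_clocks_per_wave` clocks of useful work during the stall.
    Returns (waves clamped to the hardware limit, whether the stall is covered).
    """
    needed = 1 + math.ceil(latency_clocks / alu_clocks_per_wave)
    return min(needed, MAX_WAVES_PER_SIMD), needed <= MAX_WAVES_PER_SIMD


if __name__ == "__main__":
    print(max_threads_per_cu())
    print(waves_needed_to_hide(400, 100))  # enough ALU work per wave
    print(waves_needed_to_hide(400, 20))   # too little ALU work to hide the stall
```

The second usage line shows the case the slide warns about: with little ALU work between fetches, even the full 10 wavefronts cannot cover the stall in this model.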
GCN COMPUTE UNIT – REGISTER PRESSURE
• Vector GPRs
‒ 64KB / 64 threads / 4 bytes / 10 wavefronts = 25.6 VGPRs/thread => max 24 VGPRs per thread
• Scalar GPRs
‒ 8KB / 4 SIMDs / 4 bytes / 10 wavefronts = 51.2 SGPRs/wavefront => max 48 SGPRs per wavefront
• LDS
‒ 32KB/threadgroup and threadgroup size 64 => 2 wavefronts/CU max.
‒ 32KB/threadgroup and threadgroup size 256 => 8 wavefronts/CU max.
‒ 16KB/threadgroup and threadgroup size 256 => 16 wavefronts/CU max.
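The register and LDS budgets above can be reproduced with a few divisions. A minimal Python sketch, assuming registers are handed out in blocks of 4 VGPRs and 8 SGPRs; those granularities are inferred from the slide's rounding of 25.6 to 24 and 51.2 to 48 and are exposed as parameters. All function names are illustrative.

```python
import math

WAVEFRONT_SIZE = 64
REGISTER_BYTES = 4
VGPR_BYTES_PER_SIMD = 64 * 1024
SGPR_BYTES_PER_CU = 8 * 1024
LDS_BYTES_PER_CU = 64 * 1024
SIMDS_PER_CU = 4


def round_down(value, granularity):
    """Round down to a multiple of the (assumed) allocation granularity."""
    return int(value // granularity) * granularity


def vgpr_budget_per_thread(waves_per_simd, granularity=4):
    """Largest VGPR count per thread that still fits `waves_per_simd` waves."""
    raw = VGPR_BYTES_PER_SIMD / WAVEFRONT_SIZE / REGISTER_BYTES / waves_per_simd
    return round_down(raw, granularity)


def sgpr_budget_per_wavefront(waves_per_simd, granularity=8):
    """Largest SGPR count per wavefront that still fits `waves_per_simd` waves."""
    raw = SGPR_BYTES_PER_CU / SIMDS_PER_CU / REGISTER_BYTES / waves_per_simd
    return round_down(raw, granularity)


def lds_limited_waves_per_cu(lds_bytes_per_group, threads_per_group):
    """Wavefronts per CU when LDS usage is the only limit."""
    groups = LDS_BYTES_PER_CU // lds_bytes_per_group
    waves_per_group = math.ceil(threads_per_group / WAVEFRONT_SIZE)
    return groups * waves_per_group


if __name__ == "__main__":
    print(vgpr_budget_per_thread(10), sgpr_budget_per_wavefront(10))
    for lds_kb, size in [(32, 64), (32, 256), (16, 256)]:
        print(lds_kb, size, lds_limited_waves_per_cu(lds_kb * 1024, size))
```

Calling `vgpr_budget_per_thread` with a smaller wave count shows how quickly the per-thread budget grows once full occupancy is given up.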
GCN SHADER OPTIMIZATION STRATEGIES
• Try reducing GPR count if you are slightly over a waves-per-SIMD threshold
‒ Deep nesting
‒ Local array declarations
‒ Long-lived temporary variables
• Reducing GPRs is not always optimal
‒ The shader compiler might use GPRs to reduce latency
‒ A high number of threads/CU can thrash your caches
More GPRs (results in v6 to v9): all four loads are issued before the first wait
  image_load v6, v[35:38], s[4:11]
  v_mov_b32 v3, v35
  image_load v7, v[3:6], s[4:11]
  v_mov_b32 v38, v36
  image_load v8, v[37:40], s[4:11]
  v_mov_b32 v3, v37
  image_load v9, v[3:6], s[4:11]
  s_waitcnt vmcnt(2)
  v_min_f32 v6, v6, v7
  s_waitcnt vmcnt(1)
  v_min_f32 v6, v6, v8
  s_waitcnt vmcnt(0)
  v_min_f32 v40, v6, v9
Fewer GPRs (results only in v6 and v7): later loads wait until earlier results are consumed
  image_load v6, v[35:38], s[4:11]
  v_mov_b32 v3, v35
  image_load v7, v[3:6], s[4:11]
  v_mov_b32 v38, v36
  v_mov_b32 v3, v37
  s_waitcnt vmcnt(0)
  v_min_f32 v6, v6, v7
  image_load v7, v[37:40], s[4:11]
  s_waitcnt vmcnt(0)
  v_min_f32 v6, v6, v7
  image_load v7, v[3:6], s[4:11]
  s_waitcnt vmcnt(0)
  v_min_f32 v6, v6, v7
Always profile your changes!
http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-tools-sdks/codexl/
http://developer.amd.com/community/blog/2014/05/16/codexl-game-developers-analyze-hlsl-gcn
Top 10 Performance Advice
TOP 10 PERFORMANCE ADVICE
1. Use the power of DirectCompute
‒ Thread group size should be a multiple of 64
‒ 256 is often a good choice.
‒ Don't underestimate the benefits of LDS
‒ Use asynchronous compute
‒ Don't switch between Compute/Rasterization too frequently
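The "multiple of 64" rule follows from the wavefront size: a group that is not a multiple of 64 still occupies whole wavefronts, so some lanes do no work. A small Python sketch of that arithmetic; the function names are illustrative, and the 1920x1080 workload is just an example input.

```python
import math

WAVEFRONT_SIZE = 64  # threads per wavefront (from the architecture slides)


def wavefronts_per_group(threads_per_group):
    """Whole wavefronts a thread group occupies."""
    return math.ceil(threads_per_group / WAVEFRONT_SIZE)


def lane_utilization(threads_per_group):
    """Fraction of allocated SIMD lanes that actually carry a thread."""
    waves = wavefronts_per_group(threads_per_group)
    return threads_per_group / (waves * WAVEFRONT_SIZE)


def groups_to_dispatch(item_count, threads_per_group=256):
    """Thread groups needed to cover `item_count` work items."""
    return math.ceil(item_count / threads_per_group)


if __name__ == "__main__":
    for size in (64, 100, 256):
        print(size, wavefronts_per_group(size), lane_utilization(size))
    print(groups_to_dispatch(1920 * 1080))  # one thread per pixel
```

A group of 100 threads, for example, occupies two wavefronts and leaves 28 of the 128 lanes idle.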
TOP 10 PERFORMANCE ADVICE
2. Don't over-tessellate
‒ Small triangles result in poor quad occupancy
‒ Use [maxtessfactor(X)] in Hull Shader declaration
‒ Recommended value is 15 or less
‒ Implement culling in Hull Shader
‒ Use Adaptive Tessellation
‒ Distance Adaptive
‒ Screen Space Adaptive
‒ Orientation Adaptive
Especially when rendering shadow maps!
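Distance-adaptive tessellation with a capped factor and hull-shader culling can be sketched on the CPU as follows. This is a minimal Python model, assuming a linear falloff between hypothetical `near` and `far` distances; the slides do not prescribe a falloff curve. It relies on the Direct3D 11 rule that a patch with a tessellation factor of 0 is culled.

```python
def distance_adaptive_tess_factor(distance, near, far,
                                  max_factor=15.0, min_factor=1.0):
    """Linearly reduce the tessellation factor from max_factor at `near`
    to min_factor at `far`, clamping outside that range."""
    if far <= near:
        raise ValueError("far must be greater than near")
    t = (distance - near) / (far - near)
    t = min(max(t, 0.0), 1.0)
    return max_factor + (min_factor - max_factor) * t


def patch_tess_factor(distance, near, far, visible):
    """Return 0.0 for culled patches, otherwise the adaptive factor."""
    if not visible:
        return 0.0  # a factor of 0 makes the tessellator discard the patch
    return distance_adaptive_tess_factor(distance, near, far)


if __name__ == "__main__":
    for d in (5.0, 10.0, 55.0, 100.0, 500.0):
        print(d, patch_tess_factor(d, near=10.0, far=100.0, visible=True))
    print(patch_tess_factor(20.0, near=10.0, far=100.0, visible=False))
```

In a real hull shader the same logic would run per patch in the patch-constant function, with `max_factor` matching the value passed to [maxtessfactor(X)].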
TOP 10 PERFORMANCE ADVICE
3. Keep your pipeline short
‒ Avoid large expansion in the Geometry Shader
‒ Often a Vertex Shader-only solution can replace Geometry Shader usage
‒ Bokeh expansion
‒ Point sprites
‒ Disable tessellation pipeline if unused
4. Pack shader stage output
‒ Limit Vertex and Domain Shader output size to 4 float4/int4 attributes for best performance.
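// Suboptimal: 12 floats spread over five attributes
// (TEXCOORDn semantics are illustrative; the slide omits them)
struct PS_INPUT
{
    float3 vPosition  : TEXCOORD0;
    float3 vNormal    : TEXCOORD1;
    float2 vTexcoord1 : TEXCOORD2;
    float2 vTexcoord2 : TEXCOORD3;
    float2 vTexcoord3 : TEXCOORD4;
};

// Good: the same 12 floats packed into three float4 attributes
struct PS_INPUT
{
    float4 vPositionTexcoord1U : TEXCOORD0; // xyz = position, w = texcoord1.u
    float4 vNormalTexcoord1V   : TEXCOORD1; // xyz = normal,   w = texcoord1.v
    float4 vTexcoords23        : TEXCOORD2; // xy = texcoord2, zw = texcoord3
};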
TOP 10 PERFORMANCE ADVICE
5. Update your data using Map/Unmap
‒ Avoid MAP_WRITE_DISCARD
‒ Prefer MAP_WRITE_NO_OVERWRITE
‒ Avoid UpdateSubresource
‒ Prefer Map and/or CopyResource instead
‒ UpdateSubresource is OK for small (<=4KB) updates
‒ CopyResource introduces GPU stalls
‒ Don't use the updated resource immediately
‒ Using data without copying it to local memory first can sometimes improve performance
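One common way to favour MAP_WRITE_NO_OVERWRITE is to treat a dynamic buffer as a ring and only fall back to MAP_WRITE_DISCARD when it wraps. The slides do not describe this pattern, so the Python sketch below is an assumption about how the advice might be applied. It models only the CPU-side offset bookkeeping; the class name, the 256-byte alignment, and the returned mode strings are hypothetical.

```python
class DynamicBufferRing:
    """CPU-side bookkeeping for sub-allocating one dynamic buffer."""

    def __init__(self, capacity_bytes, alignment=256):
        self.capacity = capacity_bytes
        self.alignment = alignment   # hypothetical alignment requirement
        self.head = 0                # next free byte offset

    def allocate(self, size_bytes):
        """Return (offset, map_mode) to use for the next write."""
        aligned = -(-size_bytes // self.alignment) * self.alignment
        if aligned > self.capacity:
            raise ValueError("allocation larger than the buffer")
        if self.head + aligned > self.capacity:
            # Wrapped around: earlier regions may still be read by the GPU,
            # so this is the one place a discard is requested.
            self.head = aligned
            return 0, "MAP_WRITE_DISCARD"
        offset = self.head
        self.head += aligned
        return offset, "MAP_WRITE_NO_OVERWRITE"


if __name__ == "__main__":
    ring = DynamicBufferRing(1024)
    for size in (300, 300, 100, 100):
        print(size, ring.allocate(size))
```

Sizing the buffer for several frames of updates keeps wrap-arounds, and therefore discards, rare.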
TOP 10 PERFORMANCE ADVICE
6. Use flow control with care
‒ Flow control has little overhead
‒ Skipping data fetches usually is good
‒ Avoid non-coherent codepaths within a wavefront
‒ Watch out for GPR pressure caused by loops and deeply nested branches
// Branching code example
float fn0(float a, float b)
{
    if (a > b)
        return ((a - b) * a);
    else
        return ((b - a) * b);
}

  v_cmp_gt_f32    r0,r1          //a > b, establish VCC
  s_mov_b64       s0,exec        //Save current exec mask
  s_and_b64       exec,vcc,exec  //Do "if"
  s_cbranch_vccz  label0         //Branch if all lanes fail
  v_sub_f32       r2,r0,r1       //result = a - b
  v_mul_f32       r2,r2,r0       //result = result * a
label0:
  s_andn2_b64     exec,s0,exec   //Do "else" (s0 & !exec)
  s_cbranch_execz label1         //Branch if all lanes fail
  v_sub_f32       r2,r1,r0       //result = b - a
  v_mul_f32       r2,r2,r1       //result = result * b
label1:
  s_mov_b64       exec,s0        //Restore exec mask
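The cost of non-coherent code paths can be seen by emulating the exec-mask sequence for one wavefront. The Python sketch below is a rough model of the listing above, not a cycle-accurate simulator: it counts vector instructions issued and skips a branch body only when no lane needs it. The function name is illustrative, and a 2 or 4 lane "wavefront" is used only to keep the usage lines short.

```python
def fn0_wavefront(a_values, b_values):
    """Emulate fn0 for all lanes of one wavefront.

    Returns (per-lane results, number of vector instructions issued).
    """
    lanes = len(a_values)
    saved_exec = [True] * lanes                             # s_mov_b64 s0, exec
    vcc = [a > b for a, b in zip(a_values, b_values)]       # v_cmp_gt_f32
    result = [0.0] * lanes
    vector_ops = 1                                          # the compare

    exec_mask = [v and e for v, e in zip(vcc, saved_exec)]  # s_and_b64
    if any(exec_mask):                                      # else: branch to label0
        for i in range(lanes):
            if exec_mask[i]:
                result[i] = (a_values[i] - b_values[i]) * a_values[i]
        vector_ops += 2                                     # v_sub_f32 + v_mul_f32

    exec_mask = [s and not e for s, e in zip(saved_exec, exec_mask)]  # s_andn2_b64
    if any(exec_mask):                                      # else: branch to label1
        for i in range(lanes):
            if exec_mask[i]:
                result[i] = (b_values[i] - a_values[i]) * b_values[i]
        vector_ops += 2                                     # v_sub_f32 + v_mul_f32

    return result, vector_ops


if __name__ == "__main__":
    print(fn0_wavefront([3.0, 3.0, 3.0, 3.0], [1.0, 1.0, 1.0, 1.0]))  # coherent
    print(fn0_wavefront([3.0, 1.0], [1.0, 3.0]))                      # divergent
```

When every lane takes the same side, only one branch body is issued; a single lane going the other way makes the whole wavefront pay for both.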
TOP 10 PERFORMANCE ADVICE
7. Pack your G-Buffer using RGBA16_UINT
‒ Fetches from RGBA16 are full rate (without filtering)
‒ Bilinear fetches from RGBA16 are half rate
‒ Exports to RGBA16_INT are full rate (without blending)
Caution: Blended exports to RGBA16_INT are ¼ speed
8. Depth buffer: don’t render after read
‒ Binding a depth buffer as a texture will decompress it, which makes subsequent Z ops more expensive.
‒ Critical for shadow map atlas rendering!
‒ Consider exporting depth to G-Buffer
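Packing a G-Buffer into RGBA16_UINT means quantizing attributes by hand and combining them into four 16-bit integers. The Python sketch below shows one possible layout; the layout itself (two 16-bit normal components, 8-bit albedo and roughness) is a hypothetical example rather than the one used in the talk, and inputs are assumed to be already remapped to the 0..1 range.

```python
def quantize_unorm(value, bits):
    """Quantize a 0..1 float to an unsigned integer with `bits` bits."""
    max_int = (1 << bits) - 1
    clamped = min(max(value, 0.0), 1.0)
    return int(clamped * max_int + 0.5)


def pack_gbuffer_texel(normal_xy, albedo_rgb, roughness):
    """Pack into four 16-bit channels (hypothetical layout):
    R = normal.x, G = normal.y, B = albedo.r:albedo.g, A = albedo.b:roughness."""
    nx, ny = normal_xy
    r8, g8, b8 = (quantize_unorm(c, 8) for c in albedo_rgb)
    rough8 = quantize_unorm(roughness, 8)
    return (
        quantize_unorm(nx, 16),
        quantize_unorm(ny, 16),
        (r8 << 8) | g8,
        (b8 << 8) | rough8,
    )


def unpack_albedo_roughness(texel):
    """Recover (albedo.r, albedo.g, albedo.b, roughness) as 0..1 floats."""
    _, _, rg, b_rough = texel
    return (
        (rg >> 8) / 255.0,
        (rg & 0xFF) / 255.0,
        (b_rough >> 8) / 255.0,
        (b_rough & 0xFF) / 255.0,
    )


if __name__ == "__main__":
    texel = pack_gbuffer_texel((0.0, 1.0), (1.0, 0.0, 1.0), 0.0)
    print(texel)
    print(unpack_albedo_roughness(texel))
```

In a shader the same shifts and masks would be applied to the values written to and read from the uint4 render target.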
TOP 10 PERFORMANCE ADVICE
9. Batch, Batch, Batch!
‒ Add support for geometry instancing
‒ Pool & batch your updates
‒ Less important with Mantle/DirectX12
‒ Reduces draw call overhead
‒ Allows better scheduling
10. (DX11) Prefer engine threading over Deferred Contexts
‒ Deferred contexts are a software feature
‒ … or move to Mantle/DirectX12
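Geometry instancing starts with grouping objects that share the same mesh and material, so each group can be submitted as one instanced draw. A minimal Python sketch of that grouping step, assuming a scene described as (mesh, material, transform) tuples; this data model and the function name are hypothetical.

```python
from collections import defaultdict


def build_instanced_batches(objects):
    """Group (mesh, material, transform) tuples into one batch per
    mesh/material pair; each batch becomes one instanced draw call."""
    batches = defaultdict(list)
    for mesh, material, transform in objects:
        batches[(mesh, material)].append(transform)
    return dict(batches)


if __name__ == "__main__":
    scene = [
        ("rock", "stone", "T0"),
        ("rock", "stone", "T1"),
        ("tree", "bark", "T2"),
        ("rock", "stone", "T3"),
        ("tree", "bark", "T4"),
    ]
    batches = build_instanced_batches(scene)
    print(len(scene), "objects ->", len(batches), "draw calls")
    for key, transforms in batches.items():
        print(key, transforms)
```

The per-batch transform lists would then be uploaded together as instance data, which also pools the buffer updates.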
TOP 10 PERFORMANCE ADVICE
Bonus Advice
• Avoid LDS bank conflicts
‒ Accessing LDS from different threads with addresses that are 32 DWORDs apart will cause bank conflicts
‒ Unless it's the same address
• Don't use gather with offsets
‒ This will result in 4 image_gather4 instructions
Without offsets:
float4 PsExample( PsInput Input ) : SV_Target
{
    return tex.GatherCmpRed(
        g_SamplePointCmp,
        Input.vTex,
        Input.depth );
}

  image_gather4_c_lz   v0, v[2:5], s[4:11], s[12:15]
  s_waitcnt            vmcnt(0)

With offsets:
float4 PsExample( PsInput Input ) : SV_Target
{
    return tex.GatherCmpRed(
        g_SamplePointCmp,
        Input.vTex,
        Input.depth,
        int2(0,0),
        int2(1,0),
        int2(0,1),
        int2(1,1) );
}

  image_gather4_c_lz   v4, v[12:15], s[4:11], s[12:15]
  v_mov_b32            v11, 1
  image_gather4_c_lz_o v5, v[11:14], s[4:11], s[12:15]
  v_mov_b32            v11, 0x00000100
  image_gather4_c_lz_o v7, v[11:14], s[4:11], s[12:15]
  v_mov_b32            v11, 0x00000101
  image_gather4_c_lz_o v0, v[11:14], s[4:11], s[12:15]
  s_waitcnt            vmcnt(0)
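The LDS bank-conflict rule from this slide can be checked with a small model: with 32 banks, each one DWORD wide, the bank is the DWORD address modulo 32. The Python sketch below is a simplification that ignores how the hardware splits a wavefront into separate access phases. The padded stride of 33 DWORDs in the usage lines is a common workaround and is not taken from the slides.

```python
from collections import defaultdict

LDS_BANKS = 32    # from the compute unit slide
DWORD_BYTES = 4


def bank_of(byte_address):
    """LDS bank that serves a given byte address."""
    return (byte_address // DWORD_BYTES) % LDS_BANKS


def worst_bank_conflict(byte_addresses):
    """Largest number of distinct DWORDs mapped to one bank.

    1 means conflict-free; identical addresses do not count as a conflict.
    """
    per_bank = defaultdict(set)
    for address in byte_addresses:
        per_bank[bank_of(address)].add(address // DWORD_BYTES)
    return max(len(dwords) for dwords in per_bank.values())


if __name__ == "__main__":
    threads = range(32)
    print(worst_bank_conflict([t * DWORD_BYTES for t in threads]))       # stride 1
    print(worst_bank_conflict([t * 32 * DWORD_BYTES for t in threads]))  # stride 32
    print(worst_bank_conflict([t * 33 * DWORD_BYTES for t in threads]))  # padded
    print(worst_bank_conflict([0 for _ in threads]))                     # same address
```

A stride of 32 DWORDs sends every thread to the same bank, while padding each row by one DWORD spreads the accesses over all banks again.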
Questions?
[email protected]
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.