
DirectX 12: Improving Performance in Your Game

Bennett Sorbo
Program Manager, Direct3D, Windows Graphics

Agenda

• Overview
• Improving GPU efficiency
• Reducing CPU overhead
• Summary / Next Steps

Overview

• DirectX 12 provides a single API for low-level access to a variety of GPU hardware
• Enables games to leverage higher-level knowledge to achieve great performance gains
• Today, we’ll discuss best practices for specific DirectX 12 features to achieve these gains in your game

Increasing GPU efficiency

GPU Efficiency

• Three key areas for GPU-side gains
  • Explicit resource transitions
  • Parallel GPU execution
  • GPU-generated workloads

GPU Efficiency: Explicit resource transitions

• Modern GPUs require resources to be in different ‘states’ for different use cases, and require knowledge of when transitions between those states need to occur
• In DirectX 12, the app is responsible for identifying when these transitions need to occur.

• Making these transitions explicit makes it clear when operations are expensive...

GPU Efficiency: Explicit resource transitions

(cont’d)

• ... but also gives games the opportunity to eliminate unnecessary transitions. Two key opportunities:
• First, UAV synchronization is now exposed as an explicit resource barrier.

• Previously, the driver would ensure all writes to a UAV were in order of dispatch by inserting “Wait for Idle” commands after each dispatch.

Dispatch WaitForIdle Dispatch WaitForIdle Dispatch WaitForIdle Dispatch

GPU Efficiency: Explicit resource transitions

(cont’d)

• If the app has high-level knowledge that dispatches can run out of order, WaitForIdles can be removed:
  Dispatch Dispatch Dispatch WaitForIdle Dispatch
• But more importantly, dispatches can then run in parallel to achieve higher GPU occupancy:
  Dispatch Dispatch Dispatch WaitForIdle Dispatch
• Particularly beneficial for large numbers of dispatches with low thread counts
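The difference between the two command streams can be sketched with a small model. This is not Direct3D 12 code: `Dispatch`, `readsPreviousUavWrites`, `Cmd`, and both builder functions are hypothetical names invented for illustration, and `WaitForIdle` stands in for whatever the driver emits for a UAV barrier (in the shipping API, a `ResourceBarrier` call with `D3D12_RESOURCE_BARRIER_TYPE_UAV`).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model, not the D3D12 API: one entry per compute dispatch.
struct Dispatch {
    bool readsPreviousUavWrites; // app-level knowledge about ordering
};

enum class Cmd { Dispatch, WaitForIdle };

// Conservative stream: a wait between every pair of dispatches,
// as a driver must do when it knows nothing about dependencies.
std::vector<Cmd> buildConservative(const std::vector<Dispatch>& work) {
    std::vector<Cmd> out;
    for (std::size_t i = 0; i < work.size(); ++i) {
        out.push_back(Cmd::Dispatch);
        if (i + 1 < work.size()) out.push_back(Cmd::WaitForIdle);
    }
    return out;
}

// Explicit stream: a wait only in front of a dispatch that
// declares a dependency on earlier UAV writes.
std::vector<Cmd> buildExplicit(const std::vector<Dispatch>& work) {
    std::vector<Cmd> out;
    for (std::size_t i = 0; i < work.size(); ++i) {
        if (i > 0 && work[i].readsPreviousUavWrites) out.push_back(Cmd::WaitForIdle);
        out.push_back(Cmd::Dispatch);
    }
    return out;
}

// Counts the waits in a command stream.
std::size_t countWaits(const std::vector<Cmd>& stream) {
    std::size_t n = 0;
    for (Cmd c : stream) n += (c == Cmd::WaitForIdle) ? 1 : 0;
    return n;
}
```

For four dispatches where only the last one depends on earlier results, the conservative builder emits three waits and the explicit builder emits one, giving the "Dispatch Dispatch Dispatch WaitForIdle Dispatch" shape from the slide.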

GPU Efficiency: Explicit resource transitions

(cont’d)

• Second, the ResourceBarrier API allows the application to perform transitions over a period of time.

• The app specifies starting/destination states at ‘begin’ and ‘end’ ResourceBarrier calls, and promises not to use the resource while it is in transition.

• Driver can use this information to eliminate redundant pipeline stalls, cache flushes
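The begin/end contract can be sketched as a small validity tracker. Everything here (`TrackedResource`, `State`, and the method names) is a hypothetical model rather than the Direct3D 12 API; in the shipping API the two halves correspond to transition barriers flagged `D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY` and `D3D12_RESOURCE_BARRIER_FLAG_END_ONLY`.

```cpp
#include <cassert>

// Hypothetical model of a split transition; not the D3D12 API.
enum class State { RenderTarget, ShaderResource };

class TrackedResource {
public:
    explicit TrackedResource(State initial) : state_(initial) {}

    // 'begin' half: records the intended transition. From here until
    // the matching 'end', the resource must not be used.
    bool beginTransition(State from, State to) {
        if (inTransition_ || state_ != from) return false;
        pendingTo_ = to;
        inTransition_ = true;
        return true;
    }

    // 'end' half: must name the same states as the 'begin' half.
    bool endTransition(State from, State to) {
        if (!inTransition_ || state_ != from || pendingTo_ != to) return false;
        state_ = to;
        inTransition_ = false;
        return true;
    }

    // A use is valid only outside a transition and in the required state.
    bool canUseAs(State required) const {
        return !inTransition_ && state_ == required;
    }

private:
    State state_;
    State pendingTo_ = State::RenderTarget; // meaningful only while inTransition_
    bool inTransition_ = false;
};
```

Between `beginTransition` and `endTransition` the model rejects every use of the resource, which is the promise the app makes; that window is what gives the driver room to schedule the stall or flush where it is cheapest.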

GPU Efficiency: Explicit resource transitions

(cont’d)

• Example rendering scenario (before)
  API Calls:
    Draw call that renders to Tex1
    Resource Barrier (Tex1) Render Target -> SRV
    …
    SetDescriptorHeap
    Bind Tex1 as SRV, sample in Draw call
  Hardware Commands:
    Driver emits ‘WaitForIdle’ command at the Resource Barrier

• Example rendering scenario (after)
  API Calls:
    Draw call that renders to Tex1
    Resource Barrier (Tex1) Render Target -> SRV BEGIN
    …
    SetDescriptorHeap
    Resource Barrier (Tex1) Render Target -> SRV END
    Bind Tex1 as SRV, sample in Draw call
  Hardware Commands:
    Driver emits ‘WaitForIdle’ command, which can now be deferred until the END barrier

GPU Efficiency: Parallel GPU execution

• Modern hardware has the ability to run multiple workloads in parallel on multiple ‘engines’
• DirectX 12 allows games to target engines explicitly. The developer knows best which operations can happen in parallel and what the dependencies are
• Three engine types exposed in DirectX 12: 3D, Compute, Copy
• It is up to the app to know and manage dependencies between queues
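Managing a dependency between two queues can be sketched with a toy scheduler. This is a deterministic model, not Direct3D 12 code: `Op`, `OpKind`, and `run` are hypothetical, a single shared counter stands in for a fence, and the `Signal`/`Wait` ops only mimic the roles of `ID3D12CommandQueue::Signal` and `ID3D12CommandQueue::Wait`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of queue operations; not the D3D12 API.
enum class OpKind { Work, Signal, Wait };

struct Op {
    OpKind kind;
    std::string name;         // used by Work ops only
    std::uint64_t fenceValue; // used by Signal and Wait ops only
};

// Steps the queues round-robin, one op per queue per pass. A queue
// stalls at a Wait until the shared fence has reached the requested
// value. Returns the names of Work ops in the order they ran; stops
// when every queue is finished or stuck.
std::vector<std::string> run(const std::vector<std::vector<Op>>& queues) {
    std::vector<std::size_t> cursor(queues.size(), 0);
    std::uint64_t fence = 0;
    std::vector<std::string> order;
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (std::size_t q = 0; q < queues.size(); ++q) {
            if (cursor[q] >= queues[q].size()) continue; // queue drained
            const Op& op = queues[q][cursor[q]];
            if (op.kind == OpKind::Wait && fence < op.fenceValue) continue; // stalled
            if (op.kind == OpKind::Work) order.push_back(op.name);
            if (op.kind == OpKind::Signal && op.fenceValue > fence) fence = op.fenceValue;
            ++cursor[q];
            progressed = true;
        }
    }
    return order;
}
```

With a 3D queue of {Work "shadow pass", Wait 1, Work "draw with streamed texture"} and a copy queue of {Work "upload texture", Signal 1}, the shadow pass and the upload proceed side by side, and only the draw that samples the uploaded texture is held back until the copy queue signals.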

GPU Efficiency: Parallel GPU execution

(cont’d)

• The copy engine type is great for moving data around without blocking/interrupting the main 3D engine.
• Two notable use cases:
  • Texture streaming
  • ‘Lazy’ CPU readback
• Especially great if going across PCI-E
• Demo

GPU Efficiency: Parallel GPU execution

< GPUView comparison between serial/parallel execution >

GPU Efficiency: Parallel GPU execution

(cont’d)

• Really excited about compute engine scenarios as well
• Two notable use cases:
  • Long-running, low priority compute work
  • Tightly interleaved 3D/Compute work within a frame
• Get the gain from running different types of workloads that stress different parts of the GPU
• Canonical example: compute-heavy dispatches during shadow map generation.

GPU Efficiency: GPU-generated workloads

• ID3D11Asynchronous -> ID3D12QueryHeap
• ID3D12CommandList::ResolveQueryData(
    ID3D12QueryHeap *pQueryHeap,
    D3D12_QUERY_TYPE Type,
    UINT StartElement,
    UINT ElementCount,
    ID3D12Resource *pDestinationBuffer,
    UINT64 AlignedDestinationBufferOffset )
• Two key performance opportunities:
  • Binary occlusion
  • Batched query ‘resolve’ operations

GPU Efficiency: GPU-generated workloads

(cont’d)

• Predication has also been generalized
• ID3D12CommandList::SetPredication(
    ID3D12Resource *pBuffer,
    UINT64 AlignedBufferOffset,
    D3D12_PREDICATION_OP Operation )
• Predicate on general buffer: query-derived, CPU-populated, GPU-populated – enables new rendering scenarios

GPU Efficiency: GPU-generated workloads

(cont’d)

• ExecuteIndirect – powerful new API for executing GPU-generated Draw/Dispatch workloads
• Broad hardware compatibility
• Can vary the following between invocations:
  • Vertex/Index buffers
  • Root constants, inline SRV/UAV/CBV descriptors
• Enables new scenarios, dramatic efficiency improvements

GPU Efficiency: GPU-generated workloads

(cont’d)

• Demo
• Always going to be very efficient: two ways to maximize
  • Set a proper ‘max count’, or just use CPU count.
  • Group these together; ideally, put space between generation and consumption of arguments.
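How the ‘max count’ interacts with a GPU-written count can be sketched with a CPU-side model. None of this is Direct3D 12 code: `Object`, `DrawArgs`, `generateArgs`, and `executeIndirectModel` are hypothetical stand-ins. The one behavior borrowed from the shipping API is that, when a count buffer is supplied, the number of commands executed is the smaller of the count buffer value and the maximum command count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified argument record; not a D3D12 structure.
struct DrawArgs {
    std::uint32_t vertexCount;
    std::uint32_t instanceCount;
};

struct Object {
    std::uint32_t vertexCount;
    bool visible;
};

// Stand-in for a GPU culling pass: compacts visible objects into the
// argument buffer and returns the value it would write to the count buffer.
std::uint32_t generateArgs(const std::vector<Object>& objects,
                           std::vector<DrawArgs>& argBuffer) {
    argBuffer.clear();
    for (const Object& o : objects) {
        if (o.visible) argBuffer.push_back({o.vertexCount, 1});
    }
    return static_cast<std::uint32_t>(argBuffer.size());
}

// Stand-in for consumption: executes min(count, maxCommandCount) draws,
// or maxCommandCount draws when no count buffer is given.
// Returns the total number of vertices submitted.
std::uint64_t executeIndirectModel(const std::vector<DrawArgs>& argBuffer,
                                   std::uint32_t maxCommandCount,
                                   const std::uint32_t* countBuffer) {
    std::uint32_t n = countBuffer ? std::min(*countBuffer, maxCommandCount)
                                  : maxCommandCount;
    // Model-only guard so the sketch never reads past the vector.
    n = std::min<std::uint32_t>(n, static_cast<std::uint32_t>(argBuffer.size()));
    std::uint64_t vertices = 0;
    for (std::uint32_t i = 0; i < n; ++i) {
        vertices += std::uint64_t(argBuffer[i].vertexCount) * argBuffer[i].instanceCount;
    }
    return vertices;
}
```

In this model a tight maximum bounds the work the consumer can ever be asked to do, while the count written by the generation pass decides how much of that bound is actually used.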

Reducing CPU Overhead

CPU Overhead

• Many improvements just for showing up:
  • No high-frequency ref-counting
  • No hazard tracking
  • No state shadowing
• Three other opportunities to take advantage of:
  • Resource Binding
  • Multi-threading
  • Memory allocation

CPU Overhead: Resource Binding

• What’s new:
  • Descriptor Heap access
  • Root Signatures
• Descriptor Heap: Actual GPU memory that contains resource access metadata
• Root Signature: Binding parameters that can be passed to a shader invocation. Can contain:
  • Location in descriptor heap
  • ‘Inline’ descriptors
  • Actual constant data

CPU Overhead: Resource Binding

(cont’d)

• Descriptor Heap best practices
  • Do: keep your descriptor heap as static as possible.
  • Avoid: frequently changing descriptor heaps.
• Root Signature best practices
  • Do: keep your root signature small
  • Do: take advantage of inline descriptors/data
  • Avoid: binding unnecessary pipeline stages
• This is an area where you can move the needle on CPU performance – take advantage of the new flexibility here.
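One way to keep an eye on root signature size is to budget it in DWORDs. The sketch below assumes the costs documented for shipping Direct3D 12 (descriptor table: 1 DWORD, root descriptor: 2 DWORDs, root constant: 1 DWORD per 32-bit value, 64 DWORDs maximum); the talk itself does not state these numbers, and `RootParam`, `RootParamKind`, and the helper functions are hypothetical names rather than API types.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of a root parameter; not a D3D12 structure.
enum class RootParamKind { DescriptorTable, RootDescriptor, Constants };

struct RootParam {
    RootParamKind kind;
    std::uint32_t num32BitValues; // used by Constants only
};

// Assumed limit, taken from the shipping D3D12 documentation.
constexpr std::uint32_t kMaxRootSignatureDwords = 64;

// Cost of one parameter in DWORDs under the assumed cost table.
std::uint32_t costInDwords(const RootParam& p) {
    switch (p.kind) {
        case RootParamKind::DescriptorTable: return 1; // offset into a descriptor heap
        case RootParamKind::RootDescriptor:  return 2; // inline 64-bit GPU address
        case RootParamKind::Constants:       return p.num32BitValues;
    }
    return 0;
}

// Sums the cost of every parameter in the layout.
std::uint32_t totalCost(const std::vector<RootParam>& params) {
    std::uint32_t total = 0;
    for (const RootParam& p : params) total += costInDwords(p);
    return total;
}

// True when the layout stays within the assumed 64-DWORD budget.
bool fitsInRootSignature(const std::vector<RootParam>& params) {
    return totalCost(params) <= kMaxRootSignatureDwords;
}
```

A layout with one descriptor table, one root descriptor, and four root constants costs 7 DWORDs, which leaves plenty of headroom; the helper makes it easy to see when inline data starts to crowd the budget.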

CPU Overhead: Multi-threading

• In DirectX 11, driver created background thread outside app control.

• In DirectX 12, multi-threading is app-controlled, first-class citizen via ID3D12CommandList.

• Not just command lists: you can create PSO and buffers/textures on background threads.

• Recommendation: Serial workload? Create own background submission thread.

CPU Overhead: Resource allocation

• In DirectX 11, driver-managed versioning, sub-allocation behind app’s back.

• DirectX 12 provides tools like fences, resource placement to put apps in charge. Persistently-mapped resources.

• Recommendations:
  • Use appropriate number of fences
  • Expire resources based on engine knowledge
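Expiring resources against a fence can be sketched as a deferred-release queue. This is a minimal model under stated assumptions: `DeferredReleaseQueue` and `Retired` are hypothetical, resources are represented by names only, and `completedValue` stands for whatever the app reads back from its fence (for example via `ID3D12Fence::GetCompletedValue`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: a resource the CPU is done with, plus the fence
// value after which the GPU is known to be done with it too.
struct Retired {
    std::uint64_t fenceValue;
    std::string name;
};

class DeferredReleaseQueue {
public:
    // Queue a resource for release. Fence values must be non-decreasing,
    // which holds when one value is used per submitted frame.
    void retire(std::string name, std::uint64_t fenceValue) {
        assert(pending_.empty() || pending_.back().fenceValue <= fenceValue);
        pending_.push_back({fenceValue, std::move(name)});
    }

    // Release everything the GPU has finished with and return the names.
    std::vector<std::string> collect(std::uint64_t completedValue) {
        std::vector<std::string> freed;
        while (!pending_.empty() && pending_.front().fenceValue <= completedValue) {
            freed.push_back(std::move(pending_.front().name));
            pending_.pop_front();
        }
        return freed;
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    std::deque<Retired> pending_;
};
```

Because many resources share one fence value, a single fence per frame is enough to expire all of that frame's resources at once, which is one reading of "use appropriate number of fences".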

Ashes of the Singularity

case study

Dan Baker Graphics Architect, Oxide Games

Resource Binding in Nitrous

• Nitrous designed from the start to map to hardware binding models
• Three key engine design points:
  • Textures pre-grouped in descriptor heap
  • Bindings shared across shader stages – fewer bind calls
  • Built around Static Samplers
• Findings:
  • Easy to stay within one descriptor heap/frame
  • Important to avoid redundant state sets
  • Optional usage of Root CBVs can provide a win
• Result: resource binding overhead is a fraction of what it is on D3D11

Resource Management in Nitrous

• Nitrous also benefits from more explicit resource management
• Two classes of resources:
  • Formally tracked, persistent resources
  • Temporary, frame-specific resources
• Frame-specific resources linearly allocated out of heap, with no resource tracking – minimal overhead
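Linear allocation of frame-specific resources can be sketched as a bump allocator over heap offsets. This is an illustrative model, not Nitrous code: `LinearFrameAllocator`, `kInvalidOffset`, and the method names are hypothetical, and the allocator only hands out offsets, leaving the actual heap and resource placement to the caller.

```cpp
#include <cassert>
#include <cstdint>

// Returned when a request does not fit in the remaining space.
constexpr std::uint64_t kInvalidOffset = ~std::uint64_t{0};

// Hypothetical bump allocator for per-frame, untracked allocations.
class LinearFrameAllocator {
public:
    explicit LinearFrameAllocator(std::uint64_t capacity) : capacity_(capacity) {}

    // Returns an aligned offset into the heap, or kInvalidOffset when full.
    // 'alignment' must be a power of two.
    std::uint64_t allocate(std::uint64_t size, std::uint64_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uint64_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned > capacity_ || size > capacity_ - aligned) return kInvalidOffset;
        offset_ = aligned + size;
        return aligned;
    }

    // Frees every allocation at once. Only safe after the GPU has
    // finished the frame that used them (for example, once its fence
    // has been reached).
    void reset() { offset_ = 0; }

    std::uint64_t used() const { return offset_; }

private:
    std::uint64_t capacity_;
    std::uint64_t offset_ = 0;
};
```

Allocation is a few integer operations and freeing is a single reset, which is where the "minimal overhead" comes from; the cost is that individual allocations cannot be released early.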

Demo

Conclusion

• Many opportunities with DirectX 12 to achieve dramatic performance improvements in your game
• Get started today!
• Enroll in the Early Access program at http://aka.ms/dxeap to receive the latest SDK, DirectX 12 drivers, documentation, …
• Check out Channel9 for previous DirectX 12 talks
• Q/A

© 2015 Microsoft Corporation. All rights reserved. Microsoft, Xbox, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Backup

• < Would need to explain ‘residency’, how this worked in DX11 >
• < WDDM2 residency management provides flexibility/performance. >
• < Don’t need to track resource usage/frame if memory usage isn’t a concern – keep it all resident. >