GDC 2005 - Home

Getting The Best Out Of D3D12
Evan Hart, Principal Engineer, NVIDIA
Dave Oldcorn, D3D12 Technical Lead, AMD
Prerequisites
● An interest in D3D12
  • Ideally, already looked at D3D12
● Experienced Graphics Programmer
● Console programming experience
  • Beneficial, not required
Brief D3D12 Overview
The ‘What’ of D3D12
● Broad rethinking of the API
● Much closer to HW realities
● Model is more explicit
  • Less driver magic
“With great power comes great responsibility.”
● D3D12 answers many developer requests
● Be ready to use it wisely and it can reward you
Console Vs PC
● D3D12 offers a great porting story
  • More of the explicit control console devs crave
  • Much less driver interference
● Still a heterogeneous environment
  • Need to test carefully
  • Heed API and tool warnings (exposed corners)
  • Game will run on HW you never tested
Central Objects to D3D12
● Command Lists
● Bundles
● Pipeline State Objects
● Root Signature and Descriptor Tables
● Resource Heaps
Using Bundles And Lists
[Diagram with the labels Dispatch, Draw, Bundle, Command List, and Frame]
Command Lists & Bundles
● Bundle
  • Small object recording a few commands
  • Great for reuse, but a subset of commands
  • Like drawing 3 meshes in an object
● Command List
  • Useful for recording/submitting commands
  • Used to execute bundles and other commands
Pipeline State Object
● Collates most render state
  • Shaders, raster, blend
● All packaged and swapped together
Pipeline State Object
[Diagram: a single Pipeline State grouping the Vertex, Hull, Domain, Geometry, Pixel, and Compute Shaders with Input Layout, Topology, Rasterizer State, Blend State, Depth State, and RT Format]
Root Signature & Descriptor Tables
● New method for resource setting
● Flexible interface
  • Methods for changing large blocks
  • Methods for small bits quickly
  • Indexing and open-ended tables enable “bindless”-like behaviour
Resource Heaps
● New memory management primitive
● Tie multiple related resources into one heap
● App controls residency on the heap
  • Somewhat coarse
● Enables console-like memory aliasing
New HW Features
● Conservative Rasterization
● Raster Ordered Views
● Typed UAV
● PS write of stencil reference
● Volume tiled resources
Advice for the D3D12 Dev
Practical Developer Advice
● Small nuggets on key issues
● Advice is from experience
  • Multiple engines have done trial ports
  • Many months of experimentation
    • Driver, API, and app level
Efficient Submission
● Record commands in parallel
● Reuse fragments via bundles
● Taking over some driver/runtime work
  • Make sure your code is efficient (and parallel)
● Submit in batches with ExecuteCommandLists
  • Submit throughout the frame
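A minimal sketch of the record-in-parallel, submit-in-batches pattern from this slide, using only the C++ standard library. `RecordedList`, `record_pass`, `record_frame`, and `submit_batch` are hypothetical stand-ins, not D3D12 types; in a real engine the batch would go to `ID3D12CommandQueue::ExecuteCommandLists`, and a thread pool would likely replace the one-thread-per-pass loop.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Stand-in for a recorded command list (hypothetical; a real engine would
// wrap an ID3D12GraphicsCommandList here).
struct RecordedList {
    std::vector<std::string> commands;
};

// Records one pass. Each worker owns its list, so recording needs no locks.
RecordedList record_pass(std::size_t pass, std::size_t draws) {
    RecordedList list;
    for (std::size_t i = 0; i < draws; ++i)
        list.commands.push_back("pass" + std::to_string(pass) +
                                ":draw" + std::to_string(i));
    return list;
}

// Records every pass on its own thread and returns the lists in pass order.
std::vector<RecordedList> record_frame(std::size_t passes, std::size_t draws) {
    std::vector<RecordedList> lists(passes);
    std::vector<std::thread> workers;
    for (std::size_t p = 0; p < passes; ++p)
        workers.emplace_back([&lists, p, draws] { lists[p] = record_pass(p, draws); });
    for (std::thread& w : workers) w.join();
    return lists;
}

// Models one batched submission: the queue receives the lists in array order.
// Returns the number of lists submitted by this single call.
std::size_t submit_batch(const std::vector<RecordedList>& lists,
                         std::vector<std::string>& queue) {
    assert(!lists.empty());  // an empty batch would be a wasted submission
    for (const RecordedList& l : lists)
        queue.insert(queue.end(), l.commands.begin(), l.commands.end());
    return lists.size();
}
```

Recording order across threads is arbitrary, but submission order is fixed by the position of each list in the batch, which is what keeps the frame deterministic.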
Engine organisation
● Consider task oriented engines
  • Divide rendering into tasks
  • Run CPU tasks to build command lists
  • Use dependencies to order GPU submission
  • Also helps with resource barriers
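One way to "use dependencies to order GPU submission" is a topological sort over the render tasks. The sketch below uses Kahn's algorithm; `RenderTask` and `submission_order` are hypothetical names, and the slides do not say which scheduling scheme the trial ports actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical render task: `deps` holds the indices of tasks whose GPU work
// must be submitted before this one.
struct RenderTask {
    const char* name;
    std::vector<std::size_t> deps;
};

// Returns a submission order that respects every dependency (Kahn's algorithm).
// A result shorter than tasks.size() means the graph contains a cycle.
std::vector<std::size_t> submission_order(const std::vector<RenderTask>& tasks) {
    std::vector<std::size_t> pending(tasks.size(), 0);
    std::vector<std::vector<std::size_t>> dependents(tasks.size());
    for (std::size_t t = 0; t < tasks.size(); ++t) {
        pending[t] = tasks[t].deps.size();
        for (std::size_t d : tasks[t].deps) {
            assert(d < tasks.size());  // a dependency must name an existing task
            dependents[d].push_back(t);
        }
    }
    std::queue<std::size_t> ready;
    for (std::size_t t = 0; t < tasks.size(); ++t)
        if (pending[t] == 0) ready.push(t);

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        std::size_t t = ready.front();
        ready.pop();
        order.push_back(t);
        // Finishing t may unblock the tasks that were waiting on it.
        for (std::size_t n : dependents[t])
            if (--pending[n] == 0) ready.push(n);
    }
    return order;
}
```

The same dependency edges are a natural place to attach resource barriers, since each edge marks a point where one task's output becomes another task's input.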
Threading: Done Badly
[Timeline diagram: a Game Thread and three Aux Threads alongside one Render Thread that carries Command List 0, Submit, Create Resource, Command List 1, Submit, and Present]
App render code, runtime, driver all on one!
Threading: Done Well
[Timeline diagram with lanes Game Thread, Async Thread, Worker Thread, and Master Render Thread; blocks are Create Resource (twice), Compile PSO, Command List 0 to Command List 3, Submit CL0 to Submit CL3, and Present]
Many solutions, key is parallelism!
PSO Practicalities
● Merged state removes driver validation costs
● Don’t needlessly thrash state
  • Just because it is a PSO, doesn’t mean every state needs to flip in HW
    • Avoid toggling compute/graphics
    • Avoid toggling tessellation
  • Use sensible defaults for don’t care fields
Creating PSOs
● PSO creation can be costly
  • Probably means a compile
● Streaming threads should handle PSO
  • Gather state and create on async threads
  • Prevents stalls
  • Can handle specializations too
Deferred PSO Update
● “Quick first compile; better answer later”
  • Simple / generic / free initial shader
  • Start the compile of the better result
  • Substitute PSO when it’s ready
● Generic / specialized especially useful
  • Precompile the generic case
  • More optimal path for special cases, compiled on low priority thread
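The deferred update can be sketched with `std::async` and a non-blocking poll at draw time. `Pso`, `compile_pso`, and `DeferredPso` are hypothetical stand-ins for an `ID3D12PipelineState` and the engine code around it, and `std::async` gives no control over thread priority, so the "low priority thread" part is not modelled here.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <string>
#include <utility>

// Stand-in for a compiled pipeline state (hypothetical type).
struct Pso {
    std::string variant;
};

// Pretend compile step; in practice this is the expensive call to keep off
// the render thread.
Pso compile_pso(const std::string& variant) { return Pso{variant}; }

class DeferredPso {
public:
    // Starts with a precompiled generic PSO and launches the specialized
    // compile asynchronously.
    DeferredPso(Pso generic, std::string specialized_variant)
        : current_(std::move(generic)),
          pending_(std::async(std::launch::async, compile_pso,
                              std::move(specialized_variant))) {
        assert(pending_.valid());
    }

    // Called at draw time: never blocks, and swaps in the specialized PSO
    // the first time it is found to be ready.
    const Pso& get() {
        if (pending_.valid() &&
            pending_.wait_for(std::chrono::seconds(0)) == std::future_status::ready)
            current_ = pending_.get();
        return current_;
    }

    // For shutdown or tests only: block until the specialized compile is done.
    void wait() {
        if (pending_.valid()) pending_.wait();
    }

private:
    Pso current_;
    std::future<Pso> pending_;
};
```

Until the compile finishes, `get()` keeps returning the generic PSO, so the first frames render with the slower but correct path.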
Bundle Advice
● Aim for a moderate size (~12 draws)
  • Some potential overhead with setup
● Limit resource binding inheritance when possible
  • Enables more complete cooking of bundle
Lists Advice
● Aim for a decent size
  • Typically hundreds of draw calls
● Submit together when feasible
● Don’t expect lots of list reuse
  • Per-frame changes + overlap limitation
  • Post-processing might be an exception
    • Still need 2-3 copies of that list
Using Command Allocators
Allocators and Lists
● Invisible consumers of GPU memory
● Hold on to memory until Destroy
● Reuse on similar data
  • Warm list == no allocation during list creation
● Destroy on different data
  • Reuse on disparate cases grows all lists to size of worst case over time
List / Allocator memory usage
[Diagram: allocator memory for an initial 100-draw list and, after Reset, for the same 100 draws (guaranteed no new allocations), 5 draws, a different 100 draws, and 200 draws]
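The growth behaviour in this diagram can be approximated with a toy high-water-mark model. `ModelAllocator` is hypothetical and is not the D3D12 allocator; it assumes every draw costs the same amount of memory, which is a simplification (the slide only guarantees "no new allocations" when the same draws are re-recorded).

```cpp
#include <cassert>
#include <cstddef>

// Toy model: an allocator that keeps its memory across Reset and only grows
// when a recording needs more than it already holds.
class ModelAllocator {
public:
    // Records `draws` draws after a reset and returns how many draw-sized
    // blocks had to be newly allocated.
    std::size_t reset_and_record(std::size_t draws) {
        std::size_t grown = draws > capacity_ ? draws - capacity_ : 0;
        capacity_ += grown;
        assert(capacity_ >= draws);  // memory is never released on reset
        return grown;
    }

    // Memory currently held, in units of "one draw".
    std::size_t capacity() const { return capacity_; }

private:
    std::size_t capacity_ = 0;
};
```

Under this model a 5-draw list recorded on a warm 100-draw allocator still holds 100 draws of memory, which is why reuse across disparate workloads drifts towards the worst case.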
Allocator Advice
● Allocators are fastest when warm
  • Keep reusing allocator with lists of equal size
● Need 2T + N allocators minimum
  • T -> threads creating command lists
  • N -> extra pool for bundles
  • All lists/bundles on an allocator freed together
    • Need to double/triple buffer for reusing the allocators
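The 2T + N count falls out of giving each recording thread one allocator per frame in flight. The sketch below is one possible indexing scheme; `AllocatorPool` and its fields are hypothetical, and the GPU fence that must be waited on before a slot is reset is left out.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical pool layout: `frames_in_flight` sets of per-thread allocators,
// followed by `bundle_allocators` long-lived allocators for bundles.
struct AllocatorPool {
    std::size_t threads;            // T: threads creating command lists
    std::size_t bundle_allocators;  // N: extra pool for bundles
    std::size_t frames_in_flight;   // 2 for double buffering, 3 for triple

    // Total allocators needed; equals 2T + N when double buffering.
    std::size_t total() const {
        return frames_in_flight * threads + bundle_allocators;
    }

    // Slot used by `thread` while recording `frame`. The same slot comes
    // round again every `frames_in_flight` frames.
    std::size_t slot(std::size_t frame, std::size_t thread) const {
        assert(thread < threads);
        return (frame % frames_in_flight) * threads + thread;
    }
};
```

For example, `AllocatorPool{4, 2, 2}` needs 10 allocators, and thread 1 alternates between slots 1 and 5 on even and odd frames.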
Root Signature
● Carefully lay out root signature
  • Group tables by frequency of change
  • Most frequent changes early in signature
● Standardize slots
  • Signature change costs
[Diagram: example root signature with Per-Draw, Per-Material, and Per-Frame sections. Entries shown: table pointers to textures (Tex) and constant buffers (shader params; camera, eye...), a constant buffer pointer (modelview matrix, skinning), and per-draw constants]
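The layout rule above (group by frequency of change, most frequent first) can be sketched as a stable sort over the root parameters. `Frequency`, `RootParam`, `lay_out`, and `slot_of` are hypothetical names for illustration; a real root signature would be described with `D3D12_ROOT_PARAMETER` entries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// How often a root parameter is expected to change; lower values change more often.
enum class Frequency { PerDraw = 0, PerMaterial = 1, PerFrame = 2 };

struct RootParam {
    std::string name;
    Frequency changes;
};

// Orders parameters so the most frequently changed ones come first.
// stable_sort keeps the declared order within each frequency group.
std::vector<RootParam> lay_out(std::vector<RootParam> params) {
    std::stable_sort(params.begin(), params.end(),
                     [](const RootParam& a, const RootParam& b) {
                         return a.changes < b.changes;
                     });
    return params;
}

// Index of a named parameter in the final layout; the name must exist.
std::size_t slot_of(const std::vector<RootParam>& layout, const std::string& name) {
    auto it = std::find_if(layout.begin(), layout.end(),
                           [&name](const RootParam& p) { return p.name == name; });
    assert(it != layout.end());
    return static_cast<std::size_t>(it - layout.begin());
}
```

Running every signature through the same `lay_out` step is one way to get standardized slots: per-frame data lands in the same trailing position across signatures that share the same parameter set.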
Root Signature Cont’d
● Place single items which change per-draw in the root arguments
● Costs of setting new table vary across HW
  • Cost varies from nearly 0 to O(N) work where N is items in table
● Avoid changes to individual items in tables
  • Requires app to instance table if in flight
  • Try to update whole table atomically
Managing Resources with Heaps
● Committed
  • Monolithic, D3D11-style
● Placed
  • Offset in existing heap
● Reserved
  • Mapped to heaps like tiled resources
[Diagrams: each type drawn against a Heap; labels include Resource [VA], G-buffer, and Postprocess buffer]
Choosing a resource type:
● Committed
  • Need per-resource residency
  • Don’t need aliasing
● Placed
  • Cheaper create / destroy
  • Can group in heaps of similar residency
  • Want to alias over others
  • Small resources
● Tiled / Reserved
  • Need flexibility of memory management
  • Can tolerate CPU and GPU overheads of ResourceMap
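The table above can be read as a small decision function. The sketch below is one possible encoding; `Needs`, `ResourceKind`, and `choose_kind` are hypothetical, the priority order between rows is an assumption, and the "cheaper create / destroy" and "group in heaps" criteria are left out for brevity.

```cpp
#include <cassert>

enum class ResourceKind { Committed, Placed, Reserved };

// What the app needs from a resource (hypothetical summary of the table rows).
struct Needs {
    bool per_resource_residency;  // wants to make this one resource resident or not
    bool aliasing;                // wants to overlap it with other resources
    bool small;                   // small enough that a dedicated allocation is wasteful
    bool flexible_mapping;        // wants tile-level control over backing memory
};

ResourceKind choose_kind(const Needs& n) {
    // The table lists these under different rows; treating them as mutually
    // exclusive is an assumption of this sketch.
    assert(!(n.per_resource_residency && n.aliasing));
    if (n.flexible_mapping) return ResourceKind::Reserved;
    if (n.per_resource_residency) return ResourceKind::Committed;
    if (n.aliasing || n.small) return ResourceKind::Placed;
    return ResourceKind::Committed;  // default: the monolithic, D3D11-style path
}
```

A G-buffer that should alias with a postprocess buffer would come out as Placed, while a large streaming texture that needs its own residency control would come out as Committed.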
Resource tips
● Committed gives driver more knowledge
● Tiled resources have separate caps
  • Need to prepare for HW without it
● Memory might be segmented
  • Cannot allocate entire space in a single heap
Residency tips
● MakeResident:
  • Batch these up
  • Expect CPU and GPU cost for page table updates
● MakeUnresident
  • Cost of move may be deferred; may be seen on future MakeResident
Working Set Management
● Application has much more control in D3D12
● Directly tells the video memory manager which resources are required
● App can be sharper on memory than before
  • On D3D11, working set per frame typically much smaller than registered resource
  • Less likely to end up with object in slow memory
Working to a budget
● “Budget” is the memory you can use
● Get under the budget using residency
  • MakeUnresident makes an object a candidate to swap to system memory
  • It is much cheaper to make an object unresident and later resident again than to destroy and re-create it
● Tiled resources can drop mip levels dynamically
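A minimal sketch of getting under budget by residency: pick the least recently used resident resources until the total fits. `TrackedResource` and `trim_to_budget` are hypothetical, and the least-recently-used policy is an assumption rather than something the slides prescribe. In the released API the corresponding calls are `ID3D12Device::MakeResident` and `ID3D12Device::Evict` (the slides' MakeUnresident), and the returned ids are meant to be passed to one batched call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// App-side bookkeeping for one resource (hypothetical).
struct TrackedResource {
    int id;
    std::uint64_t bytes;
    std::uint64_t last_used_frame;
    bool resident;
};

// Marks least-recently-used resources non-resident until the resident total
// fits the budget. Resources used in the current frame are never chosen.
// Returns the ids to evict so the caller can batch them into one call.
std::vector<int> trim_to_budget(std::vector<TrackedResource>& resources,
                                std::uint64_t budget_bytes,
                                std::uint64_t current_frame) {
    std::uint64_t used = 0;
    std::vector<TrackedResource*> idle;
    for (TrackedResource& r : resources) {
        if (!r.resident) continue;
        used += r.bytes;
        if (r.last_used_frame < current_frame) idle.push_back(&r);
    }
    // Oldest first.
    std::sort(idle.begin(), idle.end(),
              [](const TrackedResource* a, const TrackedResource* b) {
                  return a->last_used_frame < b->last_used_frame;
              });
    std::vector<int> to_evict;
    for (TrackedResource* r : idle) {
        if (used <= budget_bytes) break;
        assert(used >= r->bytes);
        r->resident = false;
        used -= r->bytes;
        to_evict.push_back(r->id);
    }
    return to_evict;
}
```

If the current frame's working set alone exceeds the budget, this sketch returns without reaching it; a real engine would then need to drop quality, for example by lowering mip levels as the last bullet suggests.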
Barriers & Hazards
● Most objects stay in one state from creation
  • Don’t insert redundant barriers
● Always specify the right set of target units
  • Allows for minimal barrier
● Group barriers into same Barrier call
  • Will take the worst case of all, rather than potentially incurring multiple sequential barriers
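Two of the points above, skipping redundant barriers and grouping the rest into one call, can be sketched with a small state tracker. `State`, `Transition`, and `BarrierBatcher` are hypothetical, and the assumption that untracked resources start in `Common` is for illustration only. In D3D12 the flushed batch would map onto a single `ID3D12GraphicsCommandList::ResourceBarrier` call with an array of barriers.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

enum class State { Common, RenderTarget, ShaderResource, UnorderedAccess };

struct Transition {
    int resource;
    State before;
    State after;
};

class BarrierBatcher {
public:
    // Queues a transition unless the resource is already in the wanted state.
    // Repeated requests for one resource are merged into a single transition.
    void require(int resource, State wanted) {
        auto it = current_.find(resource);
        if (it == current_.end())
            it = current_.emplace(resource, State::Common).first;
        State& now = it->second;
        if (now == wanted) return;  // redundant barrier, skip it
        for (Transition& t : pending_) {
            if (t.resource == resource) {
                assert(t.after == now);  // pending target always matches tracked state
                t.after = wanted;
                now = wanted;
                return;
            }
        }
        pending_.push_back(Transition{resource, now, wanted});
        now = wanted;
    }

    // Returns all queued transitions as one batch, dropping any that were
    // merged back to their starting state.
    std::vector<Transition> flush() {
        std::vector<Transition> batch;
        for (const Transition& t : pending_)
            if (t.before != t.after) batch.push_back(t);
        pending_.clear();
        return batch;
    }

private:
    std::unordered_map<int, State> current_;
    std::vector<Transition> pending_;
};
```

Calling `flush()` once per pass, just before the first command that depends on the new states, is one way to keep the number of barrier calls low.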
Barriers enhance concurrency
● Resources both read and written in a given draw created a dependency between draws
  • Most common case was a UAV used in adjacent dispatches
[Diagram: panels “Logical view of draws”, “GPU timeline of draws”, and “Dispatches (D3D11)”, showing Draw 0 to Draw 3, Dispatch 0 to Dispatch 2, and a Barrier]
Barrier enables overlap
● Explicit barrier eliminates issue
  • App tells API when a true dependency exists, rather than it being assumed
[Diagram: panels “Logical view of dispatches” and “Dispatches with explicit barrier control”, each showing Dispatch 0 to Dispatch 2]
CPU side
● D3D12 simplifies picture
  • Easier to associate driver effort with application actions
  • Less likely that driver itself is the bottleneck
● Be aware of your system buses
GPU side
● Environment is new
  • Less familiar without console experience
  • Interesting new hardware limits are now accessible
● Use the tools
Wrap up
Get Ready
● D3D12 done right isn’t just an API port
  • More so when referring to consoles
● Good engine design offers a lot of opportunity
● The power you’ve been asking for is here
Questions