D3D12
A NEW MEANING FOR EFFICIENCY AND PERFORMANCE
DAVE OLDCORN, AMD
STEPHAN HODES, AMD
MAX MCMULLEN, MICROSOFT
DAN BAKER, OXIDE
5TH MARCH 2015
D3D11 to D3D12
WHAT HASN’T CHANGED
D3D12 is primarily a software change
Hardware programming model is still the same
‒Few new rendering features
3 | D3D12 A NEW MEANING FOR EFFICIENCY AND PERFORMANCE | GDC | MARCH 5TH 2015
WHAT HAS CHANGED
The software model has changed a lot
Not just in the API, but also in the underlying
philosophy
‒Closer to the hardware
‒Give more control to the application
APPLICATION IS ARBITER OF CORRECT RENDERING
Trades off safety for power
‒If D3D11 is JavaScript, D3D12 is C++
Large areas of undefined behaviour
‒... where behaviour will change with future GPUs
Use the debug layer
Stay away from the corners, don’t take risks
‒Expect “morality guides”
‒... once we know what people keep doing wrong
BROAD STROKE CHANGES D3D11 -> 12
D3D11 -> D3D12
‒Sequential API -> Queues, Command Lists
‒Small state blocks -> State object for pipeline
‒Resource binding: individual objects -> Resource binding: tables
‒Automatic synchronisation, driver tracks resource state -> Manual synchronisation, app must avoid overwrites
‒Implicit memory management by OS & driver -> Explicit memory management by application
New in D3D12
COMMAND LISTS
Each command list is executed strictly sequentially
Command lists can call out to second-level command
lists (“bundles”)
‒Some restrictions on bundles
‒Replaying bundles is OK
Top level command lists can be replayed too
‒But not until the previous submit has retired
Size them right
‒Hundreds of draws per direct list; 10+ draws per bundle
COMMAND LISTS ENABLE CPU SIDE THREADING
Command lists can be built on arbitrary threads
‒And very quickly too
Submit is thread-safe
‒Submit in batches
Consider task oriented engines
‒Divide rendering into tasks
‒Run CPU tasks to build command lists
‒Use dependencies to order GPU submission
‒Also helps with resource barriers
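A minimal sketch of the task-oriented pattern above, assuming a mock command list rather than the real D3D12 interfaces. `MockCommandList`, `recordTask` and `buildFrame` are hypothetical names. Each task records into its own list on its own thread, and the slot index fixes the GPU submission order before any recording starts.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a recorded command list; not a D3D12 type.
struct MockCommandList {
    std::string task;        // which render task recorded this list
    std::vector<int> draws;  // recorded draw ids
};

// Record one task's draws. Each worker writes only to its own list, so no lock is needed.
inline void recordTask(MockCommandList& list, const std::string& task,
                       int firstDraw, int drawCount) {
    list.task = task;
    for (int i = 0; i < drawCount; ++i) list.draws.push_back(firstDraw + i);
}

// Build every list in parallel, then return them as one batch in GPU submission order.
inline std::vector<MockCommandList> buildFrame(
        const std::vector<std::string>& tasksInGpuOrder, int drawsPerTask) {
    assert(drawsPerTask >= 0);
    std::vector<MockCommandList> lists(tasksInGpuOrder.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < lists.size(); ++i) {
        workers.emplace_back([&lists, &tasksInGpuOrder, i, drawsPerTask] {
            recordTask(lists[i], tasksInGpuOrder[i],
                       static_cast<int>(i) * drawsPerTask, drawsPerTask);
        });
    }
    for (auto& w : workers) w.join();
    // Slot i was assigned before recording, so order never depends on thread timing.
    return lists;
}
```

In a real engine the returned batch would go to a single submit call, and a job system would replace the one-thread-per-task loop used here for brevity.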
ALLOCATOR AND LIST MEMORY MANAGEMENT
Lists / Allocators manage memory
‒Hang on to their resources when reset
‒Must be destroyed to fully release
memory
‒Reuse lists / allocators on ‘similar’ data
‒Destroy if data is very dissimilar
‒Don’t use pool of lists / allocators for all
possible uses
[Figure: List / Allocator memory usage over a sequence of recordings: Initial; 100 draws; Reset; Same 100 draws; 5 draws; Different 100 draws; 200 draws. One stage is annotated “(Guaranteed no new allocations)”.]
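The reset behaviour on this slide can be modelled with a container that keeps its capacity when cleared. This is only an analogy, assuming a hypothetical `MockAllocator`; real allocator growth policies are driver-specific and not described in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a command allocator; not the D3D12 interface.
class MockAllocator {
public:
    // Record one draw's worth of commands; returns true if backing memory had to grow.
    bool recordDraw() {
        const std::size_t before = storage_.capacity();
        storage_.push_back(0);
        return storage_.capacity() != before;
    }
    // Reset forgets the recorded commands but keeps the backing memory for reuse.
    void reset() { storage_.clear(); }
    // How many draws fit without another allocation.
    std::size_t reservedDraws() const { return storage_.capacity(); }

private:
    std::vector<int> storage_;
};

// Record `draws` draws and count how many times the allocator had to grow.
inline int recordAndCountGrowth(MockAllocator& a, int draws) {
    int growths = 0;
    for (int i = 0; i < draws; ++i) growths += a.recordDraw() ? 1 : 0;
    return growths;
}
```

In this model a reset followed by the same 100 draws never grows, and recording only 5 draws still leaves the full 100-draw reservation held. That is why the slide suggests destroying, not resetting, when the next workload is very dissimilar.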
PIPELINE STATE OBJECT (PSO)
Collates most D3D11 renderstates
Compiled into hardware registers at Create time
‒Can easily be tens of ms, so use asynchronous threads
All state set onto command buffer in one go
Keep adjacent PSOs similar
Use sensible defaults for don’t care fields
Example: Rasterizer state
INT DepthBias;
FLOAT DepthBiasClamp;
FLOAT SlopeScaledDepthBias;
BOOL DepthClipEnable;
‒None of this matters if depth test is off
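One way to read “sensible defaults for don’t care fields” is to canonicalise a state description before creating or caching a PSO, so that two descriptions that differ only in ignored fields become identical. The sketch below assumes a hypothetical `MockRasterDepthState` that mirrors the four fields from the slide plus a depth-test flag. It takes the slide's statement that these fields do not matter with depth test off as given.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical mirror of the slide's rasterizer fields plus a depth-test flag.
struct MockRasterDepthState {
    bool  depthTestEnable      = true;
    int   depthBias            = 0;
    float depthBiasClamp       = 0.0f;
    float slopeScaledDepthBias = 0.0f;
    bool  depthClipEnable      = true;
};

// Force don't-care fields to fixed defaults so equivalent states compare equal.
inline MockRasterDepthState canonicalise(MockRasterDepthState s) {
    if (!s.depthTestEnable) {
        s.depthBias = 0;
        s.depthBiasClamp = 0.0f;
        s.slopeScaledDepthBias = 0.0f;
        s.depthClipEnable = true;
    }
    return s;
}

// Field-by-field comparison, usable as the equality test of a PSO cache key.
inline bool sameState(const MockRasterDepthState& a, const MockRasterDepthState& b) {
    auto key = [](const MockRasterDepthState& s) {
        return std::tie(s.depthTestEnable, s.depthBias, s.depthBiasClamp,
                        s.slopeScaledDepthBias, s.depthClipEnable);
    };
    return key(a) == key(b);
}
```

With the state canonicalised first, a cache keyed on the description avoids compiling two PSOs for what the hardware treats as one state, and adjacent PSOs differ in fewer fields.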
D3D12 RESOURCE BINDING 1
[Figure: a Root Signature holding table pointers, a root Constant Buffer View and a 32-bit constant; the table pointers reference Descriptor Tables that hold CB views, SR views and a UA view]
Table driven
Shared across all shader stages
Two-level table
‒Root Signature describes a top-level layout
‒Pointers to descriptor tables
‒Direct pointers to constant buffers
‒Inline constants
Changing which table is pointed to is cheap
‒It’s just writing a pointer; no synchronisation cost
Changing contents of table is harder
‒Can’t change table in flight on the hardware; no automatic renaming
D3D12 RESOURCE BINDING 2
Tables should be grouped by frequency of change
‒Per-draw, per-material, per-light, per-frame
‒Hint update frequency to driver by placing most frequent
changes early in root signature
D3D12 RESOURCE BINDING TIPS
Don’t overload root signature size
‒CBVs and constants in root signature should probably be
changing every draw call
‒Bulk constant data should be in CBs not root constants
Use static tables where possible
‒Associate with object and prebuild
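The two binding tips (order by update frequency, keep the root signature small) can be checked with a small budget calculator. The types below are hypothetical mocks, not D3D12 structures. The DWORD costs and the 64-DWORD limit come from the shipped D3D12 documentation rather than from these slides, so treat them as an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Limit and per-parameter costs as documented for shipped D3D12 (assumption, not from the slides).
constexpr int kRootSignatureLimitDwords = 64;

enum class RootParamKind { DescriptorTable, RootDescriptor, RootConstants };

// Hypothetical description of one root parameter.
struct MockRootParam {
    RootParamKind kind;
    int constantCount;    // number of 32-bit values; only used for RootConstants
    int changesPerFrame;  // the application's own estimate of update frequency
};

inline int costInDwords(const MockRootParam& p) {
    assert(p.constantCount >= 0);
    switch (p.kind) {
        case RootParamKind::DescriptorTable: return 1;  // one table pointer
        case RootParamKind::RootDescriptor:  return 2;  // one GPU virtual address
        case RootParamKind::RootConstants:   return p.constantCount;
    }
    return 0;
}

inline int totalDwords(const std::vector<MockRootParam>& params) {
    int total = 0;
    for (const MockRootParam& p : params) total += costInDwords(p);
    return total;
}

// Place the most frequently changing parameters first, keeping ties in their given order.
inline void sortMostFrequentFirst(std::vector<MockRootParam>& params) {
    std::stable_sort(params.begin(), params.end(),
                     [](const MockRootParam& a, const MockRootParam& b) {
                         return a.changesPerFrame > b.changesPerFrame;
                     });
}
```

A layout with a per-frame table, a per-material table, a per-draw root CBV and four per-draw root constants costs 8 DWORDs here, well inside the limit. Moving bulk constants into the root signature is what pushes the total up.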
D3D12 RESOURCE SYNCHRONISATION
No automatic synchronisation
Must insert barriers between usage
Three functions of barrier
‒Format conversion
‒e.g. antialiasing resolve or depth decompression
‒Synchronisation
‒Ensuring correct order of execution; e.g. compute use of a render output could start before
colour buffer is finished working on the data, due to pipelining
‒Visibility
‒Typically cache flushes, if unit A and unit B do not share the same visibility of the data
Barrier specifies previous and next usage and driver inserts appropriate
work
BARRIER TIPS
Group barriers into same Barrier call
‒Will take the worst case of all, rather than potentially
incurring multiple sequential barriers
Set minimal barriers
Barriers must be correct
‒Will be a gigantic headache for IHVs if not
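A sketch of the first two tips, assuming a hypothetical application-side tracker. `BarrierBatcher`, `MockResource` and the state names are illustrative, not D3D12 types. Transitions are queued, redundant ones are dropped, and everything pending is handed over as one batch that would map to a single Barrier call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical usage states; the real API defines its own set.
enum class MockState { RenderTarget, ShaderResource, DepthWrite, CopyDest };

struct MockResource { int id; MockState state; };

// One transition, carrying the previous and next usage as the slides describe.
struct MockTransition { int resourceId; MockState before; MockState after; };

class BarrierBatcher {
public:
    // Queue a transition; a transition to the current state is dropped (minimal barriers).
    void transition(MockResource& r, MockState next) {
        if (r.state == next) return;
        pending_.push_back({r.id, r.state, next});
        r.state = next;
    }
    // Hand over everything pending as one batch, standing in for a single Barrier call.
    std::vector<MockTransition> flush() {
        std::vector<MockTransition> batch;
        batch.swap(pending_);
        ++flushCount_;
        return batch;
    }
    int flushCount() const { return flushCount_; }

private:
    std::vector<MockTransition> pending_;
    int flushCount_ = 0;
};
```

The tracker only knows the states the application tells it about, so correctness still rests on the application calling `transition` before every change of usage.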
PROFILING
D3D11 was reasonably predictable in profiling
‒Limited set of accessible bottlenecks
‒Usually fairly obvious which one you’re hitting
D3D12 environment adds new factors
‒API features: flexible resource binding, concurrency
‒Hardware limits that were pretty much impossible to bump against in
D3D11
‒Even PCIe® and system memory bus
Different hardware much more likely to have divergent behaviour
‒Test on a wide range of hardware
Concurrency in D3D12
QUEUES
Graphics, compute and copy queues
‒Each is a superset of the next: a graphics queue also accepts compute and copy work, a compute queue also accepts copy work
Must specify executing queue type at record time
[Figure: Graphics, Compute and Copy queue types]
MULTIPLE QUEUES
Multiple queues of the same type supported
‒Within queue: work is ordered
‒Between separate queues work can be arbitrarily reordered
Use Fences to define work order
[Figure: Graphics Queue 1 (Shadowmap L0, Lighting L0) and Graphics Queue 2 (Shadowmap L1, Lighting L1) feeding one Graphics engine, which runs Shadowmap L0, Shadowmap L1, Lighting L1, Lighting L0]
GAME ENGINE WORKFLOW
[Figure: frame workflow diagram. Labels: Heap Defragmentation; Streaming; Dynamic Data Update; Physics; Shadowmap Rendering; Prepare; TressFX; Particle; Multiple cascades; Point/Spotlights; e.g. generate Min/Max Mips; Solid; Post Processing; Transparent Obj Rendering; e.g. Particle Rendering; G-buffer Rendering; Lighting & Shading; UI Rendering; Present]
CONCURRENCY
Graphics, compute and / or copy may run in parallel
‒Profile to verify
‒Very familiar to console programmers
[Figure: one frame spread across the Graphics Engine, Compute Engine and Copy Engine. Tasks shown: Shadowmaps; Physics; Dynamic Data Update; G-buffer; Prepare SM; Streaming; Transparent; TileDeferred; AA/AO; Defragmentation; UI; Tonemap]
DEMO TIME!
Example of gains from async compute:
‒Interleaving 2 frames
[Figure: G-buffer Rendering of one frame overlapped with Lighting & Shading of the previous frame, shown for G-buffer Rendering 1, 2, 3 and Lighting & Shading 0, 1, 2]
Sample code will be available
Sample based on DX11 work by Jason Stewart & Gareth Thomas
PARALLELISE UNALIKE WORKLOADS
Engines may compete for resources
‒Bus bandwidth
‒Shader core, texture fetch for compute / graphics
‒GPRs, Caches…
The less similar the workload, the faster each runs
Bus dominated
‒Shadow mapping
‒ROP heavy workloads
‒Many G buffer operations
‒DMA operations: texture upload, heap defrag
Shader throughput
‒Deferred lighting (usually)
‒Many postprocessing effects
‒Most compute tasks: texture compression, physics, simulations
Geometry dominated
‒Rendering highly detailed models
EXPLOITING CONCURRENCY
[Figure: alternative schedules of Stream Texture, Animate Particles, Shadow map and Deferred Lighting across queues, with outcomes labelled “Win!” and “Big Win!”]
Profile!
Can align execution across queues with fences
‒Fences have a significant cost
‒Don’t overdo this; “a few” per frame at most
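How a fence orders work between otherwise independent queues can be shown with a deterministic simulation. Everything here is a hypothetical mock with one shared fence and a round-robin scheduler. It does not model the cost of a fence, only the ordering it imposes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Op { Execute, Signal, Wait };

// Hypothetical queue entry: a piece of work, or a signal/wait on the shared fence.
struct MockCommand {
    Op op;
    std::string name;          // used by Execute
    unsigned long long value;  // used by Signal and Wait
};

using MockQueue = std::vector<MockCommand>;

// Step the queues round-robin. A queue stalls at Wait until the fence reaches the
// requested value. Returns executed work in the order it ran, and stops early if
// every remaining queue is stalled.
inline std::vector<std::string> runQueues(const std::vector<MockQueue>& queues) {
    std::vector<std::size_t> cursor(queues.size(), 0);
    std::vector<std::string> order;
    unsigned long long fence = 0;
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (std::size_t q = 0; q < queues.size(); ++q) {
            if (cursor[q] >= queues[q].size()) continue;            // queue finished
            const MockCommand& c = queues[q][cursor[q]];
            if (c.op == Op::Wait && fence < c.value) continue;      // stalled on the fence
            if (c.op == Op::Signal && c.value > fence) fence = c.value;
            if (c.op == Op::Execute) order.push_back(c.name);
            ++cursor[q];
            progressed = true;
        }
    }
    return order;
}
```

With a copy queue that streams a texture and then signals, and a graphics queue that waits before deferred lighting, the shadow map pass is free to overlap the upload while lighting is held back until the texture has arrived.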
BARRIERS AND MULTIPLE QUEUES
Barrier must be inserted on last queue to write
resource
‒Primarily this is for any required format conversion
Fences contain implicit acquire / release barriers
‒One of the reasons they have a high cost
Resource Management in D3D12
Max McMullen
Microsoft
DIRECT3D 12 RESOURCE CREATION OVERVIEW
Direct3D 11 has a simple model, create and use
Works great given the simplicity of the abstraction
A few problems for today’s titles
‒Unpredictable performance differences due to driver workarounds
‒No high performance reuse of memory in a given frame
‒Tiled Resources added on to the original abstraction
DIRECT3D 11
[Figure: API level: Buffer, Texture3D, Texture2D, Texture2D. DDI level: GPU VA ranges backed by Physical Pages]
DIRECT3D 12 RESOURCE HEAPS
Direct3D 12 separates allocation of GPU physical pages
and GPU virtual addresses from resources
Applications can better amortize the cost of physical page
allocation
‒Reuse memory for temporaries
‒Repurpose memory when the scene no longer requires it
DIRECT3D 12 RESOURCE HEAPS
[Figure: API level: Buffer, Texture3D, Texture2D, Texture2D and a Resource Heap. DDI level: GPU VA and Physical Pages]
RESOURCE HEAP PROPERTIES
Memory Pool
‒L0 – Closest to CPU
‒L1 – Closest to GPU (Discrete GPU only)
CPU Page Properties
‒Not Accessible (L0 & L1)
‒Write Combine (L0 Only)
‒Write Back (L0 Only)
Alignment
‒64 KB (Default)
‒1 MB (Enable MSAA)
SIMPLIFIED HEAP TYPES
DEFAULT
‒Memory pool: L1 (Discrete), L0 (Integrated)
‒CPU properties: No CPU access
‒Usage: Frequent GPU Read/Write
‒Max GPU Bandwidth
UPLOAD
‒Memory pool: L0
‒CPU properties: Write Combine
‒Usage: CPU Write Once, GPU Read Once
‒Max CPU Write Bandwidth
READBACK
‒Memory pool: L0
‒CPU properties: Write Back
‒Usage: GPU Write Once, CPU Read
‒Max CPU Read Bandwidth
[Additional CPU properties entry on the slide: Write Back*]
DIRECT3D 12 RESOURCE CREATION APIS
Three types of resource create
‒Committed
‒Placed
‒Reserved
Each has a different pattern of GPU VA and Physical Page
usage to enable different scenarios
DIRECT3D 12 RESOURCE CREATION APIS
[Figure: Committed, Placed and Reserved resources (Buffer, Texture3D, Texture2D) and how each relates to a Resource Heap, GPU VA and Physical Pages]
EFFICIENT HEAP USAGE
Prefer default heaps populated by upload heaps
‒Build a ring buffer out of one or more committed upload buffer resources, and leave
each buffer perpetually mapped for CPU access
‒Sequentially write data into each buffer with the CPU, aligning offsets as needed
‒Instruct the GPU to signal an increasing fence value at the end of each frame
‒Do not overwrite the data in the upload heap until the fence value indicates the GPU
has finished reading the data
Reuse upload heaps for dynamic data sent to GPU throughout rendering
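The ring-buffer steps above can be sketched as offset bookkeeping alone, assuming a hypothetical `MockUploadRing`. No D3D12 objects are involved: the caller passes in the last fence value the GPU has completed, and the ring refuses to hand out memory that an unfinished frame may still be reading.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

constexpr std::size_t kNoSpace = static_cast<std::size_t>(-1);

// Hypothetical offset allocator for one persistently mapped upload buffer.
class MockUploadRing {
public:
    explicit MockUploadRing(std::size_t capacity) : capacity_(capacity) {}

    // Reserve `size` bytes at a power-of-two `alignment`. Returns the byte offset,
    // or kNoSpace if the only candidate memory belongs to a frame the GPU has not finished.
    std::size_t allocate(std::size_t size, std::size_t alignment,
                         std::uint64_t completedFence) {
        assert(size > 0);
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        retire(completedFence);
        std::size_t offset = (head_ + alignment - 1) & ~(alignment - 1);
        std::size_t padding = offset - head_;
        if (offset + size > capacity_) {  // does not fit before the end: wrap to offset 0
            padding = capacity_ - head_;
            offset = 0;
        }
        const std::size_t total = padding + size;
        if (total > capacity_ - used_) return kNoSpace;
        used_ += total;
        frameBytes_ += total;
        head_ = (offset + size) % capacity_;
        return offset;
    }

    // Call once per frame with the fence value the GPU will signal at the end of that frame.
    void endFrame(std::uint64_t fenceValue) {
        frames_.push_back({fenceValue, frameBytes_});
        frameBytes_ = 0;
    }

private:
    struct Frame { std::uint64_t fence; std::size_t bytes; };

    // Release the bytes of every frame whose fence value the GPU has reached.
    void retire(std::uint64_t completedFence) {
        while (!frames_.empty() && frames_.front().fence <= completedFence) {
            used_ -= frames_.front().bytes;
            frames_.pop_front();
        }
    }

    std::size_t capacity_;
    std::size_t head_ = 0, used_ = 0, frameBytes_ = 0;
    std::deque<Frame> frames_;
};
```

When `allocate` returns `kNoSpace` the application has to wait on the fence or grow the ring. The sketch leaves that policy to the caller.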
PHYSICAL MEMORY REUSE
Both reserved and placed resources must follow the same rules as
Direct3D 11 tiled resources:
An aliasing barrier must be queued when physical memory is
reused with a new resource
The application must initialize the resource memory with either a
Clear or Copy operation when first using or re-using physical
memory with a render target or depth stencil resource
Efficient Memory Use in D3D12
Dan Baker
Co-Founder of Oxide Games
D3D12 MEMORY CONTROL
D3D11 – much guesswork in driver/API on where data
went, how it was referenced
Dynamic map of constant buffers makes it difficult to stream huge
quantities of data efficiently
D3D12 provides explicit control over memory mapping
‒Can create one large buffer per frame and stage all data
‒No specific need for a constant buffer – becomes application
construct if desired
HIGH THROUGHPUT RENDERING
To get the advantage of more draw calls, rendering must be hooked into
game logic
For each unit, turret, missile trail, CPU calculates
information like position or color
This data must be uploaded to the GPU as quickly as
possible
FAST DATA STREAMING TO GPU
[Figure: data path between CPU, L1 Data Cache, L2/L3 Cache, CPU Memory, GPU and GPU Memory]
STREAMING THE DATA
GPU-visible upload memory is write-combined, not cached; do not read from it with the CPU
Should always write whole cache-lines out
_mm_stream_si128
‒Non-temporal 16-byte store; four consecutive stores fill one 64-byte cache-line
‒Will bypass L2 and L3 Cache
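A minimal sketch of a whole-cache-line streaming copy built on `_mm_stream_si128`. It assumes a 64-byte cache line, a 64-byte-aligned destination and a size that is a multiple of 64; `streamCopy` is a hypothetical helper name. On targets without SSE2 it falls back to `memcpy` so the example still compiles. In real use the destination would be the mapped upload buffer, whereas here any aligned memory works.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size

// Copy whole cache lines from src to dst with non-temporal stores.
// dst must be 64-byte aligned and bytes a multiple of 64; src may be unaligned.
inline void streamCopy(void* dst, const void* src, std::size_t bytes) {
    assert(bytes % kCacheLine == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % kCacheLine == 0);
#ifdef SKETCH_HAS_SSE2
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    const std::size_t vectors = bytes / sizeof(__m128i);
    for (std::size_t i = 0; i < vectors; i += 4) {
        // Read one 64-byte line from ordinary cached memory.
        const __m128i a = _mm_loadu_si128(s + i + 0);
        const __m128i b = _mm_loadu_si128(s + i + 1);
        const __m128i c = _mm_loadu_si128(s + i + 2);
        const __m128i e = _mm_loadu_si128(s + i + 3);
        // Write the whole line back to back so it can leave as one full-line burst.
        _mm_stream_si128(d + i + 0, a);
        _mm_stream_si128(d + i + 1, b);
        _mm_stream_si128(d + i + 2, c);
        _mm_stream_si128(d + i + 3, e);
    }
    _mm_sfence();  // order the streamed stores before any later "data is ready" signal
#else
    std::memcpy(dst, src, bytes);  // portable fallback, no cache bypass
#endif
}
```

The function never reads from the destination, which matches the slide's advice not to read write-combined memory.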
REAL-WORLD D3D12 EXAMPLE
Ashes of the Singularity – new mega RTS from Oxide and
Stardock
Player may have thousands of units
Every turret, bullet and missile simulated by engine
On a heavy frame, Ashes uploads 40-50 MB of data to the
GPU; at 60 fps that is up to 3 GB/s
‒~20% of system bandwidth on DDR3
‒If stored in CPU memory with GPU fetch, would be doubled
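The arithmetic behind these figures, reading the upload as 40-50 MB per frame at 60 fps. The helper names are hypothetical, and the 15 GB/s bus figure in the usage note is only what the slide's “~20%” implies, not a number from the source.

```cpp
#include <cassert>

// Sustained upload rate in GB/s, using decimal units (1 GB = 1000 MB).
constexpr double uploadGBPerSecond(double megabytesPerFrame, double framesPerSecond) {
    return megabytesPerFrame * framesPerSecond / 1000.0;
}

// If the CPU writes system memory and the GPU then fetches it, each byte crosses
// the memory bus twice (one reading of the slide's "would be doubled").
constexpr double writeThenFetchGBPerSecond(double uploadRate) {
    return 2.0 * uploadRate;
}

// Share of a given bus bandwidth consumed, in percent.
inline double percentOfBus(double rateGBPerSecond, double busGBPerSecond) {
    assert(busGBPerSecond > 0.0);
    return 100.0 * rateGBPerSecond / busGBPerSecond;
}
```

For example, `uploadGBPerSecond(50.0, 60.0)` gives 3 GB/s. Against an assumed 15 GB/s of system bandwidth that is 20%, and the write-then-fetch path would take 6 GB/s.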
WHAT A FRAME LOOKS LIKE IN ASHES
[Figure: job timeline for the Current Frame and Next Frame across Core 1 to Core 5. The cores run Sim Jobs, AI Jobs, Game Jobs and D3D12 CMD Jobs; also shown are GPU Memory and a D3D12 Present Job]
D3D12 DEMO
Demo of Ashes of the Singularity
Questions
We are hiring!
Contact: [email protected]
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.