D3D12
A NEW MEANING FOR EFFICIENCY AND PERFORMANCE
DAVE OLDCORN, AMD
STEPHAN HODES, AMD
MAX MCMULLEN, MICROSOFT
DAN BAKER, OXIDE
5TH MARCH 2015
D3D11 to D3D12
WHAT HASN’T CHANGED
D3D12 is primarily a software change
Hardware programming model is still the same
‒Few new rendering features
3 | D3D12 A NEW MEANING FOR EFFICIENCY AND PERFORMANCE | GDC | MARCH 5TH 2015
WHAT HAS CHANGED
The software model has changed a lot
Not just in the API, but also in the underlying philosophy
‒Closer to the hardware
‒Give more control to the application
APPLICATION IS ARBITER OF CORRECT RENDERING
Trades off safety for power
‒If D3D11 is JavaScript, D3D12 is C++
Large areas of undefined behaviour
‒... which will change with future GPUs
Use the debug layer
Stay away from the corners, don’t take risks
‒Expect “morality guides”
‒... once we know what people keep doing wrong
BROAD STROKE CHANGES D3D11 -> 12
‒Sequential API -> Queues, command lists
‒Small state blocks -> State object for pipeline
‒Resource binding: individual objects -> Resource binding: tables
‒Automatic synchronisation, driver tracks resource state -> Manual synchronisation, app must avoid overwrites
‒Implicit memory management by OS & driver -> Explicit memory management by application
New in D3D12
COMMAND LISTS
Each command list is executed strictly sequentially
Command lists can call out to second-level command lists (“bundles”)
‒Some restrictions on bundles
‒Replaying bundles is OK
Top level command lists can be replayed too
‒But not until the previous submit has retired
Size them right
‒100s of draws for direct lists; 10+ draws for bundles
COMMAND LISTS ENABLE CPU SIDE THREADING
Command lists can be built on arbitrary threads
‒And very quickly too
Submit is thread-safe
‒Submit in batches
Consider task oriented engines
‒Divide rendering into tasks
‒Run CPU tasks to build command lists
‒Use dependencies to order GPU submission
‒Also helps with resource barriers
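The task-oriented approach above can be sketched without any D3D12 calls. The following is a minimal illustration, assuming each render task records one command list and declares which tasks must reach the GPU before it. `RenderTask` and `submissionOrder` are hypothetical names, not part of the API. Recording could happen on any thread; only the final submission order is derived here.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical render task: builds one command list on any worker thread.
struct RenderTask {
    std::string name;
    std::vector<std::size_t> dependsOn;  // indices of tasks that must be submitted first
};

// Return task names in a valid GPU submission order (Kahn's topological sort).
// Returns an empty vector if the dependencies contain a cycle.
// Assumes every index in dependsOn is a valid task index.
std::vector<std::string> submissionOrder(const std::vector<RenderTask>& tasks) {
    std::vector<std::size_t> pending(tasks.size(), 0);
    std::vector<std::vector<std::size_t>> dependents(tasks.size());
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        pending[i] = tasks[i].dependsOn.size();
        for (std::size_t dep : tasks[i].dependsOn) dependents[dep].push_back(i);
    }
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < tasks.size(); ++i)
        if (pending[i] == 0) ready.push(i);

    std::vector<std::string> order;
    while (!ready.empty()) {
        std::size_t t = ready.front();
        ready.pop();
        order.push_back(tasks[t].name);
        for (std::size_t next : dependents[t])
            if (--pending[next] == 0) ready.push(next);
    }
    if (order.size() != tasks.size()) order.clear();  // cycle detected
    return order;
}
```

With shadow and G-buffer tasks that have no dependencies, a lighting task that depends on both, and a post-processing task that depends on lighting, the function yields an order in which lighting follows both of its inputs. The lists could then be handed to the queue as one batch.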
ALLOCATOR AND LIST MEMORY MANAGEMENT
Lists / Allocators manage memory
‒Hang on to their resources when reset
‒Must be destroyed to fully release memory
‒Reuse lists / allocators on ‘similar’ data
‒Destroy if data is very dissimilar
‒Don’t use pool of lists / allocators for all possible uses
[Figure: list / allocator memory usage over the sequence Initial, 100 draws, Reset, Same 100 draws, 5 draws, Different 100 draws, 200 draws; one step is annotated “Guaranteed no new allocations”]
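The "hang on to memory when reset" behaviour can be modelled with a growable array. This is only an analogy under stated assumptions: `ModelAllocator` is a hypothetical class, each draw is modelled as one stored word, and real allocators size their memory in driver-specific ways.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a command allocator; not the D3D12 interface.
class ModelAllocator {
public:
    // Record `count` draws; each draw is modelled as one stored command word.
    void recordDraws(std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) commands_.push_back(0);
    }
    // Reset forgets the commands but keeps the backing memory.
    void reset() { commands_.clear(); }
    // Memory currently held, in command slots; only destruction releases it.
    std::size_t heldSlots() const { return commands_.capacity(); }

private:
    std::vector<int> commands_;
};
```

In this model, recording 100 draws, resetting, and recording the same 100 draws leaves `heldSlots()` unchanged. Recording only 5 draws afterwards still holds the larger block, which is why a pool shared across very dissimilar workloads tends to waste memory.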
PIPELINE STATE OBJECT (PSO)
Collates most D3D11 renderstates
Compiled into hardware registers at Create time
‒Can easily be tens of ms, so use asynchronous threads
All state set onto command buffer in one go
Keep adjacent PSOs similar
Use sensible defaults for don’t care fields
Example: Rasterizer state
INT DepthBias;
FLOAT DepthBiasClamp;
FLOAT SlopeScaledDepthBias;
BOOL DepthClipEnable;
‒None of this matters if depth test is off
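One way to apply "sensible defaults for don't care fields" is to canonicalise a pipeline description before using it as a cache key, so that equivalent descriptions map to the same PSO. The sketch below is an assumption about how an engine might do this. `RasterFields`, `PsoKey`, `canonicalize` and `sameKey` are hypothetical names. As a conservative choice it resets only the three depth-bias fields and leaves `depthClipEnable` untouched.

```cpp
#include <cassert>

// Hypothetical mirror of the four rasterizer fields shown on the slide.
struct RasterFields {
    int   depthBias;
    float depthBiasClamp;
    float slopeScaledDepthBias;
    bool  depthClipEnable;
};

// Hypothetical key used to look up an already-created PSO.
struct PsoKey {
    bool depthTestEnabled;
    RasterFields raster;
};

// Overwrite "don't care" bias fields with fixed defaults when depth test is
// off, so that equivalent descriptions produce identical keys.
PsoKey canonicalize(PsoKey key) {
    if (!key.depthTestEnabled) {
        key.raster.depthBias = 0;
        key.raster.depthBiasClamp = 0.0f;
        key.raster.slopeScaledDepthBias = 0.0f;
    }
    return key;
}

// Field-by-field comparison of two keys.
bool sameKey(const PsoKey& a, const PsoKey& b) {
    return a.depthTestEnabled == b.depthTestEnabled &&
           a.raster.depthBias == b.raster.depthBias &&
           a.raster.depthBiasClamp == b.raster.depthBiasClamp &&
           a.raster.slopeScaledDepthBias == b.raster.slopeScaledDepthBias &&
           a.raster.depthClipEnable == b.raster.depthClipEnable;
}
```

Two descriptions that differ only in unused bias values compare equal after `canonicalize`, so a cache keyed this way would compile one PSO instead of two.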
D3D12 RESOURCE BINDING 1
Table driven
Shared across all shader stages
Two-level table
‒Root Signature describes a top-level layout
‒Pointers to descriptor tables
‒Direct pointers to constant buffers
‒Inline constants
Changing which table is pointed to is cheap
‒It’s just writing a pointer; no synchronisation cost
Changing contents of table is harder
‒Can’t change table in flight on the hardware; no automatic renaming
[Figure: a Root Signature holding table pointers, a root Constant Buffer View and a 32-bit constant; the table pointers reference Descriptor Tables containing CB views, SR views and a UA view]
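Because there is no automatic renaming, an application that needs a modified table while the old one may still be in flight has to write a new copy and repoint. The following is a simplified model of that idea, assuming a descriptor heap can be treated as a flat array and a table as a range within it. `ModelHeap` and `renameTable` are hypothetical, and descriptors are represented as strings purely for readability.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model: a descriptor heap is a flat array, a table is a range in it.
struct ModelHeap {
    std::vector<std::string> descriptors;

    // Append a new table and return its start offset (the "table pointer").
    std::size_t appendTable(const std::vector<std::string>& views) {
        std::size_t start = descriptors.size();
        descriptors.insert(descriptors.end(), views.begin(), views.end());
        return start;
    }
};

// Manual renaming: instead of editing a table the GPU may still be reading,
// write a modified copy elsewhere in the heap and return the new table pointer.
std::size_t renameTable(ModelHeap& heap, std::size_t oldStart, std::size_t count,
                        std::size_t slot, const std::string& newView) {
    auto first = heap.descriptors.begin() + static_cast<std::ptrdiff_t>(oldStart);
    std::vector<std::string> copy(first, first + static_cast<std::ptrdiff_t>(count));
    copy[slot] = newView;
    return heap.appendTable(copy);
}
```

The old range is left untouched, so draws already recorded against it still see the original descriptors. Later draws only need the cheap operation of pointing at the new offset.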
D3D12 RESOURCE BINDING 2
Tables should be grouped by frequency of change
‒Per-draw, per-material, per-light, per-frame
‒Hint update frequency to driver by placing most frequent changes early in root signature
D3D12 RESOURCE BINDING TIPS
Don’t overload root signature size
‒CBVs and constants in root signature should probably be changing every draw call
‒Bulk constant data should be in CBs not root constants
Use static tables where possible
‒Associate with object and prebuild
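A planning helper can make the size and ordering advice concrete. The sketch below relies on the documented D3D12 costs (a descriptor table costs 1 DWORD, a root descriptor such as a root CBV costs 2 DWORDs, each 32-bit root constant costs 1 DWORD, with a 64 DWORD limit). The types, the function names and the `changesPerFrame` estimate are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class RootKind { DescriptorTable, RootCbv, RootConstants };

// Hypothetical description of one root parameter, for planning purposes only.
struct RootParam {
    RootKind kind;
    unsigned numConstants;     // only used for RootConstants
    unsigned changesPerFrame;  // app-side estimate of update frequency
};

// Cost in DWORDs: table pointer = 1, root CBV = 2, each 32-bit constant = 1.
unsigned costInDwords(const RootParam& p) {
    switch (p.kind) {
        case RootKind::DescriptorTable: return 1;
        case RootKind::RootCbv:         return 2;
        case RootKind::RootConstants:   return p.numConstants;
    }
    return 0;
}

unsigned totalCost(const std::vector<RootParam>& params) {
    unsigned total = 0;
    for (const RootParam& p : params) total += costInDwords(p);
    return total;
}

// Place the most frequently changing parameters first, as the slides suggest.
void orderByFrequency(std::vector<RootParam>& params) {
    std::stable_sort(params.begin(), params.end(),
                     [](const RootParam& a, const RootParam& b) {
                         return a.changesPerFrame > b.changesPerFrame;
                     });
}

constexpr unsigned kMaxRootDwords = 64;  // D3D12 root signature size limit
```

A layout with two tables, one root CBV and four root constants costs 8 DWORDs, well under the limit. After `orderByFrequency`, the per-draw constants and CBV sit ahead of the per-material and per-frame tables.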
D3D12 RESOURCE SYNCHRONISATION
No automatic synchronisation
Must insert barriers between usage
Three functions of barrier
‒Format conversion, e.g. antialiasing resolve or depth decompression
‒Synchronisation: ensuring correct order of execution; e.g. compute use of a render output could start before the colour buffer has finished working on the data, due to pipelining
‒Visibility: typically cache flushes, if unit A and unit B do not share the same visibility of the data
Barrier specifies previous and next usage and driver inserts appropriate work
BARRIER TIPS
Group barriers into same Barrier call
‒Will take the worst case of all, rather than potentially incurring multiple sequential barriers
Set minimal barriers
Barriers must be correct
‒Will be a gigantic headache for IHVs if not
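Grouping and minimising barriers is often handled by a small tracking layer in the application. The following is a minimal sketch of such a layer, assuming one tracked state per resource. `State`, `Transition` and `BarrierBatch` are hypothetical, and the result of `flush()` stands in for the array that would be passed to a single barrier call.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class State { RenderTarget, ShaderRead, CopyDest, DepthWrite };

struct Transition { int resource; State before; State after; };

// Hypothetical helper that batches transitions so they reach the API in one call.
class BarrierBatch {
public:
    explicit BarrierBatch(std::vector<State> initial) : current_(std::move(initial)) {}

    // Queue a transition only if the state actually changes (minimal barriers).
    void require(int resource, State wanted) {
        State& now = current_[static_cast<std::size_t>(resource)];
        if (now == wanted) return;
        pending_.push_back({resource, now, wanted});
        now = wanted;
    }

    // Hand back everything queued so far; the caller would issue it as ONE call.
    std::vector<Transition> flush() {
        std::vector<Transition> out;
        out.swap(pending_);
        return out;
    }

private:
    std::vector<State> current_;
    std::vector<Transition> pending_;
};
```

Requesting a shader-read state for three resources, one of which is already readable, yields two transitions in one group. The redundant request produces no barrier at all.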
PROFILING
D3D11 was reasonably predictable in profiling
‒Limited set of accessible bottlenecks
‒Usually fairly obvious which one you’re hitting
D3D12 environment adds new factors
‒API features: flexible resource binding, concurrency
‒Hardware limits that were pretty much impossible to bump against in D3D11
‒Even PCIe® and system memory bus
Different hardware much more likely to have divergent behaviour
‒Test on a wide range of hardware
Concurrency in D3D12
QUEUES
Graphics, compute and copy queues
Each is a superset of the next: graphics queues also accept compute and copy work, and compute queues also accept copy work
Must specify executing queue type at record time
[Figure: Graphics, Compute and Copy queue types]
MULTIPLE QUEUES
Multiple queues of the same type supported
‒Within queue: work is ordered
‒Between separate queues work can be arbitrarily reordered
Use Fences to define work order
[Figure: Graphics Queue 1 (Shadowmap L0, Lighting L0) and Graphics Queue 2 (Shadowmap L1, Lighting L1) feeding one Graphics engine, which runs Shadowmap L0, Shadowmap L1, Lighting L1, Lighting L0]
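The ordering rules can be illustrated with a toy scheduler: each queue runs its own commands in order, queues are interleaved arbitrarily, and a fence is the only thing that forces one queue to wait for another. This is a single-threaded simulation under stated assumptions, not D3D12 code. `Cmd` and `runQueues` are hypothetical, and the scheduler deliberately visits queues back to front to show that cross-queue order is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical queue command: do work, signal a fence value, or wait for one.
struct Cmd {
    enum Kind { Work, Signal, Wait } kind;
    std::string name;      // used by Work
    std::uint64_t value;   // used by Signal / Wait
};

// Run several queues against one shared fence. Each queue executes in order;
// a queue stalls on Wait until the fence has reached the requested value.
std::vector<std::string> runQueues(const std::vector<std::vector<Cmd>>& queues) {
    std::uint64_t fence = 0;
    std::vector<std::size_t> next(queues.size(), 0);
    std::vector<std::string> executed;
    bool progressed = true;
    while (progressed) {
        progressed = false;
        // Visit queues back to front to mimic an unhelpful scheduler.
        for (std::size_t q = queues.size(); q-- > 0;) {
            if (next[q] >= queues[q].size()) continue;
            const Cmd& c = queues[q][next[q]];
            if (c.kind == Cmd::Wait && fence < c.value) continue;  // stalled
            if (c.kind == Cmd::Signal && c.value > fence) fence = c.value;
            if (c.kind == Cmd::Work) executed.push_back(c.name);
            ++next[q];
            progressed = true;
        }
    }
    return executed;
}
```

Without a fence, the lighting work on the second queue runs before the shadow map on the first. Adding a signal after the shadow map and a matching wait before the lighting restores the intended order.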
GAME ENGINE WORKFLOW
[Figure: frame workflow. Stage labels: Heap Defragmentation, Streaming, Dynamic Data Update, Physics, Shadowmap Rendering, Prepare, G-buffer Rendering, Lighting & Shading, Solid Post Processing, Transparent Obj Rendering, Post Processing, UI Rendering, Present. Annotations: TressFX, Particle, Multiple cascades, Point/Spotlights, e.g. generate Min/Max Mips, e.g. Particle Rendering]
CONCURRENCY
Graphics, compute and / or copy may run in parallel
‒Profile to verify
‒Very familiar to console programmers
[Figure: example schedule across the Graphics Engine, Compute Engine and Copy Engine. Tasks shown: Shadowmaps, Physics, Dynamic Data Update, G-buffer, Prepare SM, Streaming, Transparent, TileDeferred, AA/AO, Defragmentation, UI, Tonemap]
DEMO TIME!
Example of gains from async compute:
‒Interleaving 2 frames
[Figure: G-buffer Rendering 1 alongside Lighting & Shading 0, G-buffer Rendering 2 alongside Lighting & Shading 1, G-buffer Rendering 3 alongside Lighting & Shading 2]
Sample code will be available
Sample based on DX11 work by Jason Stewart & Gareth Thomas
PARALLELISE UNALIKE WORKLOADS
Engines may compete for resources
‒Bus bandwidth
‒Shader core, texture fetch for compute / graphics
‒GPRs, Caches…
The less similar the workload, the faster each runs
Bus dominated
‒Shadow mapping
‒ROP heavy workloads
‒Many G buffer operations
‒DMA operations (texture upload, heap defrag)
Shader throughput
‒Deferred lighting (usually)
‒Many postprocessing effects
‒Most compute tasks (texture compression, physics, simulations)
Geometry dominated
‒Rendering highly detailed models
EXPLOITING CONCURRENCY
[Figure: three alternative schedules of Stream Texture, Animate Particles, Shadow map and Deferred Lighting across queues, captioned “Win!”, “Big Win!” and “Profile!”]
Can align execution across queues with fences
‒Fences have a significant cost
‒Don’t overdo this; “a few” per frame at most
BARRIERS AND MULTIPLE QUEUES
Barrier must be inserted on last queue to write resource
‒Primarily this is for any required format conversion
Fences contain implicit acquire / release barriers
‒One of the reasons they have a high cost
Resource Management in D3D12
Max McMullen
Microsoft
DIRECT3D 12 RESOURCE CREATION OVERVIEW
Direct3D 11 has a simple model, create and use
Works great given the simplicity of the abstraction
A few problems for today’s titles
‒Unpredictable performance differences due to driver workarounds
‒No high performance reuse of memory in a given frame
‒Tiled Resources added on to the original abstraction
DIRECT3D 11
[Figure: API-level resources (Buffer, Texture3D, Texture2D, Texture2D) shown above DDI-level GPU VA ranges and Physical Pages]
DIRECT3D 12 RESOURCE HEAPS
Direct3D 12 separates allocation of GPU physical pages and GPU virtual addresses from resources
Applications can better amortize the cost of physical page allocation
‒Reuse memory for temporaries
‒Repurpose memory when the scene no longer requires it
DIRECT3D 12 RESOURCE HEAPS
[Figure: API-level resources (Buffer, Texture3D, Texture2D, Texture2D) and a Resource Heap, shown above the DDI-level GPU VA and Physical Pages]
RESOURCE HEAP PROPERTIES
Memory Pool
‒L0 – Closest to CPU
‒L1 – Closest to GPU (Discrete GPU only)
CPU Page Properties
‒Not Accessible (L0 & L1)
‒Write Combine (L0 Only)
‒Write Back (L0 Only)
Alignment
‒64 KB (Default)
‒4 MB (Enable MSAA)
SIMPLIFIED HEAP TYPES
DEFAULT
‒Memory pool: L1 (Discrete), L0 (Integrated)
‒CPU properties: No CPU access
‒Usage: Frequent GPU Read/Write
‒Max GPU Bandwidth
UPLOAD
‒Memory pool: L0
‒CPU properties: Write Combine
‒Usage: CPU Write Once, GPU Read Once
‒Max CPU Write Bandwidth
READBACK
‒Memory pool: L0
‒CPU properties: Write Back
‒Usage: GPU Write Once, CPU Read
‒Max CPU Read Bandwidth
[The slide also carries a “Write Back*” label whose position and footnote did not survive extraction]
DIRECT3D 12 RESOURCE CREATION APIS
Three types of resource create
‒Committed
‒Placed
‒Reserved
Each has a different pattern of GPU VA and Physical Page usage to enable different scenarios
DIRECT3D 12 RESOURCE CREATION APIS
[Figure: Committed, Placed and Reserved creation compared for Buffer, Texture3D and Texture2D resources, showing which of Resource Heap, GPU VA and Physical Pages each involves]
EFFICIENT HEAP USAGE
Prefer default heaps populated by upload heaps
‒Build a ring buffer out of one or more committed upload buffer resources, and leave each buffer perpetually mapped for CPU access
‒Sequentially write data into each buffer with the CPU, aligning offsets as needed
‒Instruct the GPU to signal an increasing fence value at the end of each frame
‒Do not overwrite the data in the upload heap until the fence value indicates the GPU has finished reading the data
Reuse upload heaps for dynamic data sent to GPU throughout rendering
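The ring-buffer steps above can be sketched as CPU-side bookkeeping, independent of the API. This is a minimal illustration that assumes a single buffer, power-of-two alignments, and frames that retire in submission order. `UploadRing` and its methods are hypothetical; the fence values are whatever the application asks the GPU to signal at the end of each frame.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical CPU-side bookkeeping for one persistently mapped upload buffer.
class UploadRing {
public:
    explicit UploadRing(std::size_t capacity) : capacity_(capacity) {}

    // Reserve `size` bytes at a power-of-two `alignment`. Returns false when
    // the space is still owned by frames the GPU has not finished reading.
    bool allocate(std::size_t size, std::size_t alignment, std::size_t& offsetOut) {
        std::size_t start = (head_ + alignment - 1) & ~(alignment - 1);
        std::size_t padding = start - head_;
        if (start + size > capacity_) {  // does not fit before the end: wrap
            padding = capacity_ - head_;
            start = 0;
        }
        if (used_ + padding + size > capacity_) return false;
        head_ = start + size;
        used_ += padding + size;
        frameBytes_ += padding + size;
        offsetOut = start;
        return true;
    }

    // Call once per frame with the fence value the GPU will signal for it.
    void endFrame(std::uint64_t fenceValue) {
        inFlight_.push_back({fenceValue, frameBytes_});
        frameBytes_ = 0;
    }

    // Call with the fence value the GPU has completed; frees retired frames.
    void retire(std::uint64_t completedValue) {
        while (!inFlight_.empty() && inFlight_.front().fence <= completedValue) {
            used_ -= inFlight_.front().bytes;
            inFlight_.pop_front();
        }
    }

private:
    struct Frame { std::uint64_t fence; std::size_t bytes; };
    std::size_t capacity_;
    std::size_t head_ = 0;        // next write offset
    std::size_t used_ = 0;        // bytes owned by unretired frames, incl. padding
    std::size_t frameBytes_ = 0;  // bytes reserved in the frame being built
    std::deque<Frame> inFlight_;
};
```

With a 256-byte ring, two 100-byte allocations fill most of the buffer. A third allocation fails until `retire` is called with the first frame's fence value, after which the ring wraps and hands out offset 0 again.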
PHYSICAL MEMORY REUSE
Both reserved and placed resources must follow the same rules as Direct3D 11 tiled resources:
‒An aliasing barrier must be queued when physical memory is reused with a new resource
‒The application must initialize the resource memory with either a Clear or Copy operation when first using or re-using physical memory with a render target or depth stencil resource
Efficient Memory Use in D3D12
Dan Baker
Co-Founder of Oxide Games
D3D12 MEMORY CONTROL
D3D11: much guesswork in driver/API about where data went and how it was referenced
Dynamic mapping of constant buffers makes it difficult to stream huge quantities of data efficiently
D3D12 provides explicit control over memory mapping
‒Can create one large buffer per frame and stage all data
‒No specific need for a constant buffer; it becomes an application construct if desired
HIGH THROUGHPUT RENDERING
To take advantage of draw-call throughput, rendering must be hooked into game logic
For each unit, turret and missile trail, the CPU calculates information such as position or color
This data must be uploaded to the GPU as quickly as possible
FAST DATA STREAMING TO GPU
[Figure: data path between the CPU, L1 Data Cache, L2/L3 Cache, CPU Memory, GPU and GPU Memory]
STREAMING THE DATA
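Mapped GPU memory is write-combined, not cached: do not read from it
Should always write whole cache lines out
_mm_stream_si128
‒Stores 16 bytes per call; fill each cache line with consecutive stores so it is written out as a whole line
‒Will bypass the CPU caches, including L2 and L3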
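A copy loop built on `_mm_stream_si128` might look like the sketch below. It assumes a 64-byte cache line, a 64-byte-aligned destination, and a size that is a multiple of 64. `streamCopy` and `kCacheLine` are hypothetical names, and the `memcpy` branch exists only so the sketch compiles on targets without SSE2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

// Copy whole cache lines from `src` to `dst` with non-temporal stores.
// Assumes `dst` is 64-byte aligned and `bytes` is a multiple of 64.
void streamCopy(void* dst, const void* src, std::size_t bytes) {
#if SKETCH_HAS_SSE2
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    for (std::size_t line = 0; line < bytes / kCacheLine; ++line) {
        // Four 16-byte stores back to back fill one 64-byte line.
        for (std::size_t i = 0; i < 4; ++i) {
            __m128i v = _mm_loadu_si128(s + line * 4 + i);
            _mm_stream_si128(d + line * 4 + i, v);
        }
    }
    _mm_sfence();  // order the streamed stores before any later signal
#else
    std::memcpy(dst, src, bytes);  // portable fallback for non-SSE2 builds
#endif
}
```

In real use `dst` would point into a persistently mapped upload buffer. The loop only ever writes to it, which is consistent with the advice not to read from that memory.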
REAL-WORLD D3D12 EXAMPLE
Ashes of the Singularity: new mega RTS from Oxide and Stardock
Player may have thousands of units
Every turret, bullet and missile simulated by engine
On a heavy frame, Ashes uploads 40-50 MB of data to the GPU; at 60 fps that is up to 3 GB/s
‒~20% of system bandwidth on DDR3
‒If stored in CPU memory with GPU fetch, would be doubled
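The bandwidth figures can be checked with a few lines of arithmetic, reading the slide's numbers as 40-50 MB uploaded per frame at 60 fps. The 15 GB/s bus figure used below is an assumption chosen to match the slide's "~20%" estimate, not a measured value, and both function names are hypothetical.

```cpp
#include <cassert>

// Sustained upload rate in GB/s for a given per-frame upload size (decimal units).
constexpr double uploadGBps(double megabytesPerFrame, double framesPerSecond) {
    return megabytesPerFrame * framesPerSecond / 1000.0;
}

// Fraction of a memory bus consumed, given its peak bandwidth in GB/s.
constexpr double busShare(double uploadGBperSec, double busGBperSec) {
    return uploadGBperSec / busGBperSec;
}
```

For example, `uploadGBps(50.0, 60.0)` gives 3 GB/s, and `busShare(3.0, 15.0)` gives roughly 0.2. Keeping the data in CPU memory for the GPU to fetch would add a read on top of the write, which is the doubling the slide refers to.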
WHAT A FRAME LOOKS LIKE IN ASHES
[Figure: job timeline across Core 1 to Core 5 for the Current Frame and the Next Frame. Job labels: Sim Job, Game Job, AI Job, D3D12 CMD Job, D3D12 Present Job; the timeline also shows GPU Memory]
D3D12 DEMO
Demo of Ashes of the Singularity
Questions
We are hiring!
Contact: [email protected]
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.