Direct3D 10 and Beyond
Peter-Pike Sloan
Microsoft Corporation
The Direct3D 10 System
• Latest step in GPU evolution
• Coming to millions of PCs near you
• Large, complex system
• General overview and a few highlights
• Motivations
• Discuss current post-Direct3D 10 thoughts
Prior Work
• < 2001: Fixed Function Hardware (Direct3D 7, OpenGL 1.4); Ad Hoc Multipass
• 2001: Programmable Vertex Processing (Direct3D 8, OpenGL 1.5); Assembly Programming
• 2002-3: Programmable Fragment Processing (Direct3D 9, OpenGL 2.0); High Level Shading Languages
• 2004+: Primitive Processing, Unified Programming (Direct3D 10); More CPU-like features
[Timeline axis: Increasing programmability]
Design Process
• Collaboration with
• Application Developers (ISVs)
• Hardware Developers (IHVs)
• Iterative process
• Start - spring 2003
• Spec - fall 2004
• HW implementations - 2006
[Diagram: ISV1, ISV2, …, ISVn and IHV1, IHV2, …, IHVm each connected to the DirectX Team]
Constraints & Problems
• Preserve
• data parallelism
• memory system efficiency
• coherence
• determinism
• Improve
• state change agility
• implementation consistency
• program expressiveness
• resource limitations
• CPU offload
Performance/$$
Visual Complexity
Guiding Decisions
• Narrow gap between abstraction and implementation
• Improve overall system efficiency
• Avoid undefined behavior
• Avoid de facto defined behavior problems
• Avoid promising generality that can’t be delivered
• If you specify CPU generality, you will get CPU performance
• No new API support for older hardware
• Allows fixed feature set, tighter behavior compliance
• Cull unnecessary fixed-functions
• Performance-per-watt and -per-$$ informs what to retain
System Architecture
[Diagram: Input Assembler → Vertex Shader → Geometry Shader → Stream Out and Setup/Rasterization → Pixel Shader → Output Merger. Memory resources: Vertex Buffer, Index Buffer, Texture, Buffer, Depth, Color]
• Logical pipeline
• Programmer’s view
System Architecture
• Input assembler
• Fixed-function
• Canonicalize vertex data
• Generate IDs
• Primitive, vertex, instance
System Architecture
• Vertex shader
• Programmable
• Vertex transformations
• 1 vertex in, 1 out
• Read from memory
System Architecture
• Geometry Shader
• New, programmable
• Per-primitive processing
• 1 prim in, k prims out
• Read from memory
System Architecture
• Stream Out
• New, fixed-function
• Divert primitive data to 1D buffers
• 1 in, 1 out
• Write to memory
System Architecture
• Setup/Rasterization
• Fixed-function
• Clipping, divide by w
• Convert primitives to fragments
• 1 prim in, m frags out
System Architecture
• Pixel Shader
• Programmable
• Shade fragments
• 1 frag in, 0 or 1 out
• Read from memory
System Architecture
• Output Merger
• Fixed function
• Depth/stencil tests
• Color buffer blending
• Read/modify/write to memory
System Architecture
• Common programmable “Core”
• Same ISA
• Flexible memory objects
• Reuse at different stages
• Array forms of memory objects
• Indexes generated in shaders
Geometry Shader
• Entire primitive as input
• Adjacency Optional
• Outputs zero or more primitives
• 1024 scalars out max
Geometry Shader
• Programmable Setup
• Generate barycentric coordinates, interpolate arbitrary amount of data downstream
• Quadratic interpolation over triangles
• Data stored/computed at edge midpoints
• Basis functions simple polynomials of barycentric coordinates
• Analytic gradients
[Figure: triangle with vertices (0,0), (1,0), (0,1)]
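The quadratic interpolation described on this slide can be sketched on the CPU as follows. This is a minimal illustration, not shader code from the talk: it assumes the standard six-node quadratic basis (three vertices plus three edge midpoints), and the function names and the node ordering are hypothetical choices made for the example.

```python
def quadratic_basis(l0, l1, l2):
    """Six quadratic basis functions of barycentric coordinates (l0, l1, l2)."""
    return [
        l0 * (2.0 * l0 - 1.0),  # vertex 0
        l1 * (2.0 * l1 - 1.0),  # vertex 1
        l2 * (2.0 * l2 - 1.0),  # vertex 2
        4.0 * l0 * l1,          # midpoint of edge 0-1
        4.0 * l1 * l2,          # midpoint of edge 1-2
        4.0 * l2 * l0,          # midpoint of edge 2-0
    ]


def interpolate(values, l0, l1):
    """Interpolate six nodal values at barycentric (l0, l1, 1 - l0 - l1)."""
    l2 = 1.0 - l0 - l1
    return sum(w * v for w, v in zip(quadratic_basis(l0, l1, l2), values))


def gradient(values, l0, l1):
    """Analytic gradient with respect to (l0, l1), using l2 = 1 - l0 - l1."""
    l2 = 1.0 - l0 - l1
    d_l0 = [4.0 * l0 - 1.0, 0.0, -(4.0 * l2 - 1.0),
            4.0 * l1, -4.0 * l1, 4.0 * (l2 - l0)]
    d_l1 = [0.0, 4.0 * l1 - 1.0, -(4.0 * l2 - 1.0),
            4.0 * l0, 4.0 * (l2 - l1), -4.0 * l0]
    return (sum(d * v for d, v in zip(d_l0, values)),
            sum(d * v for d, v in zip(d_l1, values)))


# Assumed layout: vertex 0 = (1,0), vertex 1 = (0,1), vertex 2 = (0,0),
# so that x = l0 and y = l1. Nodal values sample f(x, y) = x*x + x*y + 3.
nodal_values = [4.0, 3.0, 3.0, 3.5, 3.0, 3.25]
value = interpolate(nodal_values, 0.2, 0.3)
grad = gradient(nodal_values, 0.2, 0.3)
```

Because the sampled function is itself quadratic, the interpolant reproduces it exactly (up to rounding), which makes the sketch easy to check by hand.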
Geometry Shader
• Amplify geometry
• Expand Point Sprites
• Extrude silhouettes
• Extrude prisms/tets
[Hirche04]
Geometry Shader
• Generate Array Index for render target array
• E.g., render to cube map
• Treat cube map as 6-element array
• Emit primitive multiple times
• Per-cube face transform + array index
[Diagram: GS writes to Render Target Array elements 0 through 5]
Determinism & Parallelism
• Allow parallel processing but preserve serial order
• Buffer GS outputs (on chip)
• Limit output to 1K 32-bit values
• Application can specify less
• May allow greater parallelism
[Diagram: inputs 1, 2, …, n feed parallel GS units; example shows expansion to 2 triangles]
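The idea of running geometry shader invocations in parallel while emitting results in input order can be sketched on the CPU. This is only an analogy under stated assumptions: the thread pool stands in for parallel GS units, `expand_point_to_quad` is a hypothetical shader, and the per-invocation result lists play the role of the on-chip output buffers. Only the 1K-scalar cap comes from the slide.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_OUTPUT_SCALARS = 1024  # cap stated on the slide: 1K 32-bit values


def expand_point_to_quad(point):
    """Hypothetical geometry shader: one point in, two triangles out."""
    x, y, size = point
    a, b = (x - size, y - size), (x + size, y - size)
    c, d = (x + size, y + size), (x - size, y + size)
    return [(a, b, c), (a, c, d)]


def run_geometry_shader(shader, primitives,
                        declared_max_scalars=MAX_OUTPUT_SCALARS):
    """Run `shader` on every primitive in parallel; emit in input order."""
    if declared_max_scalars > MAX_OUTPUT_SCALARS:
        raise ValueError("declared output exceeds the 1K scalar cap")

    def invoke(primitive):
        output = shader(primitive)
        scalars = sum(len(vertex) for tri in output for vertex in tri)
        if scalars > declared_max_scalars:
            raise ValueError("shader exceeded its declared output size")
        return output  # buffered per invocation until it is this input's turn

    with ThreadPoolExecutor(max_workers=4) as pool:
        # Executor.map yields results in input order, whatever order the
        # workers finish in, so serial order is preserved.
        per_input = list(pool.map(invoke, primitives))
    return [tri for output in per_input for tri in output]


points = [(float(i), 0.0, 0.5) for i in range(8)]
triangles = run_geometry_shader(expand_point_to_quad, points)
```

Declaring a smaller maximum than the cap (for example 12 scalars for this shader) mirrors the slide's point that an application can specify less, which in hardware may leave room for more invocations in flight.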
Stream Out
• Data from VS/GS can optionally be streamed out to a buffer
• 32 bits per component (int or float)
• Either a single buffer of up to 16 elements (64 scalars max) with flexible stride
• Or up to 4 buffers that each hold a single element with unit element stride
• Always sent to rasterizer if rasterizer is enabled
Stream Out
• Generated geometry easily redrawn using DrawAuto() command with no CPU intervention
Multi-Stream Output
• Array-of-structures vs. structure-of-arrays
[Figure: the array-of-structures buffer interleaves Position, Color, Normal, Texture per vertex; the structure-of-arrays layout keeps separate Position, Color, Normal, and Texture buffers]
• Input Assembler supports both types as vertex buffers
• Both styles are useful
• Access pattern vs. memory coherency
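The two layouts can be illustrated with a small CPU-side sketch. It is not Direct3D code: the vertex format (a float3 position and a float3 color), the helper names, and the use of Python's `struct` module to pack 32-bit floats are assumptions made for the example.

```python
import struct

vertices = [
    {"position": (0.0, 0.0, 0.0), "color": (1.0, 0.0, 0.0)},
    {"position": (1.0, 0.0, 0.0), "color": (0.0, 1.0, 0.0)},
]

AOS_STRIDE = 24       # bytes per vertex: 3 floats position + 3 floats color
POSITION_STRIDE = 12  # bytes per element in the position-only buffer


def pack_aos(verts):
    """Array of structures: one buffer, elements interleaved per vertex."""
    return b"".join(
        struct.pack("<3f3f", *v["position"], *v["color"]) for v in verts)


def pack_soa(verts):
    """Structure of arrays: one tightly packed buffer per element."""
    return {
        "position": b"".join(struct.pack("<3f", *v["position"]) for v in verts),
        "color": b"".join(struct.pack("<3f", *v["color"]) for v in verts),
    }


def read_position_aos(buffer, index):
    # Positions are AOS_STRIDE bytes apart; colors sit in between.
    return struct.unpack_from("<3f", buffer, index * AOS_STRIDE)


def read_position_soa(buffers, index):
    # Positions are contiguous; a position-only pass touches no color data.
    return struct.unpack_from("<3f", buffers["position"],
                              index * POSITION_STRIDE)


aos = pack_aos(vertices)
soa = pack_soa(vertices)
```

Reading every attribute of one vertex favors the interleaved buffer, while a pass that needs only positions reads contiguous memory in the split layout, which is the access pattern vs. memory coherency trade-off named above.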
Multi-Stream Output
• Add multiple stream capability
• Compromise - support
• 1 multi-element stream with up to 16 elements (AoS)
• Up to 4 single-element (SoA) streams
• Future expansion
Programmability
• Virtual machine model
• Machine-independent intermediate language (IL)
• Just in time translation (JIT) in hardware driver
• When shader program object is created
[Diagram: HLSL Program → HLSL Compiler → IL → JIT in Driver → Program Object]
The Virtual Machine
                      Direct3D 9   Direct3D 10
Instructions          64K/512      unlimited
Textures              16           128
Temporary registers   16           32
Constants             256          4Kx16
Interstage registers  16           32
2D texture            4Kx4K        8Kx8K
Render targets        4            8
• New Features
• Integer instruction set
• Load instruction (no store!)
• IEEE-754 format & ~accuracy
• Separate samplers & textures
• Writable private memory
Shading A Triangle
[Diagram: Per-Level, Per-Frame, Per-Instance, Per-Primitive, and Per-Vertex Data, plus Textures A, B, and C, feed the Vertex Shader and Pixel Shader]
• Static light positions
• Dynamic light positions
• Camera positions
• View/Projection Matrices
• Bone Matrices
• LOD
• Material Parameters
• Normals, Positions, Texcoords
Constants in Direct3D 9
[Diagram: Per-Level, Per-Frame, Per-Instance, Per-Primitive, and Per-Vertex Data, plus Textures A, B, and C, reach the Vertex Shader and Pixel Shader through the VS and PS constants; each value is set by its own call: SetConstant(), SetConstant(), SetConstant(), ...]
Constant Buffers
[Diagram labels: Per-Level Data, Per-Frame Data, Per-Instance Data, Per-Primitive Data, Per-Vertex Data]
• Split parameters into buffers
• Organize by update frequency
• Bulk update any buffer
• Bind up to 16 buffers/shader
• Sounds like 1D textures
• But, access pattern is different
• Uniform vs. Non-uniform index
• Frequent vs. Infrequent access
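The effect of grouping parameters by update frequency can be sketched by counting calls. This is a toy model, not the Direct3D API: `CountingDevice`, `set_constant`, `update_buffer`, and the parameter names are all hypothetical, and the only thing measured is how many calls each style issues.

```python
class CountingDevice:
    """Hypothetical stand-in for a graphics device; it only counts calls."""

    def __init__(self):
        self.calls = 0
        self.state = {}

    def set_constant(self, name, value):
        # One call per individual constant.
        self.calls += 1
        self.state[name] = value

    def update_buffer(self, buffer_name, values):
        # One call replaces every constant in the buffer (bulk update).
        self.calls += 1
        self.state[buffer_name] = dict(values)


def render(device, per_level, frames, use_buffers):
    """Upload parameters at their natural update frequency; return calls."""
    def upload(buffer_name, values):
        if use_buffers:
            device.update_buffer(buffer_name, values)
        else:
            for name, value in values.items():
                device.set_constant(name, value)

    upload("per_level", per_level)                # once per level
    for per_frame, instances in frames:
        upload("per_frame", per_frame)            # once per frame
        for per_instance in instances:
            upload("per_instance", per_instance)  # once per draw
    return device.calls


per_level = {"static_light_pos": (0, 10, 0), "fog_color": (1, 1, 1)}
per_frame = {"camera_pos": (0, 0, -5), "view_proj": "matrix", "time": 0.0}
per_instance = {"world": "matrix", "material": "params"}
frames = [(per_frame, [per_instance] * 3)] * 2    # 2 frames, 3 draws each

calls_individual = render(CountingDevice(), per_level, frames, use_buffers=False)
calls_buffered = render(CountingDevice(), per_level, frames, use_buffers=True)
```

With these made-up parameter sets the buffered path issues one call per buffer update, so the call count depends on how often each group changes rather than on how many constants it holds.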
API/Runtime
• Plumbing for
• Creating/managing objects
• Binding state to pipeline stages
• Restructure for efficiency & flexibility
• Aggregate bits of state into large objects
• More “real” work done per API call
• Group related state together (blend, raster, stencil, depth)
• Guide hardware implementation
Configuring the Pipeline
• IASetVertexBuffers/IASetIndexBuffer
• IASetPrimitiveTopology
• {VS|GS|PS}SetShader
• {VS|GS|PS}SetShaderResources
• {VS|GS|PS}SetConstantBuffers
• {VS|GS|PS}SetSamplers
• SOSetTargets
• RSSetState
• RSSetViewports/RSSetScissorRects
• OMSetRenderTargets
• OMSetBlendState
• OMSetDepthStencilState
Shading Language
• HLSL is the real API?
• Shader programs considered part of art assets!
• Support new instructions (integer, load, …)
• Parameter grouping into constant buffers
• Geometry Shader
• Multiple input vertices, multiple output support (emit & reset)
• Intrinsics for stream output
• Avoid features with large run-time (CPU) cost
• E.g., requiring re-compilation if state changes
Particle System Example
• No CPU intervention
• Particle state in 1D buffer
• Read buffer and rewrite 2nd buffer each pass
• Use GS to add or destroy particles
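The two-buffer update loop can be sketched on the CPU. This is an illustration only: the particle fields, the spawn and lifetime rules, and the function names are hypothetical, and the per-particle function merely mimics a geometry shader that may emit zero, one, or two particles for each input.

```python
def update_particles(src, dt, max_age=2.0, spawn_age=1.0):
    """One pass: read `src` and write a new buffer, never modifying `src`."""
    dst = []
    for p in src:
        age = p["age"] + dt
        if age >= max_age:
            continue  # destroy: emit nothing for this particle
        moved = {"pos": p["pos"] + p["vel"] * dt, "vel": p["vel"], "age": age}
        dst.append(moved)
        if p["age"] < spawn_age <= age:
            # add: the particle crossed spawn_age this pass, emit a child
            dst.append({"pos": moved["pos"], "vel": -p["vel"], "age": 0.0})
    return dst


def simulate(initial, steps, dt):
    """Ping-pong between two buffers, swapping read and write roles."""
    buffers = [list(initial), []]
    for step in range(steps):
        read, write = step % 2, (step + 1) % 2
        buffers[write] = update_particles(buffers[read], dt)
    return buffers[steps % 2]


start = [{"pos": 0.0, "vel": 1.0, "age": 0.0}]
after_two = simulate(start, steps=2, dt=0.5)
after_four = simulate(start, steps=4, dt=0.5)
```

On the GPU the write side would be a stream-out buffer and the next pass would be drawn from it, so the loop runs without the CPU touching particle data.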
Displacement Map Example
• GS extrudes prism at each face [Hirche04]
• PS ray casts against height field
• Shade or discard pixel depending on ray test
Instancing Example
• GS can determine shader, instance and primitive IDs used to index texture array
Sparse Morph Targets
• “Render to VB” updates vertices
• GS uses stretch of triangle to drive wrinkles
Other Ideas Considered
• Programmable Input Assembler
• Unwarranted complexity
• Tessellation
• Complexity too high for this design (deferred)
• Access to color/depth buffer from pixel shader
• Prohibitive performance implications
• Simultaneous read/write access to memory
• Unpredictable results → non-determinism
• Scatter, reduction operations
• Performance vs. determinism issues (deferred)
Results
• State change agility
• State objects, constant buffers, instancing, array resources
• Greater expressiveness & flexibility
• Integer, load, etc. instructions; stream out, flexible memory objects
• Fewer resource constraints
• Huge increase in resources (hardware cost)
• Feature consistency
• Very tight behavioral specification (feature set, arithmetic tolerances)
• 2 optional features (multisampling, 32-bit float texture filtering)
• CPU Offload
• Memory model, geometry shader, stream out, predicated rendering, …
Acknowledgements
Numerous software and hardware companies contributed to the design
ATI
Epic
NCsoft
Autodesk
nVidia
id
SOE
RAD
Intel
Valve
Ubisoft
XSI
S3
Blizzard
Naughty Dog
Discreet
3Dlabs
Ritual
Lucas Arts
Alias
XGI
Crytek
Emogence
DirectX team
PowerVR
Bungie
Lionhead
GameFu
Matrox
Monolith
EA
…
Post Direct3D 10
Direct3D 10.1
• Small improvements for important problems
• Limited to small hardware changes
• More VS→GS inter-stage registers, VS input
• Cube map arrays
• Multi-sample control (patterns, alpha to coverage)
• Better multi-sample color & depth access
• Per-render target blending modes
• API/runtime enhancements for multi-core
• Precision improvements
Future: Addressing GPU Evolution
[Figure: possible directions beyond Direct3D 10+, each marked with a question mark: Physics, GPGPU, Multi-GPU, REYES, Raytracing]
Complexity - Quality
Complexity & Balance
• Increase realism/fidelity in weaker areas
• Complexity inflection points require new techniques
[Figure labels: normalized by “importance”; Visual “Attribute”]
Problems to Solve
• Content Generation
• Create more artwork faster
• 20+ GB of content to be created
• Preserve content investment
• Better Visuals
• Silhouette edges, transparency, antialiasing, texture filtering
• Non-rendering computation
• Physics, animation, morphing…
• Programmability
• Fixed functions vs. programmability
Content Generation
• Tackle two areas – inflection points
• Texture maps
• Currently hand painted, 2K×2K → 4K×4K
• Transition to procedural methods (long term)
• Improve texture management
• Character modeling with detail and deformation
• Currently skinned polygonal models with normal maps
• Transition to deformable subdiv patches with displacement & normal maps
Tessellation
• Primary motivator is amplification of animation/morph targets/deformation models
• Everything stays on GPU if possible
• Displacement mapped surfaces become first class primitives
Displaced Subdivision
Images © Fantasy Lab and Wizards of the Coast
Three-Domain Pipeline
• Patches
• 16 - 24 control points
• “Low frequency” phenomena (animation, vector irradiance?, indirect vector irradiance?)
• Triangles
• 3 vertices
• “Mid frequency” phenomena
• Pixel fragments
• n-samples per pixel
• “High frequency” phenomena (gloss, material roughness)
(Logical) Pipeline Evolution
(Hypothetical!)
[Diagram stages: Input Assembler, Control Point Shader, Tessellator, Vertex Shader, Geometry Shader, Stream out, Setup/Rasterizer, Pixel Shader, Output Merger. Legend: fixed, programmable, memory. Memory resources: Vertex Buffer, Index Buffer, Constant, Sampler, Texture, Stream Buffer, Depth/Stencil, Render Target. Annotation: Spill patch data to memory?]
Tessellation with Displacement
• Integration into art pipeline
• Surface formats (SubD, bi-cubic patches)?
• Approximation?
• How much tessellation?
• Adaptive?
• How does it fit into the logical pipeline?
• New stages? How many?
• Try and keep everything on chip?
• Updating control cage, multi-pass makes sense
• Conversion to other basis – multi-pass?
Displacement Mapping
• Vertex Based
• How much tessellation?
• Interaction with fractional tessellation?
• Are more sophisticated tessellation schemes required?
• Local Ray Tracing
• How inexpensive can you make shaders?
• Interaction with MSAA?
• Shadows?
• Interaction with hierarchical/early Z?
Improving Visual Quality
• Many areas to improve
• Some solved with programmability + performance
• But not all
• Texture filter quality
• Texture compression; e.g., HDR images
• Derivatives
• Order independent transparency
• Antialiasing quality
• Global illumination
• Static/Parameterized GI
• Dynamic GI
• Ray casting/tracing
Transparency and Antialiasing
• Current state of art
• 4 - 8 sample multisample antialiasing
• n + 1 levels of transparency
• Transparency
• Feathered edges (foliage)
• Windshields
• Particles
• Sort transparent objects
• Alpha to coverage for alpha textures (avoid blend)
Transparency and Antialiasing
• Need to do better
• Sorting too expensive
• Must work with multipass algorithms
• E.g., apply shadow maps
• Move sort to hardware
• Track individual pixel fragments
• → A-buffer (cf. R-buffer, F-buffer, T-buffer)
A-Buffer
• Save all fragments and sort
• Memory intensive (≈64 fragments/pixel)
• Tiling to reduce memory constraint?
• Discard overflow fragments?
• Operations on pre-resolved fragments
• Shadow computations, multipass
layers
A-Buffer
• Fixed-function or programmable?
• Sorting/resolve operation
• Overflow handling
• What happens to MSAA, MRT…
• Fragment = attributes + coverage + depth
• “Defer” explosion to samples until resolve-time
• Opportunity to do better antialiasing
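The save-all-fragments-then-sort idea can be sketched per pixel. This is a simplified model under stated assumptions: fragments carry only color, alpha, and depth (coverage masks are left out), the overflow policy of keeping the nearest fragments is one arbitrary answer to the open question on the slide, and `resolve_pixel` is a hypothetical name.

```python
def resolve_pixel(fragments, background, max_fragments=64):
    """Sort one pixel's fragments by depth and composite back to front."""
    # Assumed overflow policy: keep the nearest fragments, drop the rest.
    kept = sorted(fragments, key=lambda f: f["depth"])[:max_fragments]
    color = background
    for frag in reversed(kept):  # farthest fragment first
        a = frag["alpha"]
        color = tuple(a * c + (1.0 - a) * d
                      for c, d in zip(frag["color"], color))
    return color


red = {"color": (1.0, 0.0, 0.0), "alpha": 0.5, "depth": 0.2}
blue = {"color": (0.0, 0.0, 1.0), "alpha": 0.5, "depth": 0.8}
black = (0.0, 0.0, 0.0)

# Submission order does not matter because the resolve step sorts.
result_a = resolve_pixel([red, blue], black)
result_b = resolve_pixel([blue, red], black)
```

Because sorting happens at resolve time, the application no longer has to submit transparent objects in depth order, which is the motivation given for moving the sort to hardware.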
A-Buffer Implications
• Separate opaque and transparent object
processing
• Draw opaque first to cull invisible transparent fragments
• Switch to tiling (chunking) to save memory
• cf. predicated tiling on Xbox 360
• How much memory for fragments (100MB?)
• Exacerbated by larger displays
• Render at reduced resolution and up-sample?
• Opportunity for better resolve filtering
• Filter support larger than a pixel
Non-rendering Computations
• Direct3D 10 enables new computations using
• Additional programmability
• Integer & load instructions
• More general data flow
• Render to vertex buffer
• Animation + skinning
• Solved problem?
• Particle systems, Morphing…
GPU vs. Multicore CPU
• GPU – large flops, memory bandwidth
• Data parallelism, streaming caches
• Multi-core CPU
• Task parallelism, cache locality
• Boundary between the two is fuzzy
• Matrix multiply, sparse matrix x sparse vector
• Convergence?
Programmability
• Direct3D 10 computationally complete?
• Make entire pipeline programmable?
• Some processing more efficient as fixed function
• Set-up, Rasterization
• Hierarchical Z
• Filtering (does it need 32-bit float?)
• Clipping (do you really want to write that code?)
• Orthogonality
• Keep data types/formats independent from algorithms
Programmability
• Every function we remove…
• You may need to add back in shader code
• E.g., suppose we enable alpha-to-coverage in shaders
• Compute coverage mask and output in pixel shader
• Do we keep the fixed-function version?
• If removed, then all (pixel) shaders need to implement alpha-to-coverage
• Developer implements “virtual pipeline” in shaders
• HLSL/FX provides support for implementing “virtual pipelines”
• Can we do more?
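As a rough illustration of what computing a coverage mask in a pixel shader could involve, the sketch below converts an alpha value into a sample mask. It is an assumption-laden toy: real implementations typically vary or dither which samples are set, whereas this version simply sets the lowest bits, and `alpha_to_coverage` is a hypothetical helper name.

```python
def alpha_to_coverage(alpha, num_samples=4):
    """Return a sample mask with about alpha * num_samples bits set."""
    alpha = min(max(alpha, 0.0), 1.0)         # clamp to [0, 1]
    covered = int(alpha * num_samples + 0.5)  # round half up
    return (1 << covered) - 1                 # set the lowest `covered` bits


# Sweep alpha from 0 to 1 and collect the distinct masks produced.
masks = sorted({alpha_to_coverage(i / 100.0) for i in range(101)})
```

With n samples the mask can take only n + 1 distinct population counts, which matches the "n + 1 levels of transparency" figure quoted earlier for alpha to coverage.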
Dynamic Subroutines
• Do dynamic subroutines simplify/solve the problem?
• i.e., shaders with function pointers
• Call overhead must be tiny
• Otherwise, end up inlining and recompiling
• Can I dynamically stack (append) subroutines?
• A→B→C→
• Or do subroutines need to have static call sites (bind points)?
• A0→B
• A1→C
• A2→D
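The static-call-site variant can be sketched with ordinary function tables. Everything here is hypothetical: the three bind points reuse the slide's A0, A1, A2 labels, while the subroutine names and bodies are invented for the example and do not come from any shader model.

```python
from itertools import product

# Hypothetical subroutine choices for three static call sites (bind points).
OPTIONS = {
    "A0": {"lambert": lambda c: c * 0.8, "unlit": lambda c: c},
    "A1": {"fog": lambda c: c * 0.5 + 0.25, "no_fog": lambda c: c},
    "A2": {"gamma": lambda c: c ** 0.5, "linear": lambda c: c},
}


def run_shader(bindings, color):
    """Call whichever subroutine is bound at each call site, in order."""
    for site in ("A0", "A1", "A2"):
        color = OPTIONS[site][bindings[site]](color)
    return color


def count_precompiled_variants():
    # Without dynamic binding, every combination is its own shader.
    return len(list(product(*(OPTIONS[site] for site in OPTIONS))))


def count_subroutines():
    # With dynamic binding, each subroutine is compiled once.
    return sum(len(choices) for choices in OPTIONS.values())


shaded = run_shader({"A0": "unlit", "A1": "fog", "A2": "gamma"}, 0.5)
```

The variant count grows as the product of the choices per call site while the subroutine count grows as their sum, which is the combinatorial explosion that an efficient binding mechanism would avoid.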
Programmability
• Next steps
• Efficient dynamic subroutine mechanism
• Eliminate combinatorial explosions
• Allow shader composition through libraries
• Need efficient dynamic binding
• cf. version 1.0 – Fragment Linker in Direct3D 9
• Generalized data parallel computation
• Neighbor communication?
• Scatter?
• Read-modify-write operations to memory
Summary
• Lots to figure out!
• Better texturing
• Surface Tessellation
• Transparency/Anti-aliasing
• General computation
Acknowledgements
• DirectX group
• David Blythe, Michael Bunnell, Shanon Drone, Sam Glassenberg, Michael Oneppo
• IHVs/ISVs [see earlier slide…]
Questions?
no dates/promises for anything post Direct3D 10