DirectX 9 & Radeon 9700
Performance Optimizations
Richard Huddy
[email protected]
DirectX 9 and Radeon 9700 considerations
• Resources
• Sorting and Clearing
• Vertex Buffers and Index Buffers
• Render States
• How to draw primitives
• Vertex Data
• Vertex Shaders
• Pixel Shaders
• Textures
• Targets (both Z and color)
• Miscellaneous
General resource management
• Create your most important resources first
(that’s targets, shaders, textures, VB’s, IB’s
etc)
• “Most important” is “most frequently used”
• Never call Create in your main loop
– So create the main colour and Z buffers
before you do anything else…
• The “main buffer” is the one through which the largest
number of pixels pass…
Sorting
• Sort roughly front to back
– There’s a staggering amount of hardware
devoted to making this highly efficient
• Sort by vertex shader
…or…
• Sort by pixel shader, or
• Sort by texture
• When you change VS or PS it’s good to go
back to that shader as soon as possible…
• Short shaders are faster^2 when sorted
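The sorting advice above can be sketched as a single composite sort key. This is a minimal illustration, not the presenter's code: the `DrawItem` record and its field names are hypothetical, and the priority chosen here (vertex shader, then pixel shader, then texture, then depth) is one assumption among several reasonable orderings, since the slide offers the shader and texture sorts as alternatives.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical per-draw record; the field names are illustrative only.
struct DrawItem {
    std::uint32_t vertexShaderId;
    std::uint32_t pixelShaderId;
    std::uint32_t textureId;
    float         viewDepth;  // distance from the camera, smaller = nearer
};

// Group draws that share a shader and texture, and within each group
// put nearer objects first (a rough front-to-back order).
inline void sortDrawItems(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return std::tie(a.vertexShaderId, a.pixelShaderId,
                                  a.textureId, a.viewDepth) <
                         std::tie(b.vertexShaderId, b.pixelShaderId,
                                  b.textureId, b.viewDepth);
              });
}
```

Putting depth last keeps shader switches to a minimum at the cost of a coarser front-to-back order; an application that is pixel limited might prefer to bucket by depth first.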
Clearing
• Ideally use Clear once per frame (not less)
– Always clear the whole render target
• Don’t track dirty regions at all
– Always clear colour, Z and stencil together
unless you can just clear Z/stencil
• Most importantly don’t force us to preserve stencil
• Don’t use 2 triangles to clear…
• Using Clear() is the way to get all the fancy
Z buffer hardware working for you
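One way to follow the "clear Z and stencil together" rule is to build the clear flags in one place. The sketch below is an assumption-laden illustration: the constants mirror the values of D3DCLEAR_TARGET, D3DCLEAR_ZBUFFER and D3DCLEAR_STENCIL, but the helper name `clearFlags` and its parameters are hypothetical. The resulting mask would be passed to a single whole-target Clear() call.

```cpp
#include <cstdint>

// Local constants with the same values as D3DCLEAR_TARGET,
// D3DCLEAR_ZBUFFER and D3DCLEAR_STENCIL.
constexpr std::uint32_t kClearTarget  = 0x1;
constexpr std::uint32_t kClearZBuffer = 0x2;
constexpr std::uint32_t kClearStencil = 0x4;

// Z is always cleared; stencil is cleared with it whenever the depth
// format has stencil bits, so the driver never has to preserve stencil.
// Colour is added unless the frame overwrites every pixel anyway.
constexpr std::uint32_t clearFlags(bool depthFormatHasStencil,
                                   bool colourNeedsClear) {
    return kClearZBuffer |
           (depthFormatHasStencil ? kClearStencil : 0u) |
           (colourNeedsClear ? kClearTarget : 0u);
}
```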
Vertex Buffers
• Use the standard DirectX8/9 VB handling
algorithm with NOOVERWRITE etc
• Try to always use DISCARD at the start of
the frame on dynamic VB’s
• Specify write-only whenever possible
• Use the default pool whenever possible
• Roughly 2 – 4 MB for best performance
– This allows large batches
– And gives the driver sufficient granularity
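The NOOVERWRITE/DISCARD pattern mentioned above is commonly implemented as an append cursor over one dynamic buffer. The sketch below only models the bookkeeping: `DynamicBufferCursor`, `LockRequest` and `LockMode` are hypothetical names, and the two modes stand in for the D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE lock flags. It assumes each request is no larger than the buffer.

```cpp
#include <cstddef>

enum class LockMode { Discard, NoOverwrite };

// Describes the lock the caller should perform on the real buffer.
struct LockRequest {
    std::size_t offsetBytes;
    std::size_t sizeBytes;
    LockMode    mode;
};

class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes), next_(0), frameStart_(true) {}

    // Call once per frame so the first lock of the frame discards.
    void beginFrame() { frameStart_ = true; }

    // Append while there is room (NoOverwrite); start again from offset 0
    // with Discard at the start of a frame or when the buffer is full.
    LockRequest reserve(std::size_t sizeBytes) {
        LockMode mode = LockMode::NoOverwrite;
        if (frameStart_ || next_ + sizeBytes > capacity_) {
            mode = LockMode::Discard;
            next_ = 0;
            frameStart_ = false;
        }
        const LockRequest request{next_, sizeBytes, mode};
        next_ += sizeBytes;
        return request;
    }

private:
    std::size_t capacity_;
    std::size_t next_;
    bool        frameStart_;
};
```

With a buffer in the 2 to 4 MB range suggested above, most requests append without a discard, which is what lets the driver keep the GPU busy on earlier data.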
Index Buffers
•
Treat Index Buffers exactly as if they were
vertex buffers – except that you always
choose the smallest element possible
– i.e. Use 32 bit indices only if you need to
– Use 16 bit indices whenever you can
•
All recent ATI hardware treats Index
Buffers as ‘first class citizens’
– They don’t have to be copied about before the
chip gets access
– So keep them out of system memory
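Choosing the smallest index element can be reduced to a check on the vertex count. This is a small sketch with hypothetical names (`IndexFormat`, `chooseIndexFormat`, `indexBufferBytes`); the two enumerators correspond to what Direct3D calls D3DFMT_INDEX16 and D3DFMT_INDEX32.

```cpp
#include <cstddef>
#include <cstdint>

enum class IndexFormat { Index16, Index32 };

// 16-bit indices can address vertices 0..65535, so they suffice for
// meshes of up to 65536 vertices.
constexpr IndexFormat chooseIndexFormat(std::uint32_t vertexCount) {
    return vertexCount <= 65536u ? IndexFormat::Index16
                                 : IndexFormat::Index32;
}

// Size in bytes of an index buffer holding indexCount indices.
constexpr std::size_t indexBufferBytes(std::size_t indexCount,
                                       IndexFormat format) {
    return indexCount * (format == IndexFormat::Index16 ? 2u : 4u);
}
```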
Updating Index and Vertex Buffers
• IBs and VBs which are optimally located
need to be updated with sequential
DWORD writes.
• AGP memory and LVM both benefit from
this treatment…
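A sequential DWORD update might look like the loop below: one 32-bit store per element, in ascending address order, with no reads from the destination. The function name is hypothetical, and `dst` stands in for the pointer returned by a buffer lock; here it can be any writable array.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes 32-bit floats");

// Write count floats to dst as 32-bit words, front to back.
// dst is never read, which matters for write-combined memory.
inline void writeSequentialDwords(std::uint32_t* dst, const float* src,
                                  std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t bits;
        std::memcpy(&bits, &src[i], sizeof bits);  // bit copy, no aliasing issues
        dst[i] = bits;                             // single ascending DWORD store
    }
}
```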
Handling Render States
• Prefer minimal state blocks
– ‘minimal’ means you should weed out any
redundant state changes where possible
• If 5% of state changes are redundant that’s OK
• If 50% are redundant then get it fixed!
• The expensive state changes:
– Switching between VS and FF
– Switching Vertex Shader
– Changing Texture
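A simple way to weed out redundant state changes, and to measure how many there are, is a shadow copy of the last value set for each state. The class below is a sketch under that assumption; `RenderStateFilter` and its methods are hypothetical names, and states and values are modelled as plain 32-bit integers.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

class RenderStateFilter {
public:
    // Returns true when the change must be forwarded to the device and
    // false when the state already holds this value.
    bool shouldApply(std::uint32_t state, std::uint32_t value) {
        ++total_;
        const auto it = current_.find(state);
        if (it != current_.end() && it->second == value) {
            ++redundant_;
            return false;
        }
        current_[state] = value;
        return true;
    }

    // Fraction of requested changes that were redundant (0.0 to 1.0).
    double redundantFraction() const {
        return total_ == 0 ? 0.0
                           : static_cast<double>(redundant_) /
                                 static_cast<double>(total_);
    }

private:
    std::unordered_map<std::uint32_t, std::uint32_t> current_;
    std::size_t total_ = 0;
    std::size_t redundant_ = 0;
};
```

Logging `redundantFraction()` once per frame gives a direct reading against the 5% and 50% thresholds quoted above.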
How to draw primitives
• DrawIndexedPrimitive( strip or list )
– Indexing is a big win on real world data
– Long strips beat everything else
– Use lists if you would have to add large
numbers of degenerate polys to stick with
strips (more than ~20% means use lists)
– Make sure your VB’s and IB’s are in optimal
memory for best performance
– Give the card hundreds of polys per call
• Small batches kill performance
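The strip-versus-list rule of thumb can be written as a small decision function. The names below are hypothetical, and the sketch assumes the ~20% figure is measured as degenerate triangles over all triangles submitted in the strip, which the slide does not spell out.

```cpp
#include <cstddef>

enum class PrimitiveLayout { Strip, List };

// realTris: triangles that are actually visible geometry.
// degenerateTris: zero-area triangles added only to stitch strips together.
inline PrimitiveLayout choosePrimitiveLayout(std::size_t realTris,
                                             std::size_t degenerateTris) {
    const std::size_t total = realTris + degenerateTris;
    if (total == 0) return PrimitiveLayout::List;
    // "More than ~20% degenerate" is the same as degenerate * 5 > total.
    return degenerateTris * 5 > total ? PrimitiveLayout::List
                                      : PrimitiveLayout::Strip;
}

// Indices needed: a single strip of T triangles uses T + 2 indices,
// a list uses 3 per real triangle.
inline std::size_t indexCount(PrimitiveLayout layout, std::size_t realTris,
                              std::size_t degenerateTris) {
    const std::size_t total = realTris + degenerateTris;
    if (total == 0) return 0;
    return layout == PrimitiveLayout::Strip ? total + 2 : realTris * 3;
}
```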
Vertex data
•
Don’t scatter it around
– Fewer streams give better cache behaviour
•
Compress it if you can
– 16 bits or less per component
– Even if it costs you 1 or 2 ops in the shader…
•
Try to avoid spilling into AGP
– Because AGP has high latency
•
pow2 sizes help – 32 bytes is best
– Work the cache on the GPU
•
Avoid random access patterns where possible by
reordering vertex data before the main loop…
– That’s at app start up or at authoring time
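As an illustration of the compression and size advice above, here is one possible 32-byte vertex with 16-bit normals and texture coordinates. The layout, the field names and the signed-normalised encoding are all assumptions made for this sketch; the shader would spend an op or two rescaling the 16-bit values.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Map a float in [-1, 1] to a signed 16-bit integer (values outside the
// range are clamped first).
inline std::int16_t quantizeSnorm16(float v) {
    const float clamped = std::max(-1.0f, std::min(1.0f, v));
    return static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
}

// Hypothetical single-stream vertex, exactly 32 bytes.
struct PackedVertex {
    float         position[3];  // 12 bytes, full precision
    std::int16_t  normal[4];    //  8 bytes, w unused
    std::int16_t  uv0[2];       //  4 bytes
    std::int16_t  uv1[2];       //  4 bytes
    std::uint32_t colour;       //  4 bytes, packed 8:8:8:8
};

static_assert(sizeof(PackedVertex) == 32,
              "vertex should stay at the 32-byte size");
```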
Compiling and Linking shaders
•
Do this all “up front”
– It may not be obvious to you - but you have to
actually use a shader to force its complete
instantiation in DirectX 9
– So, if you’re not careful you may get linking
happening in your main loop
– And linking may be time consuming
– Draw a little of everything before you start for
real. Think of this as priming the caches…
Vertex shaders I
• Shorter shaders are faster – no surprises here…
• Avoid all unnecessary writes
– This includes the output registers of the VS
– So use the write masks aggressively
– Pack constants as much as possible
– Prefer locality of reference on constants too…
• Be aware of the expansion of macros but prefer
them anyway if they match exactly what you want
• Pack your shader constant updates
• You should optimise the algorithm and leave the
object-code optimisation to the driver/runtime
Vertex shaders II
• Branches and conditionals are fast so use them
aggressively
– That’s not like the CPU where branches are slow…
– Longer shaders allow better batching
•
Shorter shaders are also more cache friendly
– i.e. it’s usually faster to switch to the previous shader
than to any other
– But the shorter your shaders are…
– …the more of them fit into the cache.
Vertex shaders III
• API Change:
– Now you don’t “mov” to the address register, you use
“mova”
– And this performs round to nearest, not floor
– And now A0 is a 4d register
• A0.x, A0.y, A0.z, A0.w
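The change from floor to round-to-nearest matters when porting shaders that compute a constant index in a float register. The two helpers below model the difference on the CPU; the function names are hypothetical, and `std::lround` resolves exact .5 ties away from zero, which is an assumption here since the slide does not say how the hardware treats ties.

```cpp
#include <cmath>

// Old behaviour: "mov a0.x, r" takes the floor of the source value.
inline int addressViaMov(float value) {
    return static_cast<int>(std::floor(value));
}

// New behaviour: "mova a0.x, r" rounds to the nearest integer.
inline int addressViaMova(float value) {
    return static_cast<int>(std::lround(value));
}
```

For a value such as 1.7 the two disagree (1 versus 2), so an index computed with a fractional part in mind can select a different constant after the port.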
Pixel shaders I
• API change to accommodate MET’s:
– You now have to explicitly write to oC0, oC1,
oC2 and oC3 to set the output colour
– And the write has to be with a mov instruction
– If you write to oC[n] you must write to all
elements from oC[0] to oC[n-1]
• i.e. Writes must be contiguous starting at oC0
• But the writes can happen in any order
• You can also write to oDepth to update the
Z buffer but note that this kills the early Z
cull… (this replaces ps1.3 texdepth)
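The contiguity rule for colour outputs can be checked with a bit mask, for example in a tool that validates generated shaders. This is a sketch with a hypothetical function name; it assumes bit n of the mask is set when the shader writes oCn, and that at least oC0 must be written.

```cpp
#include <cstdint>

// True when the written outputs form an unbroken run oC0..oCk (k <= 3).
// Such masks look like 0001, 0011, 0111 or 1111 in binary, so adding 1
// gives a single higher bit that shares no bits with the mask.
constexpr bool colourOutputsAreContiguous(std::uint32_t writtenMask) {
    return writtenMask != 0u && writtenMask <= 0xFu &&
           (writtenMask & (writtenMask + 1u)) == 0u;
}
```

The order of the writes is irrelevant to the mask, which matches the note above that writes can happen in any order.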
Pixel shaders II
• Shorter is much faster
– It’s much easier to be pixel limited than vertex
limited
– Short shaders are more cache friendly
– Be aggressive with write masks
– Think dual-issue (“+”) even though it’s gone
from the API (so split colour and alpha out)
• Generally prefer to spend cycles on shader
ops rather than using texture lookups
– Because memory latency is the enemy here
Pixel shaders III
• Dual issue?
– But that’s not in the 2.0 shader spec…
– But remember that DX9 hardware like the
Radeon 9700 has to run DirectX 8 apps very
fast indeed
– And that means it has dual issue hardware
ready for you to use
Pixel shaders IV
• Example: Diffuse + specular lighting
…
dp3 r0, r1, r0 // N.H
dp3 r2, r1, r2 // N.L
mul r2, r2, r3 // * color
mul r2, r2, r4 // * texture
mul r0.r, r0.r, r0.r // spec^2
mul r0.r, r0.r, r0.r // spec^4
mul r0.r, r0.r, r0.r // spec^8
mad r0.rgb, r0.r, r5, r2
…
Total: 8 instructions
…
dp3 r0, r1, r0 // N.H
dp3 r2.r, r1, r2 // N.L
mul r6.a, r0.r, r0.r // spec^2
mul r2.rgb, r2.r, r3 // * color
mul r6.a, r6.a, r6.a // spec^4
mul r2.rgb, r2, r4 // * texture
mul r6.a, r6.a, r6.a // spec^8
mad r0.rgb, r6.a, r5, r2
…
Optimized to 5 “DI” instructions
Pixel shaders V
• Texture instructions
– Avoid TEXDEPTH to retain the early Z-reject
– If you do choose to use TEXKILL then use it
as early as possible. [But, the positioning of
TEXKILL within texture loading code is
unimportant]
• Register usage
– Minimize total number of registers used
– No problems with dependency
Vertex and Pixel shaders
• If you’re fed up with writing assembler, and
don’t feel excited by the opportunity to
code 256 VS ops and 96 PS ops then…
• …maybe you should consider HLSL?
• In most cases it is as good as hand written
assembler
• And much faster to author…
– Perfect for prototyping
– And for release code where you use D3DX
Textures I
• API addition
– SetSamplerState()
– Handles the now-decoupled texture sampler
setup.
– You may now freely mix and match texture
coordinates with texture samplers to fetch
texels in arbitrary ways
• Texture coordinates are now just iterated floats
• Samplers handle clamp, wrap, bias and filter modes
– You have 8 texture coordinates
– And 16 texture samplers
• texld r11, t7, s15 (all register numbers are max)
Textures II
• Use compressed textures
– Do you need a good compressor?
• Use smaller textures
• Use 16 bit textures in preference to 32 bit
• Use textures with few components
– Use an L8 or A8 format if that’s what you want
• Pack textures together
– e.g. If you’re using two 2D textures then
consider using a single RGBA texture
• Texture performance is bandwidth limited
Textures III
• Filtering modes
– Use trilinear filtering to improve texture cache
coherency
– Only use anisotropic or tri-linear filtering when
they make sense - they are more expensive
– Avoid using anisotropic filtering with
bumpmapping
– Avoid using tri-linear anisotropic filtering
unless the quality win justifies it
– More costly filtering is more affordable with
longer pixel shaders
Targets
• Always clear the whole of the target
• Present():
– WASSTILLDRAWING makes a comeback
– Please use it!
– Because using it properly will gain you CPU
cycles - and that’s typically your scarcest
resource
Depth Buffer I
• Never lock depth buffers
• Clearing depth buffers
– Clear the whole surface
– When stencil is present clear both depth and
stencil simultaneously
• If possible disable depth buffering when
alpha blending (e.g. drawing HUD’s)
• Use as few depth buffers as possible…
– i.e. re-use them across multiple render
targets
Depth Buffer II
• Efficiently use Hyper-Z
– Render front to back
– Make Znear, Zfar close to active depth range
of the scene
– The EQUAL and NOT EQUAL depth tests
require exact compares which kill the early Z
comparisons. Avoid them!
Occlusion query
• New to DirectX 9
– In GL you have HP_occlusion_query and
NV_occlusion_query to avoid the need for locks
• Not free, but much cheaper than Lock()
• Supported on all ATI hardware since the
Radeon 8500
• CreateQuery(OCCLUSION, ppQuery)
• Issue(Begin/End)
• GetData() returns S_OK to signal completion but please don’t spin waiting for the answer…
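The "don't spin" advice amounts to polling the query once per frame and reusing the previous answer while the result is still pending. The sketch below models that with a mock object so it runs without Direct3D: `MockOcclusionQuery`, `tryGetData` and `VisibilityTracker` are hypothetical names. In real code the poll would be a GetData() call on the query, which reports that the result is not yet available until the GPU has finished.

```cpp
#include <cstdint>

// Stand-in for a hardware query: becomes ready after a few polls.
struct MockOcclusionQuery {
    int           pollsUntilReady;
    std::uint32_t visiblePixels;

    // Non-blocking: returns false while the result is still pending.
    bool tryGetData(std::uint32_t& out) {
        if (pollsUntilReady > 0) {
            --pollsUntilReady;
            return false;
        }
        out = visiblePixels;
        return true;
    }
};

// Polls once per call and never waits; until an answer arrives the
// object is conservatively treated as visible.
struct VisibilityTracker {
    bool lastKnownVisible = true;

    template <typename Query>
    bool update(Query& query) {
        std::uint32_t pixels = 0;
        if (query.tryGetData(pixels)) {
            lastKnownVisible = pixels > 0;
        }
        return lastKnownVisible;
    }
};
```

The cost of this approach is that visibility decisions lag by a frame or more, which is usually acceptable compared with stalling the CPU.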
AGP 8X
• Is fast at ~2GB per second
• But has high latency compared to LVM
• And is 10 times slower than LVM
• Radeon 9700 has up to 20GB per sec of
bandwidth available when talking to LVM
– (LVM = Local Video Memory)
User clip planes
•
User clip planes are much more efficient than
texkill because:
1. They insert a per-vertex test, rather than a per-pixel
test, and vertices are typically fewer in number than
pixels
2. It’s important always to kill data at the earliest stage
possible in the pipeline
• Plus, clipping is essentially a geometric
operation
• All hardware which supports ps1.4 supports
user clip planes in hardware
Sky box. First or last?
•
Draw it last because:
– That’s a rough front to back sort
– In this case you know that most sky pixels will fail
the Z test.
•
Draw it first because:
– That way you don’t need any Z tests
– In this case you know that most sky pixels would
pass the Z test
So, here is our target:
• DX9 style mainstream graphics (per frame):
– > 500K triangles
– < 500 DrawIndexedPrimitive() calls
– < 500 VertexBuffer switches
– < 200 different textures
– < 200 State change groups
– Few calls to SetRenderTarget - aim for 0 to 4...
– 1 pass per poly is typical, but 2 is sometimes smart
– Runs at monitor refresh rate
– Which gives more than 40 million polys per second
• And everything goes through the programmable pipeline
– No occurrences of Lock(0), DrawPrimitive(),
DPUP()
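The arithmetic behind these targets can be checked directly. The helpers below are a sketch with hypothetical names; the 85 Hz refresh rate used in the usage note is an assumption (the slide only says "monitor refresh rate"), chosen because 500K triangles per frame needs at least 80 Hz to exceed 40 million per second.

```cpp
// Triangles per second for a given per-frame budget and refresh rate.
constexpr long long trianglesPerSecond(long long trianglesPerFrame,
                                       int refreshHz) {
    return trianglesPerFrame * refreshHz;
}

// Average batch size implied by a per-frame triangle and draw-call budget.
constexpr long long trianglesPerDrawCall(long long trianglesPerFrame,
                                         int drawCalls) {
    return trianglesPerFrame / drawCalls;
}
```

At 500,000 triangles and 85 Hz this gives 42,500,000 triangles per second, and 500,000 triangles over 500 calls averages 1,000 triangles per call, consistent with the earlier advice to give the card hundreds of polys per call.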
Questions…?
Richard Huddy
[email protected]