Kenneth Hurley
Sr. Software Engineer
[email protected]
What are the problems we are seeing when 3D engines are written?
• Misuse of Vertex Buffers
• Concurrency Limitations
• Frame Rate Limiters
• Non-Optimized Surface Usage
• Cache Misses
• Data Ordering
NVIDIA Corporation
Misuse of Vertex Buffers
• Bad things can happen unless you know the “right” way to use a vertex buffer
  • Dynamic vertex buffers vs. static vertex buffers
    • When creating the vertex buffer, use D3DVBCAPS_WRITEONLY
    • Use D3DLOCK_DISCARDCONTENTS
    • Use D3DLOCK_NOOVERWRITE
  • Vertex buffer ordering
    • Use ordered vertex buffers because of cache coherency
Using Vertex Buffers Correctly
Example vertex buffer flow
• CreateVB(WRITEONLY, 1000-12000)
• A: I = 0
• B: Space in VB for M vertices?
  • Yes: Lock(NOOVERWRITE)
  • No: GOTO C
• Fill in M vertices at index I
• Unlock(); DIPVB(I); I += M; GOTO B;
• C: Lock(DISCARDCONTENTS); GOTO A
[Flowchart: Create Vertex Buffer from 1000-12000 → Index = 0 → Room in Vertex Buffer? → No: Lock(Discard Contents) / Yes: Lock(No Overwrite) → Store Vertices at Index → Unlock Vertex Buffer / DrawIndexedPrimVB(Index, length) → Index += Number of Vertices]
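The flow above can be sketched without any Direct3D calls as a small cursor that decides, per batch, where to write and which lock flag to request. This is only an illustration under stated assumptions: `DynamicVBCursor`, `BatchSlot`, and `LockMode` are hypothetical names, not part of the D3D API, and the sketch folds the slide's "GOTO A" into the discard branch by restarting the write position at index 0.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the two lock flags named on the slide.
enum class LockMode { NoOverwrite, DiscardContents };

// Where a batch should be written and how the buffer should be locked.
struct BatchSlot {
    LockMode mode;
    std::size_t firstVertex;
};

// Tracks the write position (I on the slide) in a write-only dynamic vertex buffer.
class DynamicVBCursor {
public:
    explicit DynamicVBCursor(std::size_t capacity) : capacity_(capacity), next_(0) {
        assert(capacity > 0);
    }

    // Reserve room for `count` vertices (M on the slide).
    BatchSlot reserve(std::size_t count) {
        assert(count > 0 && count <= capacity_);
        if (next_ + count <= capacity_) {
            // Step B, "yes" branch: append behind data the GPU may still be reading.
            BatchSlot slot{LockMode::NoOverwrite, next_};
            next_ += count;
            return slot;
        }
        // Step C: no room left, so request fresh memory and restart at index 0.
        next_ = count;
        return BatchSlot{LockMode::DiscardContents, 0};
    }

private:
    std::size_t capacity_;  // total vertices the buffer can hold
    std::size_t next_;      // first free vertex index
};
```

In a real renderer the returned mode would presumably map to D3DLOCK_NOOVERWRITE or D3DLOCK_DISCARDCONTENTS when locking, and `firstVertex` would be the start index passed to the draw call.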
Concurrency
• Why do I need it?
  • Concurrency helps parallelism between the CPU and the GPU.
• OK, how do I achieve it?
  • Use NVPAT to see if a “spin lock” is happening.
    • “Spin locks” occur when the driver has to stall waiting for the hardware to finish with an object
    • These objects can be vertex buffers or texture surfaces
Concurrency (cont.)
• Use the vertex buffer and texture surface flags so
the driver can give you another buffer while the
hardware is using the other one.
Frame Rate Limiters
• Can cause concurrency issues
• There are better ways to achieve constant frame rates
• They make the effective triangle rate much lower, because the driver has to do some work with the vertex data.
Frame Rate Limiter Problem
Serialization of code loop
[Diagram boxes, in order: Physics, Artificial Intelligence, Culling, Submit Triangles, Wait for Desired Frame Rate, T&L/GPU Rasterization]
Rescheduled for concurrency
[Diagram boxes, in order: Physics Frame 1, Artificial Intelligence Frame 1, Culling Frame 1, Submit Triangles Frame 1, Physics Frame 2, Artificial Intelligence Frame 2, T&L/GPU Rasterization, Culling Frame 2, Wait for Desired Frame Rate]
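A rough way to see why the rescheduled loop helps is a toy steady-state timing model. This is a sketch under loud assumptions: the CPU stages are lumped into one cost, the overlap between CPU and GPU is assumed perfect, and `FrameCosts`, `serializedFrameMs`, and `overlappedFrameMs` are hypothetical names introduced for illustration only.

```cpp
#include <algorithm>
#include <cassert>

// Per-frame costs in milliseconds (illustrative inputs, not measurements).
struct FrameCosts {
    double cpuMs;  // physics + AI + culling + triangle submission
    double gpuMs;  // T&L and rasterization
};

// Serialized loop: GPU work starts only after the CPU is done,
// and the next frame's CPU work starts only after the wait.
double serializedFrameMs(const FrameCosts& c, double targetMs) {
    assert(c.cpuMs >= 0.0 && c.gpuMs >= 0.0 && targetMs >= 0.0);
    return std::max(c.cpuMs + c.gpuMs, targetMs);
}

// Rescheduled loop: CPU work for the next frame runs while the GPU
// draws the current one, so the slower of the two sets the pace.
double overlappedFrameMs(const FrameCosts& c, double targetMs) {
    assert(c.cpuMs >= 0.0 && c.gpuMs >= 0.0 && targetMs >= 0.0);
    return std::max(std::max(c.cpuMs, c.gpuMs), targetMs);
}
```

With 10 ms of CPU work, 12 ms of GPU work, and a 16 ms target, this model gives 22 ms per frame for the serialized loop and 16 ms for the overlapped one.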
Non-Optimized Surface Usage
• Locking a texture before the GPU is finished with it causes concurrency problems by stalling the CPU inside the driver.
• Typical examples include locking the back buffer to do 2D operations on it
• The best solution is to use two screen-aligned triangles (a quad) instead and put them directly in the 3D pipeline
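As an illustration of the quad approach, the sketch below builds the six vertices of two screen-aligned triangles in pixel coordinates. The `ScreenVertex` layout (pre-transformed position, reciprocal w, one texture coordinate set) and the `makeScreenQuad` helper are assumptions made for this example; the slide does not specify a vertex format, and any texel-to-pixel alignment offset is left out.

```cpp
#include <array>
#include <cassert>

// Assumed pre-transformed vertex: screen-space position, reciprocal
// homogeneous w, and one set of texture coordinates.
struct ScreenVertex {
    float x, y, z, rhw;
    float u, v;
};

// Build two triangles covering a width x height pixel rectangle whose
// top-left corner is at the screen origin. Both triangles wind
// clockwise as seen on screen (y grows downward).
std::array<ScreenVertex, 6> makeScreenQuad(float width, float height) {
    assert(width > 0.0f && height > 0.0f);
    const ScreenVertex topLeft{0.0f, 0.0f, 0.0f, 1.0f, 0.0f, 0.0f};
    const ScreenVertex topRight{width, 0.0f, 0.0f, 1.0f, 1.0f, 0.0f};
    const ScreenVertex bottomLeft{0.0f, height, 0.0f, 1.0f, 0.0f, 1.0f};
    const ScreenVertex bottomRight{width, height, 0.0f, 1.0f, 1.0f, 1.0f};
    return {{topLeft, topRight, bottomLeft, bottomLeft, topRight, bottomRight}};
}
```

The 2D image would then be bound as a texture and the six vertices drawn as a triangle list, so the work stays in the 3D pipeline instead of forcing a back buffer lock.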
Cache Misses
• Big slowdowns can occur here
• CPU cache misses can occur because of the ordering of vertex data. Check these carefully with VTune.
• The GPU also has a vertex cache. GeForce has a 16-entry cache, but optimal cache use is 10 entries, because 6 triangles can be “in flight” at any given time.
• GPU vertex cache statistics will be added to NVPAT.
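Until such statistics are available from a tool, the effect of index ordering on a vertex cache can be estimated offline. The sketch below counts misses for an index list against a cache of a given size. It assumes a simple FIFO replacement policy, which the slides do not state, and `countVertexCacheMisses` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Count how many indexed vertices miss a vertex cache of `cacheSize`
// entries. Assumption: the cache evicts in FIFO order.
std::size_t countVertexCacheMisses(const std::vector<std::uint16_t>& indices,
                                   std::size_t cacheSize) {
    assert(cacheSize > 0);
    std::deque<std::uint16_t> cache;
    std::size_t misses = 0;
    for (std::uint16_t index : indices) {
        if (std::find(cache.begin(), cache.end(), index) != cache.end()) {
            continue;  // hit: the vertex is already cached
        }
        ++misses;
        if (cache.size() == cacheSize) {
            cache.pop_front();  // evict the oldest entry
        }
        cache.push_back(index);
    }
    return misses;
}
```

Running an index list through this with a cache size of 10 (the effective size quoted above) and comparing the miss count before and after reordering gives a rough measure of how cache-friendly the ordering is.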
Vertex Ordering
• For best performance, also arrange vertex data and vertex indices in sequential order. This helps both the CPU and the GPU
• Out-of-order vertices make the CPU miss the cache more often
• They do the same thing to the GPU
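One common way to get sequential ordering is to rewrite the vertex array in the order the index list first touches each vertex, then remap the indices to match. The sketch below shows that idea; it is not taken from the slides, `reorderByFirstUse` and `ReorderedMesh` are hypothetical names, and vertices never referenced by an index are dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Vertices permuted into first-use order, with indices rewritten to match.
template <typename Vertex>
struct ReorderedMesh {
    std::vector<Vertex> vertices;
    std::vector<std::uint32_t> indices;
};

// Walk the index list once; the first time a vertex is referenced it is
// appended to the output, so vertex memory is read mostly front to back.
template <typename Vertex>
ReorderedMesh<Vertex> reorderByFirstUse(const std::vector<Vertex>& vertices,
                                        const std::vector<std::uint32_t>& indices) {
    const std::uint32_t unassigned = UINT32_MAX;
    std::vector<std::uint32_t> remap(vertices.size(), unassigned);
    ReorderedMesh<Vertex> out;
    out.vertices.reserve(vertices.size());
    out.indices.reserve(indices.size());
    for (std::uint32_t oldIndex : indices) {
        assert(oldIndex < vertices.size());
        if (remap[oldIndex] == unassigned) {
            remap[oldIndex] = static_cast<std::uint32_t>(out.vertices.size());
            out.vertices.push_back(vertices[oldIndex]);
        }
        out.indices.push_back(remap[oldIndex]);
    }
    return out;
}
```

This only fixes the vertex order for a given triangle order; choosing a triangle order that reuses recent vertices is a separate step.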
How do we solve these problems?
• VTune
• GPT
• NVPAT
VTune 4.5
• Will help you optimize your application for the CPU
• Works well in conjunction with NVPAT
• I personally use the Time-Based Sampling Wizard
• VTune is excellent for application-specific analysis
  • It doesn’t show where in the driver time is spent, unless you have symbols for the driver. You almost certainly don’t have driver symbols.
VTune 4.5
• Flare Application
GPT 3.5
• Excellent tool to help you achieve maximum performance.
• Works on both D3D and OpenGL
• Helps with application API slowdowns
• Works well in conjunction with VTune and NVPAT.
• GPT is excellent for application to Direct3D/OpenGL analysis.
  • It still can’t tell you what is occurring inside the driver that may be slowing your application down
GPT 3.5 (cont)
• Quad view for visual analysis modes
View of alien world in Half-Life*
NVPAT 1.07
• Analyze interaction with driver
• Works on NVIDIA hardware only
• Windows 98/Windows 2000 capable
• Hotkey capable
• Online help via F1 function key
• Logging
• Frame Rate Display
• Natural Extension to VTune and GPT
NVPAT 1.07
• Demo: Flare vs. NewFlare
• NVPAT is available free at http://www.nvidia.com/Marketing/Developer/SwDevStaticPages.nsf/pages/StatsDriver
• You must be a registered NVIDIA developer
VTune DLL SDK
• Soon, all these performance tools should be integrated into VTune using the DLL SDK
• NVPAT will be integrated into the VTune DLL SDK
• VTune DLL SDK is available from Intel and gives you the ability to integrate performance tools into VTune. http://developer.intel.com/vtune/analyzer/vtperfdll
• Common User Interface/API means less to learn for developers
Action Items
• Profile often and early in the process
• Use the tools available to you
• Some are free, the rest are reasonable
• Architect engine with concurrency in mind
• Ask for enhancements from your tool vendor
Questions?
• Comments/Suggestions?
• Enhancement requests for NVPAT can be sent to
[email protected]