Common C++ Performance Mistakes in Games

Download Report

Transcript Common C++ Performance Mistakes in Games

Common C++ Performance
Mistakes in Games
Pete Isensee
Xbox Advanced Technology Group
About the Data
• ATG reviews code to find bottlenecks
and make perf recommendations
– 50 titles per year
– 96% use C++
– 1 in 3 use “advanced” features like
templates or generics
Why This Talk Is Important
• The majority of Xbox games are CPU
bound
• The CPU bottleneck is often a language
or C++ library issue
• These issues are not usually specific to
the platform
Format
•
•
•
•
Definition of the problem
Examples
Recommendation
For reference
– A frame is 17 or 33 ms (60fps / 30fps)
– Bottlenecks given in ms per frame
Issue: STL
• Game using std::list
– Adding ~20,000 objects every frame
– Rebuilding the list every frame
– Time spent: 6.5 ms/frame!
– ~156K overhead (2 pointers per node)
– Objects spread all over the heap
std::set and map
•
•
•
•
Many games use set/map as sorted lists
Inserts are slow (log(N))
Memory overhead: 3 ptrs + color
Worst case in game: 3.8 ms/frame
std::vector
• Hundreds of push_back()s per frame
• VS7.1 expands vector by 50%
• Question: How many reallocations for
100 push_back()s?
• Answer: 13!
(1,2,3,4,5,7,10,14,20,29,43,64,95)
Clearly, the STL is Evil
Use the Right Tool for the Job
•
•
•
•
The STL is powerful, but it’s not free
Filling any container is expensive
Be aware of container overhead
Be aware of heap fragmentation and
cache coherency
• Prefer vector, vector::reserve()
The STL is Evil, Sometimes
• The STL doesn’t solve every problem
• The STL solves some problems poorly
• Sometimes good old C-arrays are the
perfect container
• Mike Abrash puts it well:
– “The best optimizer is between your ears”
Issue: NIH Syndrome
• Example: Custom binary tree
– Sorted list of transparent objects
– Badly unbalanced
– 1 ms/frame to add only 400 items
• Example: Custom dynamic array class
– Poorer performance than std::vector
– Fewer features
Optimizations that Aren’t
void appMemcpy( void* d, const void* s, size_t b )
{
// lots of assembly code here ...
}
appMemcpy( pDest, pSrc, 100 ); // bottleneck
• appMemcpy was slower than memcpy
for anything under 64K
Invent Only What You Need
• std::set/map more efficient than the
custom tree by 10X
– Tested and proven
– Still high overhead
• An even better solution
– Unsorted vector or array
– Sort once
– 20X improvement
Profile
• Run your profiler
– Rinse. Repeat.
– Prove the improvement.
• Don’t rewrite the C runtime or STL just
because you can. There are more
interesting places to spend your time.
Issue: Tool Knowledge
• If you’re a programmer, you use C/C++
every day
• C++ is complex
• CRT and STL libraries are complex
• The complexities matter
• Sometimes they really matter
vector::clear
• Game reused global vector in frame loop
• clear() called every frame to empty the vector
• C++ Standard
– clear() erases all elements (size() goes to 0)
– No mention of what happens to vector capacity
• On VS7.1/Dinkumware, frees the memory
• Every frame reallocated memory
Zero-Initialization
struct Array { int x[1000]; };
struct Container {
Array arr;
Container() : arr() { }
};
Container x; // bottleneck
• Costing 3.5 ms/frame
• Removing : arr() speeds this by 20X
Know Thine Holy Standard
• Use resize(0) to reduce container size
without affecting capacity
• T() means zero-initialize PODs. Don’t
use T() unless you mean it.
• Get a copy of the C++ Standard. Really.
– www.techstreet.com; search on 14882
– Only $18 for the PDF
Issue: C Runtime
void BuildScore( char* s, int n )
{
if( n > 0 )
sprintf( s, “%d”, n );
else
sprintf( s, “” );
}
• n was often zero
• sprintf was a hotspot
qsort
• Sorting is important in games
• qsort is not an ideal sorting function
– No type safety
– Comparison function call overhead
– No opportunity for compiler inlining
• There are faster options
Clearly, the CRT is Evil
Understand Your Options
• itoa() can replace sprintf( s, “%d”, n )
• *s = ‘\0’ can replace sprintf( s, “” )
• std::sort can replace qsort
– Type safe
– Comparison can be inlined
• Other sorting options can be even
faster: partial_sort, partition
Issue: Function Calls
•
•
•
•
50,000-100,000 calls/frame is normal
At 60Hz, Xbox has 12.2M cycles/frame
Function call/return averages 20 cycles
A game calling 61,000 functions/frame
spends 10% CPU (1.7 ms/frame) in
function call overhead
Extreme Function-ality
• 120,000 functions/frame
• 140,000 functions/frame
• 130,000 calls to a single function/frame
(ColumnVec<3,float>::operator[])
• And the winner:
– 340,000 calls per frame!
– 9 ms/frame of call overhead
Beware Elegance
• Elegance → levels of indirection →
more functions → perf impact
• Use algorithmic solutions first
– One pass through the world
– Better object rejection
– Do AI/physics/networking less often than
once/frame
Inline Judiciously
• Remember: inline is a suggestion
• Try “inline any suitable” compiler option
– 15 to 20 fps
– 68,000 calls down to 47,000
• Try __forceinline or similar keyword
– Adding to 5 funcs shaved 1.5 ms/frame
• Don’t over-inline
Issue: for loops
// Example 1: Copy indices to push buffer
for( DWORD i = 0; i < dwIndexCnt; ++i )
*pPushBuffer++ = arrIndices[ i ];
// Example 2: Initialize vector array
for( DWORD i = 0; i < dwMax; ++i )
mVectorArr[i] = XGVECTOR4(0,0,0,0);
// Example 3: Process items in world
for( itr i = c.begin(); i < c.end(); ++i )
Process( *i );
Watch Out For For
• Never copy/clear a POD with a for loop
• std::algorithms are optimized; use them
memcpy( pPushBuffer, arrIndices,
dwIndexCnt * sizeof(DWORD) );
memset( mVectorArr, 0, dwMax * sizeof(XGVECTOR4) );
for_each( c.begin(), c.end(), Process );
Issue: Exception Handling
•
•
•
•
Most games never throw
Most games never catch
Yet, most games enable EH
EH adds code to do stack unwinding
– A little bit of overhead to a lot of code
– 10% size increase is common
– 2 ms/frame in worst case
Disable Exception Handling
• Don’t throw or catch exceptions
• Turn off the C++ EH compiler option
• For Dinkumware STL
– Define “_HAS_EXCEPTIONS=0”
– Write empty _Throw and _Raise_handler; see
stdthrow.cpp and raisehan.cpp in crt folder
– Add #pragma warning(disable: 4530)
Issue: Strings
• Programmers love strings
• Love hurts
• ~7000 calls to stricmp in frame loop
– 1.5 ms/frame
• Binary search of a string table
– 2 ms/frame
Avoid strings
• String comparisons don’t belong in the
frame loop
• Put strings in an table and compare
indices
• At least optimize the comparison
– Compare pointers only
– Prefer strcmp to stricmp
Issue: Memory Allocation
• Memory overhead
– Xbox granularity/overhead is 16/16 bytes
– Overhead alone is often 1+ MB
• Too many allocations
– Games commonly do thousands of
allocations per frame
– Cost: 1-5 ms/frame
Hidden Allocations
• push_back(), insert() and friends
typically allocate memory
• String constructors allocate
• Init-style calls often allocate
• Temporary objects, particularly string
constants that convert to string objects
Minimize Per-Frame Allocations
• Use memory-friendly data structures, e.g.
arrays, vectors
• Reserve memory in advance
• Use custom allocators
– Pool same-size allocations in a single block of
memory to avoid overhead
• Use the explicit keyword to avoid hidden
temporaries
• Avoid strings
Other Tidbits
•
•
•
•
•
•
Compiler settings: experiment
dynamic_cast: just say no
Constructors: performance killers
Unused static array space: track this
Loop unrolling: huge wins, sometimes
Suspicious comments: watch out
– “Immensely slow matrix multiplication”
Wrap Up
• Use the Right Tool for the
Job
• The STL is Evil, Sometimes
• Invent Only What You Need
• Profile
• Know Thine Holy Standard
• Understand Your Options
•
•
•
•
•
•
Beware Elegance
Inline Judiciously
Watch Out For For
Disable Exception Handling
Avoid Strings
Minimize Per-frame
Allocations
Call to Action: Evolve!
• Pass the rubber chicken
– Share your C++ performance mistakes
with your team
• Mentor junior programmers
– So they only make new mistakes
• Don’t stop learning
– You can never know enough C++
Questions
• Fill out your feedback forms
• Email: [email protected]
• This presentation:
www.tantalon.com/pete.htm