Visual C++ 2005 New Optimizations


Visual C++
Optimizations
Jonathan Caves
Principal Software Engineer
Visual C++
Microsoft Corporation
How can your application run faster?
► Maximize optimization for each file.
► Whole Program Optimization (WPO) goes beyond individual files.
► Profile Guided Optimization (PGO) specializes optimizations specifically for your application.
► New Floating Point Model.
► OpenMP.
► 64-bit Code Generation.
Maximum Optimization for Each File
► Compiler optimizes each source code file to get best runtime performance
 The only type of optimization available in Visual C++ 6
► Visual C++ 2005 added better optimization algorithms
 Specialized support for newer processors such as Pentium 4
 Improved speed and better precision of floating point operations
 New optimization techniques like loop unrolling
► Typical expectation for performance after rebuild
 10-20% improvement from Visual C++ 6 to Visual C++ 2002
 20-30% improvement from Visual C++ 6 to Visual C++ 2005
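To make the loop unrolling bullet concrete, here is a hand-written sketch of what the transformation does conceptually. The function names are hypothetical and the unroll factor of 4 is an assumption for illustration; the compiler chooses its own factor and may decide not to unroll at all.

```cpp
#include <cstddef>

// Straightforward loop: one add and one loop test per element.
int sum_simple(const int* data, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Hand-unrolled by 4 (illustrative factor): one loop test per four elements,
// followed by a remainder loop for the leftover 0-3 elements.
int sum_unrolled(const int* data, std::size_t n) {
    int total = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3];
    }
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Both functions return the same result; the unrolled form simply spends fewer instructions on loop control per element.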
Whole Program Optimization
► Typically Visual C++ will optimize programs by generating code for object files separately
► Introducing whole program optimization
 First introduced with Visual C++ 2002 and has since improved
 Compiler and linker set with new options (/GL and /LTCG)
 Compiler has freedom to do additional optimizations: cross-module inlining, custom calling conventions
 Visual C++ 2005 supports this on all platforms
 Whole program optimization is widely used for Microsoft products such as SQL Server
► Typically expect significant performance improvement
 About 30% improvement from Visual C++ 2003 to Visual C++ 2005
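As a rough illustration of cross-module inlining, the sketch below shows a function defined in one source file and called from another. The file names and function names are hypothetical, and both halves are shown in a single listing only so that it compiles as one unit. The build lines in the closing comment use the /GL and /LTCG options named above; everything else about them is an assumption about project layout.

```cpp
// --- would live in math_utils.cpp (hypothetical file name) ---
// Without /GL, the compiler sees only a declaration of square() while it
// compiles the caller, so it cannot inline the body there.
int square(int x) {
    return x * x;
}

// --- would live in main_logic.cpp (hypothetical file name) ---
int square(int x);  // declaration only; the definition is in the other file

int sum_of_squares(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i) {
        total += square(i);  // candidate for cross-module inlining
    }
    return total;
}

// Possible build steps (a sketch; add the project's remaining files,
// including the one that defines main, to both lines):
//   cl /c /O2 /GL math_utils.cpp main_logic.cpp
//   link /LTCG math_utils.obj main_logic.obj
```

With separate compilation the call to square() stays a real call; with whole program optimization the code generator runs at link time and can see both bodies at once.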
Profile Guided Optimization
► Static analysis leaves many open optimization questions for the compiler, leading to conservative optimizations
► Visual C++ programs can be tuned for expected user scenarios by collecting information from the running application
► Introducing profile guided optimization
 Optimizes code based on how customers actually use the program
 Runs optimizations at link time, like whole program optimization
 Available in Visual Studio 2005
 Widely adopted in Microsoft
► Is it common for p to be NULL? If it is not common for p to be NULL, the error code should be collected with other infrequently used code
if (p != NULL) {
    /* Perform action with p */
} else {
    /* Error code */
}
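The question "is it common for p to be NULL?" is exactly what a profile answers. The sketch below mimics that by counting each branch by hand. BranchProfile, read_value, and null_branch_is_cold are hypothetical names, and the 1% threshold is an arbitrary illustration rather than anything the compiler documents.

```cpp
#include <cstddef>

// Hypothetical counters standing in for what a profile would record.
struct BranchProfile {
    unsigned long taken_non_null = 0;
    unsigned long taken_null = 0;
};

// Returns the value behind p, or -1 on the error path.
int read_value(const int* p, BranchProfile& profile) {
    if (p != NULL) {
        ++profile.taken_non_null;  // expected hot path
        return *p;
    } else {
        ++profile.taken_null;      // expected cold path (error handling)
        return -1;
    }
}

// True when the recorded counts say the NULL branch is rare enough to be
// grouped with other infrequently used code. The 1% cutoff is illustrative.
bool null_branch_is_cold(const BranchProfile& profile) {
    unsigned long total = profile.taken_non_null + profile.taken_null;
    return total > 0 && profile.taken_null * 100 < total;
}
```

A static compiler has no such counts and must guess; PGO replaces the guess with measured data from the training runs.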
PGO: Instrumentation
► We instrument with “probes” inserted into the code
► Two main types of probes
 Value probes: used to construct a histogram of values
 Count (simple/entry) probes: used to count the number of times a path is taken
► We try to insert the minimum number of probes to get full coverage
 Minimizes the cost of instrumentation
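The two probe kinds can be pictured with a small hand-written model. This is only a sketch of the idea: CountProbe, ValueProbe, and dispatch are hypothetical names, and the real instrumentation is inserted by the linker rather than written in source.

```cpp
#include <map>

// Count probe: counts how many times a path was taken.
struct CountProbe {
    unsigned long hits = 0;
    void record() { ++hits; }
};

// Value probe: builds a histogram of the values seen at one point in the code.
struct ValueProbe {
    std::map<int, unsigned long> histogram;
    void record(int value) { ++histogram[value]; }

    // Most frequently observed value; returns fallback if nothing was recorded.
    int most_frequent(int fallback) const {
        int best = fallback;
        unsigned long best_count = 0;
        for (const auto& entry : histogram) {
            if (entry.second > best_count) {
                best = entry.first;
                best_count = entry.second;
            }
        }
        return best;
    }
};

// Hand-instrumented version of a small dispatch function.
int dispatch(int i, ValueProbe& switch_values, CountProbe& default_path) {
    switch_values.record(i);       // value probe on the switch operand
    switch (i) {
        case 1: return 100;
        case 2: return 200;
        default:
            default_path.record(); // count probe on the default path
            return 0;
    }
}
```

The histogram from a value probe is the kind of data that later drives decisions such as which switch value to test first.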
PGO Optimizations
► Switch expansion
► Better inlining decisions
► Cold code separation
► Virtual call speculation
► Partial inlining
Profile Guided Optimization
► Source → compile with /GL and optimizations on (e.g. /O2) → object files
► Object files → link with /LTCG:PGI → instrumented image
► Instrumented image + scenarios → output + profile data
► Object files + profile data → link with /LTCG:PGO → optimized image
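The workflow can be sketched as a command sequence around a representative workload. The file name app.cpp and the function run_training_scenario are hypothetical, and the commands assume a single-file project whose main calls that function; only the /GL, /O2, /LTCG:PGI, and /LTCG:PGO options come from the slide.

```cpp
// Possible PGO build for a one-file project (a sketch, not a full recipe):
//
//   cl /c /O2 /GL app.cpp        compile with whole program optimization on
//   link /LTCG:PGI app.obj       link the instrumented image
//   app.exe                      run the scenarios to collect profile data
//   link /LTCG:PGO app.obj       relink using the collected profile data

#include <cstddef>
#include <numeric>
#include <vector>

// Representative workload for the training run. Every 100th item takes a
// rare "discount" path, so the profile records one hot and one cold branch.
long run_training_scenario(std::size_t items) {
    std::vector<long> prices(items);
    std::iota(prices.begin(), prices.end(), 1L);  // 1, 2, ..., items
    long total = 0;
    for (long p : prices) {
        if (p % 100 == 0) {
            total += p / 2;  // rare path
        } else {
            total += p;      // common path
        }
    }
    return total;
}
```

The quality of the optimized image depends on how closely the training scenarios resemble real use, so the workload should exercise the paths customers actually hit.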
PGO: Inlining Sample
► Profile Guided uses call graph path profiling.
[Figure: call graph with nodes a, foo, bat, bar, baz]
PGO: Inlining Sample (Cont)
► Profile Guided uses call graph path profiling.
[Figure: the call graph expanded by call path, with nodes a, foo, bat and three copies each of bar and baz; counts shown: 10, 75, 20, 50, 100, 15, 15]
PGO – Inlining Sample (cont)
► Inlining decisions are made at each call site.
[Figure: call graph with nodes a, foo, bat and two copies each of bar and baz; counts shown: 10, 20, 125, 100, 15, 15]
PGO – Switch Expansion
► Most frequent values are pulled out.
Before:
// 90% of the time i = 10;
switch (i) {
    case 1: …
    case 2: …
    case 3: …
    default: …
}
After:
if (i == 10)
    goto default;
switch (i) {
    case 1: …
    case 2: …
    case 3: …
    default: …
}
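A compilable version of the same idea, written by hand, might look like the sketch below. The function names and the return values for each case are made up for illustration; the hot value 10 is taken from the slide's comment.

```cpp
// Original form: every value goes through the switch dispatch.
int handle_original(int i) {
    switch (i) {
        case 1:  return 11;
        case 2:  return 22;
        case 3:  return 33;
        default: return 0;
    }
}

// Expanded form: the profiled hot value (10) is tested first and goes
// straight to the default handling, skipping the switch dispatch.
int handle_expanded(int i) {
    if (i == 10) {
        return 0;  // same result as the default case
    }
    switch (i) {
        case 1:  return 11;
        case 2:  return 22;
        case 3:  return 33;
        default: return 0;
    }
}
```

The two functions agree for every input; the expanded one just reaches the common answer after a single comparison.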
PGO – Code Separation
Basic blocks are ordered so that the most frequent path falls through.
[Figure: control flow graph of blocks A, B, C, D with edge counts 100, 10, 100, 10]
Default layout: A, B, C, D
Optimized layout: A, B, D, C
PGO – Virtual Call Speculation
According to the profiles, the type of object A in function Func was almost always Foo
class Base{
    …
    virtual void call();
};
class Foo:Base{
    …
    void call();
};
class Bar:Base {
    …
    void call();
};
Before:
void Func(Base *A)
{
    …
    while(true)
    {
        …
        A->call();
        …
    }
}
After:
void Func(Base *A)
{
    …
    while(true)
    {
        …
        if(type(A) == Foo:Base)
        {
            // inline of A->call();
        }
        else
            A->call();
        …
    }
}
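The slide's `type(A)` test is pseudocode. A hand-written approximation in standard C++ can use `typeid`, as in the sketch below. This is an assumption about how to express the idea in source, not a description of what the compiler emits; the int return values and the bounded loop are added only so the result is observable.

```cpp
#include <typeinfo>

class Base {
public:
    virtual ~Base() {}
    virtual int call() { return 0; }
};

class Foo : public Base {
public:
    int call() override { return 1; }
};

class Bar : public Base {
public:
    int call() override { return 2; }
};

// Plain form: every iteration goes through virtual dispatch.
int func_virtual(Base* a, int iterations) {
    int total = 0;
    for (int n = 0; n < iterations; ++n) {
        total += a->call();
    }
    return total;
}

// Speculated form: test for the common dynamic type first. The qualified
// call Foo::call() bypasses virtual dispatch, so it can be inlined.
int func_speculated(Base* a, int iterations) {
    int total = 0;
    for (int n = 0; n < iterations; ++n) {
        if (typeid(*a) == typeid(Foo)) {
            total += static_cast<Foo*>(a)->Foo::call();  // direct call
        } else {
            total += a->call();                          // fallback
        }
    }
    return total;
}
```

The fallback branch keeps the code correct for every other type, so the speculation only affects speed, never results.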
PGO – Partial Inlining
[Figure: flow graph of a callee with blocks Basic Block 1, Cond, Hot Code, Cold Code, More Code]
PGO – Partial Inlining (cont)
Hot path is inlined, but NOT the cold
[Figure: the same flow graph with blocks Basic Block 1, Cond, Hot Code, Cold Code, More Code]
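Partial inlining can be imitated by hand to show the shape of the result. In the sketch below, status_text is the callee with a hot and a cold path, and report is a caller after the hot path has been copied in. All names and strings are hypothetical.

```cpp
#include <string>

// Cold part, kept out of line (hypothetical error formatting).
std::string describe_error(int code) {
    return "error code " + std::to_string(code);
}

// Callee with a hot path and a cold path.
std::string status_text(int code) {
    if (code == 0) {
        return "ok";              // hot code
    }
    return describe_error(code);  // cold code
}

// Caller after hand-applied partial inlining: the condition and the hot
// path are copied into the caller; the cold path is still reached by a call.
std::string report(int code) {
    std::string text;
    if (code == 0) {
        text = "ok";                  // inlined hot path
    } else {
        text = describe_error(code);  // cold path stays a call
    }
    return "status: " + text;         // "more code" after the call site
}
```

The caller grows only by the small hot path, which is why this can pay off where inlining the whole callee would cost too much code size.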
Demo
Optimizing applications with Visual C++
New Floating Point Model
► /Op made your code run slow
 No intermediate switch
► New Floating Point Model
 /fp:fast
 /fp:precise (default)
 /fp:strict
 /fp:except
/fp:precise
► The default floating point switch
► Performance and Precision
► IEEE Conformant
► Round to the appropriate precision
 At assignments, casts and function calls
/fp:fast
► When performance matters most
► You know your application does simple floating point operations
► What can /fp:fast do?
 Association
 Distribution
 Factoring inverse
 Scalar reduction
 Copy propagation
 And others…
/fp:except
► Reliable floating point exceptions
► Thrown and not thrown when expected
 Faults and traps, when reliable, should occur at the line that causes the exception
 FWAITs on x86 might be added
► Cannot be used with /fp:fast or in managed code
/fp:strict
► The strictest FP option
 Turns off contractions
 Assumes floating point control word can change or that the user will examine flags
► /fp:except is implied
► Low double digit percent slowdown versus /fp:fast
What is the output?
#include <stdio.h>
int main()
{
    double x, y, z;
    double sum;
    x = 1e20;
    y = -1e20;
    z = 10.0;
    sum = x + y + z;
    printf("sum=%f\n", sum);
}
/fp:fast /O2 = 0.000
/fp:strict /O2 = 10.0
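The two answers come from two evaluation orders of the same sum. The sketch below writes each order out explicitly so the difference can be reproduced without any special switch. The function names are hypothetical, and the code assumes IEEE 754 doubles and a build without fast-math style options.

```cpp
// Left-to-right order, as written in the source expression x + y + z.
double sum_left_to_right(double x, double y, double z) {
    volatile double partial = x + y;  // volatile forces a true double store
    return partial + z;
}

// Reassociated order, one rewrite that association permits: x + (y + z).
double sum_reassociated(double x, double y, double z) {
    volatile double partial = y + z;  // 10.0 is lost next to -1e20
    return x + partial;
}
```

With x = 1e20, y = -1e20, z = 10.0, the first function returns 10.0 and the second returns 0.0, because 10.0 is smaller than the spacing between adjacent doubles near 1e20.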
OpenMP
 A specification for writing multithreaded programs
 It consists of a set of simple #pragmas and runtime routines
 Makes it very easy to parallelize loop-based code
 Helps with load balancing, synchronization, etc…
 In Visual Studio, only available in C++
OpenMP Parallelization
► Can parallelize loops and straight-line code
► Includes synchronization constructs
void test(int first, int last) {
    #pragma omp parallel for
    for (int i = first; i <= last; ++i) {
        a[i] = b[i] + c[i];
    }
}
[Figure: with first = 1 and last = 1000, the iterations are split into four ranges: 1 ≤ i ≤ 250, 251 ≤ i ≤ 500, 501 ≤ i ≤ 750, 751 ≤ i ≤ 1000]
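A self-contained variant of the slide's loop is sketched below, with the arrays passed in as vectors instead of the undeclared globals a, b, and c. The helper static_chunk is hypothetical; it only reproduces the arithmetic behind the four ranges in the figure under the assumption of an even static split, and it is not part of OpenMP. In Visual C++ the pragma takes effect when building with /openmp; without OpenMP support it is ignored and the loop runs serially with the same result.

```cpp
#include <vector>

// Element-wise add over the inclusive index range [first, last].
void add_range(std::vector<int>& a, const std::vector<int>& b,
               const std::vector<int>& c, int first, int last) {
    #pragma omp parallel for
    for (int i = first; i <= last; ++i) {
        a[i] = b[i] + c[i];  // iterations are independent, so safe to split
    }
}

// Inclusive index range handled by one chunk.
struct Range {
    int begin;
    int end;
};

// Hypothetical helper: the range for chunk_index when [first, last] is split
// into chunk_count equal parts. Assumes the count divides the range evenly.
Range static_chunk(int first, int last, int chunk_index, int chunk_count) {
    int total = last - first + 1;
    int size = total / chunk_count;
    int begin = first + chunk_index * size;
    return Range{begin, begin + size - 1};
}
```

For example, static_chunk(1, 1000, 1, 4) yields the range 251 to 500, matching the second range in the figure.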
64-bit Compilers
► 64-bit Compiler Cross Tools
 Compiler is 32-bit but resulting image is 64-bit
► 64-bit Compiler Native Tools
 Compiler and resulting image are 64-bit binaries.
► All previous optimizations apply for 64-bit as well.
Resources
► Visual C++ Dev Center
 http://msdn.microsoft.com/visualc
 This is the place to go for all our news and whitepapers
 Also VC2005 specific forums at http://forums.microsoft.com
► Myself
 [email protected]