Profile Guided Optimizations in
Visual C++ 2005
Andrew Pardoe
Phoenix Team (C++ Optimizer)
What do optimizers do?
int setArray(int a, int *array)
{
    int x;
    for (x = 0; x < a; ++x)
        array[x] = 0;
    return x;
}
• The compiler knows nothing about the value of ‘a’
• The compiler knows nothing about the array’s alignment
• The compiler doesn’t look at all the source files together
• The compiler doesn’t know how the program will execute
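With that information the optimizer could do much better. As a hand-written sketch (illustrative only, not literally what the VC++ code generator emits): if a profile showed that ‘a’ is usually large, the zero-fill loop could be collapsed into a single memset call:

#include <cstring>

// Sketch: what the optimizer could generate if it knew, for example from a
// profile, that 'a' is typically large. The return value matches the
// original loop (a when a > 0, otherwise 0).
int setArray(int a, int *array)
{
    if (a > 0)
        std::memset(array, 0, a * sizeof(int));
    return a > 0 ? a : 0;
}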
What is PGO (pronounced PoGO)?
• A “profile” details a program’s behavior in a specific scenario
• Profile-guided optimizations use the profile to guide the optimizer for that given scenario
• PGO tells the optimizer which areas of the application were most frequently executed
• This information lets the optimizer be more selective in optimizing the program
• PGO has its own set of optimizations as well as improving traditional optimizations
Example of a PGO win
• Compiler optimizations make assumptions based on static analysis and standard heuristics
• For example, we assume that a loop executes multiple times:
    for (p = list; p != NULL; p = p->next) {
        p->f = sqrt(F);
    }
• The optimizer would hoist the call to the loop-invariant sqrt(F):
    tmp = sqrt(F);
    for (p = list; p != NULL; p = p->next) {
        p->f = tmp;
    }
• If the profile shows that p is almost always zero (the list is usually empty), we will not hoist the call
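A self-contained version of the snippet above (the slide leaves the list node type implicit, so the struct here is an assumption):

#include <cmath>
#include <cstddef>

struct Node { double f; Node *next; };

void fillList(Node *list, double F)
{
    // The static heuristic hoists the loop-invariant sqrt(F) out of the
    // loop. If the profile shows the list is almost always empty, PGO
    // leaves the call inside the loop so the common case pays nothing.
    for (Node *p = list; p != NULL; p = p->next)
        p->f = std::sqrt(F);
}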
How is PGO used?
[Diagram: the PGO build flow. The source code is compiled and linked into an instrumented binary containing PGO probes; running that binary on your scenarios produces a profile; the source code plus the profile are then used to build the optimized binary.]
How is PGO used?
• PGO is built on top of Link-Time Code Generation
• Must link object files twice: once for the instrumented build, once for the optimized build
• Can be used on almost all native code
  - exe, dll, lib
  - COM/MFC
  - Windows services
• Cannot be used on system or managed code
  - Drivers or kernel mode code
  - No code compiled with /CLR
• Incorrect scenarios could cause worse optimizations!
PGO profile gathering
• Two major themes of PGO profile gathering
  - Identify “hot paths” in the program’s execution and optimize to make these paths perform well
  - Likewise, identify “cold paths” to separate cold code (or dead code) from hot code
  - Identify “typical” values such as switch values, loop induction variables and targets of indirect calls, and optimize code for these values
PGO main optimizations: inlining
• Improved inlining heuristics
  - Inline based on frequency of call, not function size or depth of call stack
  - “Hot” call sites: inline aggressively
  - “Cold” call sites: only inline if there are other optimization opportunities (such as folding)
  - “Dead” call sites: only inline the trivial cases
PGO main optimizations: inlining
• Speculative inlining: used for virtual call speculation
  - Indirect calls are profiled to find typical targets
  - An indirect call heavily biased toward certain target(s) can be multi-versioned
  - The new sequence contains direct call(s) to the typical target(s), which can be inlined
• Partial inlining: only inline the portions of the callee we execute. If the cold code is called, call the non-inlined function.
PGO main optimizations: code size
• Choice of favoring size versus speed is made on a per-function basis
• PGO computes a dynamic instruction count for each profiled function
  - Program execution should be dominated by functions optimized for speed, and less-frequently used functions should be small
  - Inlining effects are taken into account
  - Functions are sorted in descending order by count
  - Functions in the upper 99% of total dynamic instruction count are optimized for speed; the others are compressed (optimized for size) — sketched below
• In large applications (Vista, SQL) most functions are optimized for size
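A minimal sketch of how such a speed/size cutoff can be computed from the profile. This is illustrative C++ only, not the compiler’s actual implementation; the type and function names are invented:

#include <algorithm>
#include <cstdint>
#include <vector>

struct FunctionProfile {
    const char   *name;
    std::uint64_t dynamicInstructionCount;  // from the profile, after inlining
    bool          optimizeForSpeed;
};

void chooseSpeedVsSize(std::vector<FunctionProfile> &funcs)
{
    std::uint64_t total = 0;
    for (const FunctionProfile &f : funcs)
        total += f.dynamicInstructionCount;

    // Sort functions in descending order of dynamic instruction count.
    std::sort(funcs.begin(), funcs.end(),
              [](const FunctionProfile &a, const FunctionProfile &b) {
                  return a.dynamicInstructionCount > b.dynamicInstructionCount;
              });

    // Optimize functions for speed until they cover 99% of all executed
    // instructions; everything past the cutoff is compiled for size.
    std::uint64_t covered = 0;
    for (FunctionProfile &f : funcs) {
        f.optimizeForSpeed = (covered * 100 < total * 99);
        if (f.optimizeForSpeed)
            covered += f.dynamicInstructionCount;
    }
}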
PGO main optimizations: locality
• Reorder the code to “fall through” wherever possible
• Intra-function layout reorders basic blocks so that the major trace falls through whenever possible
• Inter-function layout tries to place frequent caller-callee pairs near one another in the image
• Extract “dead” code from the .text section and put it in a remote section of the image
• Dead code can be entire functions that are not called or basic blocks inside a function
• Penalty for being wrong is very large, so the profile must be accurate!
What code benefits most?
• C++ programs: many virtual calls can be inlined once the target is determined through profiling
• Large applications where size and speed are important
• Code with frequent branches that are difficult to predict at compile time
• Code which can be separated by profiling into “hot” and “cold” blocks to help instruction cache locality
• Code for which you know the typical usage patterns and can produce accurate profiling scenarios
Scenario 1
• Customer compiles with /O2 and gets pretty good performance but wants to take advantage of advanced optimizations like LTCG and PGO
• Code is tested by the dev team throughout the development cycle using unit and bug regression tests
• Customer has done performance measurements of the code. Customer has no automated tests to measure performance but believes it can improve.
• Is this customer ready to try PGO? Probably not.
Scenario 2
• Customer has well-defined performance goals and tests set up to measure performance
• Customer knows typical usage patterns for the application
• Application is being built with LTCG
• Most of the execution time is spent in tightly-nested loops doing heavy floating-point calculations
• Is this customer ready to use PGO? Maybe…
Scenario 3
• Customer has well-defined performance goals and tests set up to measure performance
• Customer knows typical usage patterns for the application
• Application is being built with LTCG
• Application spends most of its time in branches and calls
• Application is fairly large and makes use of inheritance
• Is this customer ready to use PGO? Definitely.
Scenario 4
• Customer has a build lab and wants to enable PGO in nightly builds
• But profiling every night seems too expensive
• Solution: PGO Incremental Update (PGU, the /LTCG:PGUPDATE link option)
  - Avoid running profile scenarios at every build
  - PGU uses “stale” profile data
  - Can check in profile data and refresh weekly
• PGU restricts optimizations
  - Functions which have changed will not be optimized
  - Effects of localized changes are usually negligible
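In practice the incremental update is just a different link step. Following the command-line pattern shown in the summary at the end of this deck (appname is a placeholder), it would look something like:

    link /ltcg:pgupdate /pgd:appname.pgd *.obj *.lib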
PGO sweeper
• Some scenarios are difficult to collect profile data for
  - The profile scenario may not begin and end with application launch and shutdown
  - Some components cannot write a file
  - Some components cannot link to the PGO runtime DLL
• PGO sweeper collects profile data from running instrumented processes
  - This allows you to close a currently open .pgc file and create a new one without exiting the instrumented binary
  - You get one .pgc file per run or sweep. You can delete any .pgc files you do not want reflected in your scenario.
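For example, to sweep a running instrumented process into a new count file (pgosweep ships with the PGO tools; the file names here are just placeholders):

    pgosweep app.inst.exe scenario1.pgc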
PGO Manager
• PGO manager (pgomgr) adds profile data from one or more .pgc files into the .pgd file
• The .pgd file is the main profile database
• Allows you to profile multiple scenarios (.pgc) for a single codebase into one profile database (.pgd)
• PGO manager also lets you generate reports from the .pgd file to see that your scenarios “feel right” in the code
• Information in the reports includes
  - Module count, function count, arc and value count
  - Static (all) instruction count, dynamic (hot) instruction count
  - Basic block count, average basic block size
  - Function entry count
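A typical pgomgr session for the steps above (the scenario and application names are placeholders):

    pgomgr /merge scenario1.pgc scenario2.pgc appname.pgd
    pgomgr /summary appname.pgd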
How much performance does PGO get?
• Performance gain is architecture and application specific
  - IA64 sees the biggest gains
  - x64 benefits more than x86
  - Large applications benefit more than small: SQL Server saw over 30% gains through PGO
  - Many parts of Windows use PGO to balance size vs. speed
• If you understand your real-world scenarios and have adequate, repeatable tests, PGO is almost always a win
• Once your testing is in place, integrating PGO into your build process should be easy
Performance gains over LTCG
[Chart: measured performance gains of PGO builds relative to LTCG-only builds]
Call-graph profiling
• Given this call graph, determine which code paths are hot and which are cold
[Diagram: a call graph with nodes a, foo, bat, bar and baz]
Call-graph profiling continued
• Measure the frequency of calls
[Diagram: the same call graph with a call count annotated on each edge; the counts shown include 10, 15, 20, 50, 75 and 100]
Call-graph profiling after inlining
• Inline functions based on call profile
• Highest-frequency calls are (bar, baz) and (bat, bar)
[Diagram: the call graph after inlining along those highest-frequency edges, with baz inlined into bar and bar into bat; the remaining edge counts include 10, 20, 125, 100 and 15]
Reordering basic blocks
• Change code layout to improve instruction cache locality
[Diagram: an execution profile of a small control-flow graph in which block A branches to B (taken 10 times) and to C (taken 100 times), and both paths rejoin at D. The default layout orders the blocks A, B, C, D; the optimized layout places the hot path A, C, D contiguously and moves the rarely executed block B after D.]
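A hand-written C++ illustration of the same idea (all names here are invented; the layout change is something PGO performs on the generated code, not something you write yourself):

#include <cstdio>

struct Record { int value; };

static int transform(const Record *r) { return r->value * 2; }

// Block A is the null test, block B the rarely taken error path, and blocks
// C/D the hot path. With an accurate profile, intra-function layout places
// A, C and D contiguously so the common case falls straight through, and
// moves the cold block B (or whole cold functions) away from the hot code.
int processRecord(const Record *r)
{
    if (r == 0) {                               // A -> B, taken rarely
        std::fprintf(stderr, "null record\n");  // B: cold error path
        return -1;
    }
    return transform(r) + 1;                    // C, D: hot fall-through path
}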
Speculative inlining of virtual calls
• Profiling shows the dynamic type of object A in function Func was almost always Foo (and almost never Bar)

class Base
{
    …
    virtual void call();
};
class Foo : Base
{
    …
    void call();
};
class Bar : Base
{
    …
    void call();
};

// Before PGO:
void Func(Base *A)
{
    …
    while(true)
    {
        …
        A->call();    // virtual dispatch
        …
    }
}

// After PGO:
void Func(Base *A)
{
    …
    while(true)
    {
        …
        if(type(A) == Foo)
        {
            // inline of A->call()
        }
        else // virtual dispatch
            A->call();
        …
    }
}
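A compilable sketch of the transformed pattern. The compiler’s real type test is a vtable-pointer comparison; typeid is used here only to keep the example readable, and all names are hypothetical:

#include <typeinfo>

struct Base { virtual int call() { return 0; } virtual ~Base() {} };
struct Foo : Base { virtual int call() { return 1; } };
struct Bar : Base { virtual int call() { return 2; } };

int Func(Base *A)
{
    int sum = 0;
    for (int i = 0; i < 1000; ++i) {
        if (typeid(*A) == typeid(Foo))
            sum += static_cast<Foo *>(A)->Foo::call();  // direct, inlinable call
        else
            sum += A->call();                           // fall back to virtual dispatch
    }
    return sum;
}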
Partial inlining
• Profiling shows that condition Cond favors the left branch over the right branch
[Diagram: flow graph of the callee. Basic Block 1 leads to the test Cond, which branches left to Hot Code and right to Cold Code; both paths rejoin at More Code.]
Partial inlining concluded
• We can inline the hot path, and not the cold path
• We can make different decisions at each call site!
[Diagram: the same flow graph with Basic Block 1, Cond, Hot Code and More Code inlined at the call site; the Cold Code path remains in the non-inlined callee.]
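Conceptually, the partially inlined call site looks something like the following hand-written sketch (the compiler performs this transformation itself; all names are invented):

#include <cstdio>

static void handleRareCase(int x)     // the cold code, kept out of line
{
    std::fprintf(stderr, "rare case: %d\n", x);
}

static int callee(int x)              // the original, non-inlined callee
{
    if (x < 0) {                      // Cond: almost never true
        handleRareCase(x);
        return 0;
    }
    return x * 2;                     // hot code
}

int caller(int x)
{
    // After partial inlining, only the hot path of callee lives here; if
    // the cold condition fires we fall back to calling the full callee.
    if (x >= 0)
        return x * 2;                 // inlined hot path
    return callee(x);                 // cold path: call the non-inlined function
}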
Using PGO (in more detail)
• Source code is compiled with /GL (and your usual optimization switches) into object files
• The object files are linked with /LTCG:PGI to produce the instrumented binary and a .PGD file
• Running the instrumented binary on your scenarios produces .PGC files
• The object files, the .PGD file and the .PGC file(s) are linked with /LTCG:PGO to produce the optimized binary
PGO tips
• The scenarios used to generate the profile data should be real-world scenarios. The scenarios are NOT an attempt to do code coverage.
• Using scenarios to train with that are not representative of real-world use can result in code that performs worse than if PGO was not used.
• Name the optimized code something different from the instrumented code, for example, app.opt.exe and app.inst.exe. This way you can rerun the instrumented application to supplement your set of scenario profiles without rerunning everything again.
• To tweak results, use the /clear option of pgomgr to clear out a .PGD file.
PGO tips
• If you have two scenarios that run for different amounts of time, but would like them to be weighted equally, you can use the weight switch (/merge:weight in pgomgr) on .PGC files to adjust them (see the example below)
• You can use the speed switch to change the speed/size thresholds
• You can control the inlining threshold with a switch, but use it with care. The values from 0-100 aren't linear.
• Integrate PGO into your build process and update scenarios frequently for the most consistent results and best performance increases
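For example, if scenario2 ran for a tenth of the time of scenario1 but should count equally, you could weight it when merging (the file names are placeholders and the exact weight is something you would tune):

    pgomgr /merge scenario1.pgc appname.pgd
    pgomgr /merge:10 scenario2.pgc appname.pgd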
In summary
• Using PGO is very easy, with four simple steps
  - CL to parse the source files
      cl /c /O2 /GL *.cpp
  - LINK /LTCG:PGI to generate the instrumented image
      link /ltcg:pgi /pgd:appname.pgd *.obj *.lib
      (also generates a PGD file, the PGO database)
  - Run your program on representative scenarios
      (generates PGC files, the PGO profile data)
  - LINK /LTCG:PGO to generate the optimized image
      link /ltcg:pgo /pgd:appname.pgd *.obj *.lib
      (implicitly uses the generated PGC files)
More information
• Matt Pietrek’s Under the Hood column from May 2002 has a fantastic explanation of LTCG internals
• Multiple articles on PGO are located on MSDN
  - The links are long: just search for PGO on MSDN
  - Look through articles by Kang Su Gatlin on his blog at http://blogs.msdn.com/kangsu or on MSDN
• Improvements are coming in the new VC++ backend
  - Based on the Phoenix optimization framework
  - Profiling is a major scenario for the Phoenix-based optimizer
  - There will be a talk on Phoenix later today