Optimization Best Practices C++

Tips and Tricks: Visual C++ 2005
Optimization Best Practices
Kang Su Gatlin
TLNL04
Program Manager
Visual C++
Microsoft Corporation
1
6 Tips/Best Practices To Help
Any C++ Dev Write Faster Code
Managed + Unmanaged
1. Pick the right level of optimization
2. Add instant parallelism
Unmanaged
3. Disambiguate memory
4. Use intrinsics
Managed
5. Avoid double thunks
6. Speed app startup time
2
1. Pick the Right Level Of Optimization
Builds from the Lab
If at all possible use Profile-Guided Optimization
Only available unmanaged
More on this next slide
If not, use Whole Program Optimization (/GL)
Available managed and unmanaged
After that we recommend
/O2 (optimize for speed) for hot functions/files
/O1 (optimize for size) for the rest
Other switches to use for maximum speed
/Gy
/OPT:REF,ICF (good size win on 64bit)
/fp:fast
/arch:SSE2 (will not work on downlevel architectures)
Debug Symbols Are NOT Only for Debug Builds
Executable size and codegen are NOT affected by this
It’s all in the PDB file
Always building debug symbols will make life easier
Make sure you use /OPT:REF,ICF, don’t use /ZI, and use /INCREMENTAL:NO
3
Next-Gen Optimizations Today
Profile Guided Optimization
The next level beyond Whole Program Optimization
Static compilers can’t answer everything
We get 20-50% improvement on large server applications
that we ship
Current support is unmanaged only
if(a < b)
    foo();
else
    baz();
for(i = 0; i < count; ++i)
    bar();
Should we unroll this loop?
Should we inline foo()?
4
Profile Guided Optimization
1. Compile source with /GL -> object files
2. Link with /LTCG:PGI -> instrumented image + PGD file
3. Run scenarios on the instrumented image -> profile data
4. Link with /LTCG:PGO (object files + profile data) -> optimized image
There is throughput impact
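As a rough sketch of the workflow above, the comment block below lists one plausible command sequence for the Visual C++ 2005 command-line tools, and the code repeats the branch-and-loop fragment from the earlier slide as a complete function. The file name `pgo_demo.cpp`, the `Counters` struct, and `run_once` are hypothetical names introduced for illustration, and a driver `main` that exercises representative scenarios is assumed but not shown.

```cpp
// pgo_demo.cpp (hypothetical file name)
//
// One possible build sequence, following the four steps on the slide:
//   cl /c /O2 /GL pgo_demo.cpp      compile for whole program optimization
//   link /LTCG:PGI pgo_demo.obj     instrumented image plus a .pgd file
//   pgo_demo.exe                    run the scenarios (writes .pgc profile data)
//   link /LTCG:PGO pgo_demo.obj     relink using the collected profile
#include <cassert>

// Counts how often each callee runs, standing in for real work.
struct Counters {
    int foo_calls;
    int baz_calls;
    int bar_calls;
};

static void foo(Counters& c) { ++c.foo_calls; }
static void baz(Counters& c) { ++c.baz_calls; }
static void bar(Counters& c) { ++c.bar_calls; }

// Same shape as the slide fragment. Whether foo() is worth inlining and
// whether the loop is worth unrolling depends on the values of a, b and
// count seen at run time, which is what the profile records.
Counters run_once(int a, int b, int count) {
    Counters c = {0, 0, 0};
    if (a < b)
        foo(c);
    else
        baz(c);
    for (int i = 0; i < count; ++i)
        bar(c);
    return c;
}
```

The scenarios run in step 3 should resemble real usage, since the optimized image is tuned to whatever the profile saw.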
5
What PGO Does And Does Not Do
PGO does
Optimizations galore
Speed/Size Determination
Switch expansion
Better inlining decisions
Function/basic block layout
Virtual call speculation
Partial inlining
Optimize within a single image
Merging and weighting of multiple scenarios
PGO does not
No probing assembly language (inline or otherwise)
No optimizations across DLLs
No data layout optimization
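To make "virtual call speculation" concrete, here is a hand-written picture of the idea, assuming a profile showing that most calls arrive with one concrete type. The `Shape`, `Square`, and `Rect` classes are hypothetical, and the `typeid` test only stands in for whatever cheaper check the compiler actually emits; this is an illustration of the transformation, not the compiler's output.

```cpp
#include <cassert>
#include <typeinfo>

struct Shape {
    virtual ~Shape() {}
    virtual int area() const = 0;
};

struct Square : Shape {
    int side;
    explicit Square(int s) : side(s) {}
    virtual int area() const { return side * side; }
};

struct Rect : Shape {
    int w, h;
    Rect(int w_, int h_) : w(w_), h(h_) {}
    virtual int area() const { return w * h; }
};

// What the programmer writes: an ordinary indirect (virtual) call.
int area_plain(const Shape& s) {
    return s.area();
}

// Hand-written equivalent of speculating on the hot type: test for Square
// first and take a direct, inlinable path; otherwise fall back to the
// normal virtual call so every other type still works.
int area_speculated(const Shape& s) {
    if (typeid(s) == typeid(Square)) {
        const Square& sq = static_cast<const Square&>(s);
        return sq.side * sq.side;  // body of Square::area() inlined by hand
    }
    return s.area();
}
```

Both functions return the same value for every input; only the cost of the common case differs.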
6
PGO Compilation in Visual C++ 2005
7
2. Add Instant Parallelism
Just add OpenMP Pragmas!
OpenMP is a popular API for
multithreaded programs
Born from the HPC community
It consists of a set of simple #pragmas and
runtime routines
Most value parallelizing large loops with no
loop-dependencies
Visual C++ 2005 implements the full
OpenMP 2.0 standard
Full unmanaged and /clr managed support
See the PDC issue of MSDN magazine for an article
on OpenMP
8
OpenMP Parallelization
void test(int first, int last) {
    #pragma omp parallel for
    for (int i = first; i <= last; ++i) {
        a[i] = b[i] * c[i];
    }
}
Each iteration is independent; order of execution does not matter
if(x < 0)
    a = foo(x);
else
    a = x + 5;
b = bat(y);
c = baz(x + y);
j = a*b+c;
Assignments to ‘a’, ‘b’, and ‘c’ are independent
#pragma omp parallel sections
{
    #pragma omp section
    if(x < 0)
        a = foo(x);
    else
        a = x + 5;
    #pragma omp section
    b = bat(y);
    #pragma omp section
    c = baz(x + y);
}
j = a*b+c;
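The two patterns on this slide can be written as complete functions along the following lines. This is a minimal sketch: `multiply`, `combine`, and the bodies of `foo`, `bat`, and `baz` are placeholders invented for the example. With Visual C++ the pragmas take effect when compiling with /openmp; a compiler without OpenMP enabled ignores them and runs the same code serially.

```cpp
#include <cassert>
#include <vector>

// Element-wise product. Each iteration writes a different a[i] and reads
// nothing written by another iteration, so the loop can be split freely.
void multiply(const std::vector<double>& b,
              const std::vector<double>& c,
              std::vector<double>& a) {
    assert(b.size() == c.size());
    a.resize(b.size());
    // A signed int index keeps the loop in the form OpenMP 2.0 accepts.
    const int n = static_cast<int>(b.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] * c[i];
    }
}

// Placeholder work functions (hypothetical bodies).
int foo(int x) { return -x; }
int bat(int y) { return 2 * y; }
int baz(int z) { return z + 1; }

// The three assignments do not depend on each other, so each can run in
// its own section. The combination happens after the parallel region.
int combine(int x, int y) {
    int a = 0, b = 0, c = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            if (x < 0)
                a = foo(x);
            else
                a = x + 5;
        }
        #pragma omp section
        { b = bat(y); }
        #pragma omp section
        { c = baz(x + y); }
    }
    return a * b + c;
}

// Small self-check used to exercise multiply().
bool multiply_is_correct() {
    std::vector<double> b(100), c(100), a;
    for (int i = 0; i < 100; ++i) {
        b[i] = i;
        c[i] = 2.0;
    }
    multiply(b, c, a);
    for (int i = 0; i < 100; ++i) {
        if (a[i] != 2.0 * i) return false;
    }
    return true;
}
```

Sections only pay off when each piece of work is large enough to outweigh the cost of starting the parallel region.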
9
OpenMP Case Study
Panorama Factory by Smoky City Design
Top-rated image stitching application
Added multithreading with OpenMP in
Visual C++ 2005 Beta2
Used 102 instances of #pragma omp *
Extremely impressive Results…
Stitching together several large images
Dual processor, dual core x64 machine
10
Panorama Factory Speed Up Using OpenMP
[Chart: speed up relative to single-threaded performance (vertical axis, 0 to 3.5) against number of threads (horizontal axis, 1 to 4); two series: speed up including I/O, and speed up not including I/O]
11
3. Disambiguate Memory
Programmer knows a and b never overlap
void copy8(int * a,
           int * b) {
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[0];
    a[3] = b[1];
    a[4] = b[0];
    a[5] = b[1];
    a[6] = b[0];
    a[7] = b[1];
}
; ecx = a, eax = b
mov edx, DWORD PTR [eax]
mov DWORD PTR [ecx], edx
mov edx, DWORD PTR [eax+4]
mov DWORD PTR [ecx+4], edx
mov edx, DWORD PTR [eax]
mov DWORD PTR [ecx+8], edx
mov edx, DWORD PTR [eax+4]
mov DWORD PTR [ecx+12], edx
mov edx, DWORD PTR [eax]
mov DWORD PTR [ecx+16], edx
mov edx, DWORD PTR [eax+4]
mov DWORD PTR [ecx+20], edx
mov edx, DWORD PTR [eax]
mov DWORD PTR [ecx+24], edx
mov eax, DWORD PTR [eax+4]
mov DWORD PTR [ecx+28], eax
12
Aliasing And Memory
Disambiguation
Aliasing is when one object can be used as an
alias to another object
If compiler can NOT prove that an object does
not alias then it MUST assume it can
How can we address some of these problems?
1. Avoid taking address of an object.
2. Avoid taking address of a function.
3. Avoid using global variables. Statics are preferable.
4. Use __restrict, __declspec(noalias), and __declspec(restrict) when possible.
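A small experiment shows why the compiler has to assume aliasing unless told otherwise. In the sketch below, `copy8` is the function from the earlier slide, while `copy8_cached` and `versions_agree` are hypothetical helpers: `copy8_cached` is a hand-written version of the "load b once" code the compiler would like to generate. The two agree when the arrays are separate and disagree when b points into a.

```cpp
#include <cassert>

// The slide's function. A store through a may change what b points at,
// so b[0] and b[1] have to be reloaded after every store.
void copy8(int* a, int* b) {
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[0];
    a[3] = b[1];
    a[4] = b[0];
    a[5] = b[1];
    a[6] = b[0];
    a[7] = b[1];
}

// Hand-written form of the optimized code: read b once, then only store.
// This matches copy8 only if a and b never overlap.
void copy8_cached(int* a, const int* b) {
    const int b0 = b[0];
    const int b1 = b[1];
    a[0] = b0; a[1] = b1;
    a[2] = b0; a[3] = b1;
    a[4] = b0; a[5] = b1;
    a[6] = b0; a[7] = b1;
}

// Runs both versions on identical data and reports whether they agree.
// With overlap == true, b points at the second element of a.
bool versions_agree(bool overlap) {
    int x[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    int y[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    int separate[2] = {1, 2};
    copy8(x, overlap ? x + 1 : separate);
    copy8_cached(y, overlap ? y + 1 : separate);
    for (int i = 0; i < 8; ++i) {
        if (x[i] != y[i]) return false;
    }
    return true;
}
```

Because the overlapping call is legal C++, the compiler may only use the cached form when it can prove, or is promised, that the overlap never happens.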
13
__restrict – A compiler hint
Programmer knows a and b don’t overlap
void copy8(int * __restrict a,
           int * b) {
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[0];
    a[3] = b[1];
    a[4] = b[0];
    a[5] = b[1];
    a[6] = b[0];
    a[7] = b[1];
}
; eax = a, edx = b
mov ecx, DWORD PTR [edx]
mov edx, DWORD PTR [edx+4]
mov DWORD PTR [eax], ecx
mov DWORD PTR [eax+4], edx
mov DWORD PTR [eax+8], ecx
mov DWORD PTR [eax+12], edx
mov DWORD PTR [eax+16], ecx
mov DWORD PTR [eax+20], edx
mov DWORD PTR [eax+24], ecx
mov DWORD PTR [eax+28], edx
14
__declspec(restrict)
Tells the compiler that the function returns an
unaliased pointer
Only applicable to functions
This is a promise the programmer makes to
the compiler
If this promise is violated the compiler may
generate bad code
The CRT uses this decoration, e.g., malloc,
calloc, etc…
__declspec(restrict) void *malloc(size_t size);
15
__declspec(noalias)
Tells the compiler that the function is a
semi-pure function
Only references locals, arguments, and first-level
indirections of arguments
This is a promise the programmer makes to
the compiler
If this promise is violated the compiler may
generate bad code
__declspec(noalias) void isElement(Tree *t, Element e);
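One way to use the two decorations in code that also has to build with other compilers is to hide them behind macros. The sketch below assumes that approach: the macro names `RETURNS_UNALIASED` and `NO_ALIAS` and the functions `make_zeroed`, `contains`, and `decorations_demo` are hypothetical, and on compilers other than Visual C++ the macros expand to nothing, so the promises are simply not communicated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#if defined(_MSC_VER)
#  define RETURNS_UNALIASED __declspec(restrict)
#  define NO_ALIAS          __declspec(noalias)
#else
#  define RETURNS_UNALIASED  /* no equivalent assumed here */
#  define NO_ALIAS           /* no equivalent assumed here */
#endif

// Promise: the returned pointer is not an alias of any other pointer the
// caller holds. Freshly allocated memory satisfies this.
RETURNS_UNALIASED int* make_zeroed(std::size_t count) {
    return static_cast<int*>(std::calloc(count, sizeof(int)));
}

// Promise: the function touches only its locals, its arguments, and memory
// reached directly through its pointer argument. No globals are involved.
NO_ALIAS bool contains(const int* data, std::size_t count, int value) {
    for (std::size_t i = 0; i < count; ++i) {
        if (data[i] == value) return true;
    }
    return false;
}

// Exercises both functions and releases the buffer again.
bool decorations_demo() {
    int* buffer = make_zeroed(4);
    if (buffer == 0) return false;
    buffer[2] = 7;
    const bool found = contains(buffer, 4, 7);
    const bool missing = !contains(buffer, 4, 9);
    std::free(buffer);
    return found && missing;
}
```

Both decorations are promises rather than checks, so a function that later starts reading a global must lose its `NO_ALIAS` marking.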
16
4. Use Intrinsics
Simply represented as functions to the programmer
_mm_load_pd(double const*);
Compilers understand these as primitives
Allows the user to get right at the hardware w/o
using asm
Almost anything you can do in assembly
interlock, memory fences, cache control, SIMD
The key to things such as vectorization and
lock-free programming
You can use intrinsics in a file compiled /clr, but the
function(s) will be compiled as unmanaged
Intrinsics are consumed by PGO and our optimizer
Inline asm is not
Documentation for intrinsics is much better in
Visual C++ 2005
[Visual Studio 8]\VC\include\intrin.h
17
Matrix Addition With Intrinsics
void MatMatAdd(Matrix &a, Matrix &b, Matrix &c) {
    for(int i = 0; i < a.m_rows; ++i)
        for(int j = 0; j < a.m_cols; j++)
            c[i][j] = a[i][j] + b[i][j];
}
#include <intrin.h>
void MatMatAddVect(Matrix &a, Matrix &b, Matrix &c) {
    __m128 aSIMD, bSIMD, cSIMD;
    for(int i = 0; i < a.m_rows; ++i)
        for(int j = 0; j < a.m_cols; j += 4)
        {
            aSIMD = _mm_load_ps(&a[i][j]);
            bSIMD = _mm_load_ps(&b[i][j]);
            cSIMD = _mm_add_ps(aSIMD, bSIMD);
            _mm_store_ps(&c[i][j], cSIMD);
        }
}
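The vectorized loop above steps four floats at a time and uses the aligned load and store forms, so it relies on the column count being a multiple of four and on 16-byte aligned rows. A variant that drops both assumptions might look like the sketch below. It works on a flattened array instead of the slide's `Matrix` type, and `add_floats`, `check_add`, and the `HAVE_SSE` macro are names made up for the example; when SSE is not available at compile time it falls back to the plain loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define HAVE_SSE 1
#else
#  define HAVE_SSE 0
#endif

// c[i] = a[i] + b[i] for n floats (a matrix stored as one flat array).
void add_floats(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
#if HAVE_SSE
    // Main loop: four floats per step, using unaligned loads and stores so
    // the pointers need no particular alignment.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif
    // Tail loop: the last n % 4 elements (or everything without SSE).
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Fills two rows x cols matrices with small integers and checks the sum.
bool check_add(std::size_t rows, std::size_t cols) {
    const std::size_t n = rows * cols;
    std::vector<float> a(n), b(n), c(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = static_cast<float>(2 * i);
    }
    if (n > 0) add_floats(&a[0], &b[0], &c[0], n);
    for (std::size_t i = 0; i < n; ++i) {
        if (c[i] != static_cast<float>(3 * i)) return false;
    }
    return true;
}
```

The aligned forms used on the slide can be faster when the data layout guarantees alignment, which is a reason to keep them where that guarantee holds.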
18
Spin-Lock With Intrinsics
#include <intrin.h>
#include <windows.h>
void EnterSpinLock(volatile long &lock) {
    while(_InterlockedCompareExchange(&lock, 1, 0) != 0)
        Sleep(0);
}
void ExitSpinLock(volatile long &lock) {
    lock = 0;
}
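For readers without the Visual C++ intrinsics at hand, the same compare-and-swap idea can be sketched with the standard `std::atomic` facility. This is a swapped-in technique, not the slide's code: `std::atomic` arrived with C++11 and is not available in Visual C++ 2005, and the `SpinLock` class, its member names, and `spin_lock_demo` are hypothetical. The explicit release store in `exit` plays the role of the slide's plain `lock = 0`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class SpinLock {
public:
    SpinLock() : state_(0) {}

    // Spin until the state changes from 0 (free) to 1 (held), yielding the
    // processor between attempts much like Sleep(0) on the slide.
    void enter() {
        long expected = 0;
        while (!state_.compare_exchange_weak(expected, 1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            expected = 0;  // a failed exchange overwrites expected
            std::this_thread::yield();
        }
    }

    // Single attempt; returns false if the lock is already held.
    bool try_enter() {
        long expected = 0;
        return state_.compare_exchange_strong(expected, 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed);
    }

    // Release store so writes made while holding the lock become visible
    // to the next thread that acquires it.
    void exit() {
        state_.store(0, std::memory_order_release);
    }

    bool is_held() const {
        return state_.load(std::memory_order_relaxed) != 0;
    }

private:
    std::atomic<long> state_;
};

// Single-threaded walk through the lock's states.
bool spin_lock_demo() {
    SpinLock lock;
    lock.enter();
    const bool held = lock.is_held();
    const bool second_attempt = lock.try_enter();  // fails while held
    lock.exit();
    const bool third_attempt = lock.try_enter();   // succeeds once released
    lock.exit();
    return held && !second_attempt && third_attempt && !lock.is_held();
}
```

A spin lock like this suits very short critical sections; longer waits are usually better served by a blocking primitive.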
19
5. Avoid Double-Thunks
Thunks are functions used to
transition from managed to
unmanaged (and vice-versa)
Managed Code: UnmanagedFunc();
-> Managed To Unmanaged Thunk
-> Unmanaged Code: UnmanagedFunc() { … }
Thunks are a part of life…
but sometimes we can have
Double Thunks…
20
Double Thunking
From managed to managed only
Indirect calls
Function pointers and virtual functions
Is the callee a managed or unmanaged entry point?
__declspec(dllexport)
No current mechanism to export functions as managed entry points
Managed Code: ManagedFunc();
-> Managed To Unmanaged Thunk
-> Unmanaged To Managed Thunk
-> Managed Code: ManagedFunc() { … }
21
How To Fix Double Thunking
Indirect Functions (including Virtual Funcs)
Compile with /clr:pure
Use __clrcall
__declspec(dllexport)
Wrap functions in a managed class, and then
#using the object file
22
Using __clrcall To
Improve Performance
23
6. Speed App Startup Time
No one likes to wait for an app to start up
There is still some time associated with
loading CLR
In some apps you may have non-CLR paths
Only load the CLR when you need to
Use DelayLoading technology in the linker
If the EXE is compiled /clr then we will always
load the CLR
24
Delay Loading The CLR
25
Summary Of Best Practices
Large and ongoing investment in managed and
unmanaged C++ code
Managed + Unmanaged
1. Use PGO for unmanaged and WPO for managed…
2. OpenMP can ease multithreaded development.
Unmanaged
3. Make it easier for the compiler to track pointers.
4. Intrinsics give the ability to get to the metal.
Managed
5. Know where your double thunks are and fix them.
6. Delay load the CLR to improve startup.
26
Resources
Visual C++ Dev Center
http://msdn.microsoft.com/visualc
This is the place to go for all our news
and whitepapers
Myself
[email protected]
http://blogs.msdn.com/kangsu
Must See Talks
TLN309 C++: Future Directions in Language
Innovation with Herb Sutter (Friday 10:30am)
27
© 2005 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
28