Scalar and Serial Optimization
Financial Services Engineering
Software and Services Group
Intel Corporation
Agenda
• Objective
• Algorithmic and Language
• Precision, Accuracy, Function Domain
• Lab Step 2: Scalar and Serial Optimization
• Summary
Objective
Objective of Scalar and Serial Optimization
• Obtain the most efficient implementation for the problem at hand
• Identify the opportunities for vectorization and parallelization
• Create a baseline against which vectorization and parallelization gains are measured
– Avoid the situation where slower code is vectorized and parallelized, creating a false impression of the performance gain
Algorithmic and Language
Algorithmic Optimizations
• Hoist constants out of the core loops
– The compiler can do this, but it needs your cooperation
– Group constants together
• Avoid and replace expensive operations
– Division by a constant can usually be replaced by multiplication with its reciprocal
• Strength reduction in hot loops
– The direct formulation is popular because it is clean
– An iterative formulation can strength-reduce the operations involved
– In this example, exp() is replaced by a simple multiplication
// Direct formulation: exp() is evaluated in every iteration
const double dt  = T / (double)TIMESTEPS;
const double vDt = V * sqrt(dt);
for (int i = 0; i <= TIMESTEPS; i++) {
    double price = S * exp(vDt * (2.0 * i - TIMESTEPS));
    cell[i] = max(price - X, 0.0);
}
// Iterative formulation: exp() in the loop is strength-reduced to a multiply
const double factor = exp(vDt * 2.0);
double price = S * exp(-vDt * (2.0 + TIMESTEPS));
for (int i = 0; i <= TIMESTEPS; i++) {
    price = factor * price;
    cell[i] = max(price - X, 0.0);
}
Understand C/C++ Type Conversion Rules
• C/C++ implicit type conversion rules
– double is higher in the type hierarchy than float in C/C++
– A variable is promoted to double when it appears in an expression with a double
– 0.5*V*V will trigger an implicit conversion if V is a float
– double is at least 2x slower than float
– Type conversion is expensive: about 6 cycles inside the VPU
• Avoid raw floating-point literals; always type your constants (see the sketch after this list)
– Use const float HALF = 0.5f;
• Choose the right runtime API calls
– sqrt(), exp(), log() take double parameters
– sqrtf(), expf(), logf() take float parameters
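For illustration, a minimal sketch (not from the original deck; the function and variable names are made up) contrasting an untyped double literal with a typed float constant and the float runtime API:

#include <math.h>

/* Untyped literal: 0.5 is a double, so V is promoted, the arithmetic and
   exp() run in double precision, and the result is converted back. */
float payoff_slow(float V, float S)
{
    return S * (float)exp(0.5 * V * V);
}

/* Typed constant and float API: everything stays in single precision. */
float payoff_fast(float V, float S)
{
    const float HALF = 0.5f;
    return S * expf(HALF * V * V);
}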
Use Mathematical Equivalence
• A direct implementation of a mathematical formula can result in redundant computation
– Understand your target machine
– Transform your calculation into basic operations
– Reuse previous results as much as you can
• Fewer add/multiply operations also make the result more accurate
• Example: the Black-Scholes formula (a code sketch follows the formulas)
CND ( x)  1  CND ( x)
c  SCND (d1 )  X e  rT CND (d 2 )
p  X e  rT CND (d 2 )  SCND (d1 )
d1 
d2 
8
ln( S
X
)  (r  v
2
2
)T
v T
ln( S
X
)  (r  v
v T
2
2
)T
d1 
d2 
c  SCND (d1 )  X e  rT CND (d 2 )
p  X e  rT  S  c
ln( S
X
)  (r  v
2
2
)T
d1 
v T
ln( S
X
)  (r  v
v T
2
2
)T  v 2T
ln( S
X
)  (r  v
2
2
)T
v T
d 2  d1  v T
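A sketch of the transformed version in code (illustrative only, not from the deck; CND() stands for whichever cumulative-normal implementation is used):

#include <math.h>

float CND(float x);   /* cumulative normal distribution, provided elsewhere */

void black_scholes(float S, float X, float r, float v, float T,
                   float *call, float *put)
{
    const float sqrtT    = sqrtf(T);
    const float d1       = (logf(S / X) + (r + 0.5f * v * v) * T) / (v * sqrtT);
    const float d2       = d1 - v * sqrtT;      /* reuse d1: no second log/divide */
    const float discount = X * expf(-r * T);    /* e^{-rT} computed once */

    *call = S * CND(d1) - discount * CND(d2);
    *put  = discount - S + *call;               /* put-call parity: no extra CND calls */
}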
Precision, Accuracy and Domain
Understand the Floating-Point Arithmetic Units
• The Vector Processing Unit (VPU) executes vector FP instructions
• The x87 unit also exists and can execute FP instructions as well
• The compiler chooses which unit to use for an FP operation
• The VPU is the preferred unit because of its speed
– The VPU can also make FP results reproducible
• x87 should be used for only two reasons:
– To reproduce the same results as 15 years ago, right or wrong
– To generate FP exceptions for debugging purposes
• The Intel Compiler defaults to the VPU; the user can override this with -fp-model strict
Choose the Right Precision for Your Problem
• Understand the precision requirements of your problem
– For some algorithms single precision is good enough
– Example 1: Newton-Raphson function approximation
– Example 2: Monte Carlo, if rounding error is controlled
• SP will always be faster, by at least 2x
• Mixed precision is also an option (see the sketch after the table below)
• Conversions between the two FP formats are not free
Parameter                 Single           Double           Extended Precision (IEEE_X)*
Format width in bits      32               64               128
Sign width in bits        1                1                1
Mantissa width in bits    23               52               112 (113 implied)
Exponent width in bits    8                11               15
Max binary exponent       +127             +1023            +16383
Min binary exponent       -126             -1022            -16382
Exponent bias             +127             +1023            +16383
Max value                 ~3.4 x 10^38     ~1.8 x 10^308    ~1.2 x 10^4932
Min normalized value      ~1.2 x 10^-38    ~2.2 x 10^-308   ~3.4 x 10^-4932
Min denormalized value    ~1.4 x 10^-45    ~4.9 x 10^-324   ~6.5 x 10^-4966
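As an illustration of the mixed-precision option above (a minimal sketch, not from the deck; the function name and Monte Carlo-style reduction are made up): keep the per-sample arithmetic in single precision and do only the accumulation in double, so rounding error stays controlled.

/* Per-sample math in float, accumulation in double. */
double sum_payoffs(const float *samples, int n, float strike)
{
    double sum = 0.0;                      /* double accumulator */
    for (int i = 0; i < n; i++) {
        float payoff = samples[i] - strike;
        if (payoff < 0.0f) payoff = 0.0f;  /* max(payoff, 0) in float */
        sum += payoff;                     /* one float-to-double conversion per sample */
    }
    return sum;
}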
Use the Right Accuracy Mode
• Accuracy affects the performance of your program
• Choose the accuracy your problem requires
• Mixing accuracies gives the accuracy of the lowest one
• Choices for accuracy
– Intel MKL accuracy modes HA, LA, EP: API call
vmlSetMode(VML_EP); (see the sketch below)
– Intel® Compiler: compiler switches
-fimf-precision=low|medium|high
-fimf-accuracy-bits=11
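A minimal sketch of the MKL accuracy-mode API (illustrative only; the wrapper function and data are made up, but vmlSetMode() and vsExp() are standard MKL VML entry points):

#include <mkl.h>

void exp_with_ep_accuracy(const float *in, float *out, MKL_INT n)
{
    /* Select Enhanced Performance (lowest-accuracy, fastest) mode
       for subsequent VML calls on this thread. */
    vmlSetMode(VML_EP);

    /* Vectorized single-precision exponential from MKL VML. */
    vsExp(n, in, out);
}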
Understand the Domain of Your Problem
• The 80/20 rule of computer arithmetic: 20% of the time is spent getting good results for 80% of the inputs, and 80% of the time is spent getting the corner cases right
• Every function call has to check for NaNs, denormals, etc.
• Excluding corner cases can result in higher performance
• The Intel Compiler supports domain exclusion
– Use -fimf-domain-exclusion=<n1>
– <n1> is a bitwise OR of the masks below (see the sketch after the table)
Values to Exclude    Mask
none                 0
Extreme values       1
NaNs                 2
Infinities           4
Denormals            8
Zeros                16

– 15: common exclusions
– 31: avoid all corner cases
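A sketch of how the mask values combine (illustrative only; the enum names are made up, but the numeric values are the mask bits listed above):

#include <stdio.h>

enum {
    EXCLUDE_EXTREMES   = 1,
    EXCLUDE_NANS       = 2,
    EXCLUDE_INFINITIES = 4,
    EXCLUDE_DENORMALS  = 8,
    EXCLUDE_ZEROS      = 16
};

int main(void)
{
    /* 1|2|4|8 == 15, the "common exclusions" value */
    int common = EXCLUDE_EXTREMES | EXCLUDE_NANS |
                 EXCLUDE_INFINITIES | EXCLUDE_DENORMALS;
    /* adding Zeros gives 31: avoid all corner cases */
    int all = common | EXCLUDE_ZEROS;
    printf("-fimf-domain-exclusion=%d or %d\n", common, all);
    return 0;
}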
Combinations of Compiler Switches
• Lowest-precision sequences for SP/DP
-fimf-precision=low -fimf-domain-exclusion=15
• Low precision for DP
-fimf-domain-exclusion=15 -fimf-accuracy-bits=22
• Low precision for SP, even lower for DP
-fimf-precision=low -fimf-domain-exclusion=11
• Lower accuracy than the default 4 ulps, but higher than the options above
-fimf-max-error=2048 -fimf-domain-exclusion=15
• Adding the domain-exclusion option -fimf-domain-exclusion=15 to the default
-fp-model fast=2
• Vectorized, high-precision division, square root and transcendental functions from libsvml
-fp-model precise -no-prec-div -no-prec-sqrt -fast-transcendentals -fimf-precision=high
Lab Step 2: Scalar and Serial Optimization
Step 2: Scalar and Serial Optimization
• Inspect your source code for language-related inefficiencies
– Type your constants
– Be explicit about C/C++ runtime API calls
• Experiment with your precision and accuracy settings
– -fimf-precision=low
– -no-prec-div
– -no-prec-sqrt
• Experiment with domain exclusion
– -fimf-domain-exclusion=15
Summary
• Optimize your algorithm first
• Avoid unexpected C/C++ type conversions
• Choose the right representation and accuracy level
• Experiment with -fp-model fast=2 in the Intel Compiler
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Compiler Support
FP Switches for the Intel Compiler and GCC
-fp-model for the Intel Compiler
• fast[=1]                optimized for performance (default)
• fast=2                  more aggressive approximations
• precise                 value-safe optimizations only
• source|double|extended  imply "precise" unless overridden
• except                  enable floating-point exception semantics
• strict                  precise + except + disable fma + don't assume the default floating-point environment
Floating-Point Controls in GCC
• -f[no-]fast-math is the high-level option
• It is off by default (different from the Intel Compiler)
• -Ofast turns on -ffast-math
• -funsafe-math-optimizations turns on re-association
• Reproducibility of exceptions
• Assumptions about floating-point environment
Value Safety
In SAFE mode, the compiler's hands are tied.
All of the following transformations are prohibited:
x / x → 1.0                  x could be 0.0, ∞, or NaN
x - y → -(y - x)             if x equals y, x - y is +0.0 while -(y - x) is -0.0
x - x → 0.0                  x could be ∞ or NaN
x * 0.0 → 0.0                x could be -0.0, ∞, or NaN
x + 0.0 → x                  x could be -0.0
(x + y) + z → x + (y + z)    general re-association is not value safe (see the sketch after this list)
(x == x) → true              x could be NaN
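A minimal sketch of why general re-association is not value safe (the values are chosen only to make the rounding visible):

#include <stdio.h>

int main(void)
{
    float x = 1.0e8f, y = -1.0e8f, z = 1.0f;

    /* (x + y) + z == 1.0f, but in x + (y + z) the intermediate
       y + z rounds back to -1.0e8f, so the result is 0.0f. */
    printf("(x + y) + z = %g\n", (x + y) + z);
    printf("x + (y + z) = %g\n", x + (y + z));
    return 0;
}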
Optimization at Stake
• Reassociation
• Flush-to-zero
• Expression Evaluation, various mathematical simplifications
• Approximate divide and sqrt
• Math library approximations
Floating-Point Behavior
Floating-point exception flags are set by Intel IMCI
• Unmasking and trapping are not supported
• Attempts to unmask will result in a segmentation fault
• -fp-trap (C) is disabled
• -fp-model except or strict will yield (slow!) x87 code that supports unmasking and trapping of floating-point exceptions (a host-side sketch follows)
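For reference, a host-side sketch of unmasking FP exceptions with the glibc extension feenableexcept() (a GNU extension, not from the deck; on the coprocessor this additionally requires the x87 path via -fp-model except or strict):

#define _GNU_SOURCE
#include <fenv.h>
#include <stdio.h>

int main(void)
{
    /* Unmask invalid-operation, divide-by-zero and overflow so they
       trap (SIGFPE) instead of silently producing NaN/Inf. */
    feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);

    volatile double zero = 0.0;
    double bad = 1.0 / zero;   /* now raises SIGFPE instead of returning Inf */
    printf("%f\n", bad);
    return 0;
}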
Denormals are supported by Intel IMCI (like the host)
• Needs -no-ftz or -fp-model precise
512-bit vector transcendental math functions are available
• 4 elementary functions are available: RECIP, RSQRT, EXP2, LOG2
• DIV and SQRT benefit from these 4 functions
• SVML can even be inlined to avoid function-call overhead
• Many options to select different implementations
• See "Differences in Floating-Point Arithmetic between Intel® Xeon® Processors and the Intel® Xeon Phi™ Coprocessor" for details and status
Further Information
• Microsoft Visual C++* Floating-Point Optimization
http://msdn2.microsoft.com/en-us/library/aa289157(vs.71).aspx
• The Intel® C++ and Fortran Compiler Documentation, "Floating Point Operations"
• "Consistency of Floating-Point Results using the Intel® Compiler"
http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/
• "Differences in Floating-Point Arithmetic between Intel® Xeon® Processors and the Intel® Xeon Phi™ Coprocessor"
http://software.intel.com/sites/default/files/article/326703/floating-point-differences-sept11.pdf
• Goldberg, David: "What Every Computer Scientist Should Know About Floating-Point Arithmetic", ACM Computing Surveys, March 1991, p. 203