Vectorization
Shuo Li
Financial Services Engineering
Software and Services Group
Intel Corporation
Legal Notices
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN
INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS
ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION
IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or
characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have
no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to
change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate
from published specifications. Current characterized errata are available on request.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change
without notice.
This document contains information on products in the design phase of development.
Cilk, Core Inside, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, the Intel Sponsors of Tomorrow. logo, Intel
StrataFlash, Intel vPro, Itanium, Itanium Inside, MCS, MMX, Pentium, Pentium Inside, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside,
XMM, are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or
other countries.
Copyright © 2012 Intel Corporation. All rights reserved.
iXPTC 2013
Intel® Xeon Phi ™Coprocessor
Agenda
• Vectorization Overview
• Compiler-based Autovectorization
• Lab: Step 3 Vectorization
• Intel Cilk Plus for Vectorization
• Intel C/C++ Vector Classes
• Summary
Vectorization Overview
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2®, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Vectorization and SIMD Execution
• SIMD
– Flynn’s Taxonomy: Single Instruction, Multiple Data
– The CPU performs the same operation on multiple data elements
• SISD
– Single Instruction, Single Data
• Vectorization
– In the context of Intel® Architecture processors, the process of transforming a scalar operation (SISD) that acts on a single data element into a vector operation that acts on multiple data elements at once (SIMD)
– Assuming that setup code does not tip the balance, this can result in more compact and efficient generated code
– In ”normal” or ”unvectorized” loop code, each assembly instruction deals with the data from only a single loop iteration
SIMD Abstraction – Vectorization/SIMD
for (i = 0; i < 15; i++)
if (v5[i] < v6[i])
v1[i] += v3[i];
SIMD can simplify your code by reducing the jumps and breaks in program flow control; note the lack of jumps or conditional branches in the vector sequence.
v5 = 0 4 7 8 3 9 2 0 6 3 8 9 4 5 0 1
v6 = 9 4 8 2 0 9 4 5 5 3 4 6 9 1 3 0
vcmppi_lt k7, v5, v6
k7 = 1 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0
vaddpi v1{k7}, v1, v3
[Figure: in lanes where k7 = 1 (i.e., where v5[i] < v6[i]), v1[i] receives v1[i] + v3[i]; all other lanes of v1 are left unchanged.]
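The masked compare-and-add above can be emulated lane by lane in scalar C++ to make the semantics concrete. A minimal sketch (the function names are mine, and the v1/v3 contents in a real run are whatever the program holds; only the v5/v6/k7 values come from the slide):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Emulate vcmppi_lt: build a 16-bit mask, one bit per lane,
// set when a[i] < b[i].
inline uint16_t cmplt_mask(const std::array<int, 16>& a,
                           const std::array<int, 16>& b) {
    uint16_t k = 0;
    for (int i = 0; i < 16; ++i)
        if (a[i] < b[i]) k |= static_cast<uint16_t>(1u << i);
    return k;
}

// Emulate vaddpi v1{k7}, v1, v3: add only in lanes whose mask bit is set;
// all other lanes of v1 are left unchanged.
inline void masked_add(std::array<int, 16>& v1,
                       const std::array<int, 16>& v3, uint16_t k) {
    for (int i = 0; i < 16; ++i)
        if (k & (1u << i)) v1[i] += v3[i];
}
```

Feeding the slide's v5 and v6 into cmplt_mask reproduces exactly the k7 row shown above, which is why the vectorized loop needs no branches.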
Software Behind the Vectorization
float *restrict A, *B, *C;
for (i = 0; i < n; i++) {
    A[i] = B[i] + C[i];
}
Vector (or SIMD) code computes more than one element at a time:
• [SSE2] 4 elems at a time: addps xmm1, xmm2
• [AVX] 8 elems at a time: vaddps ymm1, ymm2, ymm3
• [IMCI] 16 elems at a time: vaddps zmm1, zmm2, zmm3
[Figure: SIMD register widths — x87, SSE2 (bits 127:0), AVX (bits 255:0), IMCI (bits 511:0) — and a 16-lane operation in which lanes X15..X0 are combined with Y15..Y0 to produce X15opY15 .. X0opY0.]
Hardware Resources Behind the Vectorization
• The CPU has a lot of computation power in the form of its SIMD units
• XMM (128-bit) can operate on:
– 16x chars
– 8x shorts
– 4x dwords/floats
– 2x qwords/doubles/float complex
• YMM (256-bit) can operate on:
– 32x chars
– 16x shorts
– 8x dwords/floats
– 4x qwords/doubles/float complex
– 2x double complex
• Intel® Xeon Phi™ Coprocessor (512-bit) can operate on:
– 16x chars/shorts (converted to int)
– 16x dwords/floats
– 8x qwords/doubles/float complex
– 4x double complex
SIMD Abstraction – Options Compared
From greatest ease of use / code maintainability (depends on problem) to greatest programmer control:
• Compiler-based autovectorization annotation: #pragma vector, #pragma ivdep, #pragma simd
• Intel® Cilk™ Plus technology: Elemental Functions and Array Notation
• C/C++ Vector Classes (F32vec16, F64vec8)
• Vector intrinsics (_mm_add_ps, addps)
Compiler-based Autovectorization
• The compiler recreates vector instructions from the serial program
• The compiler makes decisions based on a set of assumptions
• The programmer reassures the compiler about those assumptions
– The compiler takes the directives and compares them with its analysis of the code
#pragma simd reduction(+:sum)
for (i = 0; i < *p; i++) {
    a[i] = b[i]*c[i];
    sum = sum + a[i];
}

• Compiler checks for this loop:
– Is “*p” loop invariant?
– Are a, b, and c loop invariant?
– Does a[] overlap with b[], c[], and/or sum?
– Is the “+” operator associative? (Does the order of “add”s matter?)
– Is vector computation on the target expected to be faster than scalar code?
• Compiler confirms (given the pragma):
– “*p” is loop invariant
– a[] is not aliased with b[], c[], and sum
– sum is not aliased with b[] and c[]
– The “+” operation on sum is associative (the compiler can reorder the “add”s on sum)
– Vector code is to be generated even if it could be slower than scalar code
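The same reassurances can also come from the source code itself rather than from a pragma. A hedged sketch (the function and variable names here are mine, not from the lab code): hoisting *p into a local answers the invariance question, and the common __restrict extension answers the aliasing questions.

```cpp
#include <cstddef>

// __restrict is a widely supported compiler extension (GCC/Clang/MSVC/ICC)
// declaring that a, b, and c do not alias each other.
float mul_sum(float* __restrict a, const float* __restrict b,
              const float* __restrict c, const int* p) {
    const int n = *p;    // hoist *p: now provably loop invariant
    float sum = 0.0f;    // scalar reduction the compiler may reorder
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] * c[i];
        sum += a[i];
    }
    return sum;
}
```

With these guarantees in the source, an autovectorizer can often vectorize the loop without any annotation at all.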
Hints to Compiler for Vectorization Opportunities
#pragma / Semantics:
#pragma ivdep
  Ignore vector dependences unless they are proven by the compiler.
#pragma vector always [assert]
  If the loop is vectorizable, ignore any benefit analysis; if the loop did not vectorize, give a compile-time error message via assert.
#pragma novector
  Specifies that a loop should never be vectorized, even if it is legal to do so, for cases where avoiding vectorization is desirable (i.e., when vectorization results in a performance regression).
#pragma vector aligned / unaligned
  Instructs the compiler to use aligned (unaligned) data movement instructions for all array references when vectorizing.
#pragma vector temporal / nontemporal
  Directs the compiler to use temporal/non-temporal (that is, streaming) stores on systems based on IA-32 and Intel® 64 architectures; optionally takes a comma-separated list of variables.
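These hints can be combined on a single loop. A minimal sketch (the function is mine; on non-Intel compilers the pragmas are simply ignored with a warning, so the loop remains correct either way):

```cpp
#include <cstddef>

// Scale b[] by k into a[]. The pragmas promise the compiler that there are
// no loop-carried dependences and that a and b are suitably aligned for the
// target's vector loads/stores; the compiler still verifies what it can.
void scale(float* a, const float* b, std::size_t n, float k) {
    #pragma ivdep
    #pragma vector aligned
    for (std::size_t i = 0; i < n; ++i)
        a[i] = k * b[i];
}
```

Note the asymmetry: #pragma ivdep only waives *unproven* dependences, while #pragma vector aligned is an unchecked promise, so pair it with genuinely aligned allocations.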
Demand vectorization by annotation
- #pragma simd
• Syntax: #pragma simd [<clause-list>]
– Mechanism to force vectorization of a loop
– Programmer: asserts a loop ought to be vectorized
– Compiler: vectorizes the loop or gives an error
Clause / Semantics:
(no clause)
  Enforce vectorization of innermost loops; ignore dependencies, etc.
vectorlength (n1[, n2]…)
  Select one or more vector lengths (range: 2, 4, 8, 16) for the vectorizer to use.
private (var1, var2, …, varN)
  Scalars private to each iteration. Initial value broadcast to all instances. Last value copied out from the last loop iteration instance.
linear (var1:step1, …, varN:stepN)
  Declare induction variables and corresponding positive integer step sizes (in multiples of vector length).
reduction (operator:var1, var2, …, varN)
  Declare the private scalars to be combined at the end of the loop using the specified reduction operator.
[no]assert
  Direct the compiler to assert when vectorization fails. Default is to assert for the SIMD pragma.
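Putting a clause to work, a dot product with a reduction clause might look like this (a sketch with my own function name; on non-Intel compilers the pragma is ignored with a warning, but the loop still computes the same result):

```cpp
// reduction(+:sum) tells the vectorizer that each SIMD lane may keep a
// private partial sum, to be combined with + after the loop; without the
// clause the compiler would have to prove the reordering safe itself.
float dot(const float* a, const float* b, int n) {
    float sum = 0.0f;
    #pragma simd reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```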
Annotate Black-Scholes for Vectorization
#pragma simd vectorlength(64)
#pragma vector aligned
#pragma vector nontemporal (CallResult, PutResult)
for(int opt = 0; opt < OptPerThread; opt++)
{
float CNDD1;
float CNDD2;
float T = OptionYears[opt];
float X = OptionStrike[opt];
float S = StockPrice[opt];
float sqrtT = sqrtf(T);
float d1 = log2f(S/X)/(VLOG2E*sqrtT) + RVV*sqrtT;
float d2 = d1 - VOLATILITY * sqrtT;
CNDD1 = HALF + HALF*erff(M_SQRT1_2*d1);
CNDD2 = HALF + HALF*erff(M_SQRT1_2*d2);
float XexpRT = X*exp2f(RLOG2E * T);
float CallVal = S * CNDD1 - XexpRT * CNDD2;
float PutVal = CallVal + XexpRT - S;
CallResult[opt] = CallVal ;
PutResult[opt] = PutVal ;
}
bs_sp.c(174): (col. 2) remark: loop was not vectorized: existence of vector dependence.
bs_sp.c(196): (col. 3) remark: pragma supersedes previous setting.
bs_sp.c(196): (col. 3) remark: SIMD LOOP WAS VECTORIZED.
bs_sp.c(190): (col. 6) remark: loop was not vectorized: not inner loop.
bs_sp.c(189): (col. 2) remark: loop was not vectorized: not inner loop.
bs_sp.c(235): (col. 3) remark: LOOP WAS VECTORIZED.

Compiler invocation options:
-fno-alias
  No pointer aliasing in the program.
-[no-]restrict -std=c99
  Enable/disable the restrict keyword for pointer disambiguation.
-vec-report[n], -opt-report-phase hpo
  Turn on the vectorization report.
Get Your Code Vectorized by Intel Compiler
• Data layout: AoS -> SoA
  Array of Structures:
    S0 X0 T0
    S1 X1 T1
    …  …  …
  Structure of Arrays:
    S0 S1 …
    X0 X1 …
    T0 T1 …
• Data alignment (next slide)
• Make the loop innermost
• Function-call treatment:
  – Inline it yourself
  – Use __forceinline
  – Define your own vector version
  – Call the vector math library (SVML)
• Adopt a jumpless algorithm
• Read/write is OK if it is contiguous
• Loop-carried dependency:
  Not a true dependency (vectorizable):
    for(int i = TIMESTEPS; i > 0; i--)
    #pragma simd
    #pragma unroll(4)
    for(int j = 0; j <= i - 1; j++)
      cell[j] = puXDf*cell[j+1] + pdXDf*cell[j];
    CallResult[opt] = (Basetype)cell[0];
  A true dependency (not vectorizable):
    for (j = 1; j < MAX; j++)
      a[j] = a[j] + c * a[j-n];
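The AoS -> SoA transformation above can be sketched directly; the struct and field names below follow the slide's S/X/T columns, while the conversion function itself is my own illustration:

```cpp
#include <vector>

struct OptionAoS { float S, X, T; };   // one option per struct (AoS)

struct OptionsSoA {                    // one contiguous array per field (SoA)
    std::vector<float> S, X, T;
};

// After conversion, each field is a unit-stride stream, which is exactly
// the layout vector loads and stores want.
OptionsSoA to_soa(const std::vector<OptionAoS>& aos) {
    OptionsSoA soa;
    soa.S.reserve(aos.size());
    soa.X.reserve(aos.size());
    soa.T.reserve(aos.size());
    for (const OptionAoS& o : aos) {
        soa.S.push_back(o.S);
        soa.X.push_back(o.X);
        soa.T.push_back(o.T);
    }
    return soa;
}
```

The conversion itself is a one-time, memory-bound pass; it pays off when the hot loops over S, X, and T run many times.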
Memory Alignment
• Memory allocated on the heap:
– _mm_malloc(size_t size, size_t align)
– scalable_aligned_malloc(size_t size, size_t align)
• Declared (static) memory:
– __attribute__((aligned(n))) float v1[];
– __declspec(align(n)) float v2[];
• Use this to notify the compiler:
– __assume_aligned(array, n);
• Natural boundary:
– Unaligned access can fault the processor
Instruction | Length   | Alignment
SSE         | 128 bits | 16 bytes
AVX         | 256 bits | 32 bytes
IMCI        | 512 bits | 64 bytes
• Cache-line boundary:
– Frequently accessed data should be aligned to the 64-byte cache line
• 4K boundary:
– Sequentially accessed large data should be aligned to a 4 KB page boundary
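A 64-byte-aligned allocation (matching the IMCI vector width and cache line) can be sketched in portable C++17; this uses std::aligned_alloc rather than the slide's _mm_malloc, but the intent is the same, and the helper name is my own:

```cpp
#include <cstdint>
#include <cstdlib>

// Allocate n floats on a 64-byte boundary. std::aligned_alloc requires
// the total size to be a multiple of the alignment, so round up.
float* alloc_aligned_floats(std::size_t n) {
    std::size_t bytes = ((n * sizeof(float) + 63) / 64) * 64;
    return static_cast<float*>(std::aligned_alloc(64, bytes));
}
```

Memory from this helper is released with std::free; _mm_malloc'd memory, by contrast, must be released with _mm_free, as the lab slide below notes.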
Vectorized C/C++ Runtime Functions
• The Intel Compiler provides a set of vectorized runtime functions
– It’s free: you call them serially, and the compiler can still vectorize the code
• Multiple accuracy versions exist: high, medium, and low
• Choose the right version by using -fimf-precision=low
• Compile with the -S switch to inspect the disassembly
• If any of these functions can be inlined, you should ask for it:
    vmovaps %zmm21, %zmm0
    call    __svml_erff16_ep
• Use the advanced compiler switches:
    -fimf-precision=low -fimf-domain_exclusion=31
  or
    -fp-model fast=2
• Vectorized functions include:
  acos, acosh, asin, asinh, atan, atan2, atanh, cbrt,
  ceil, cos, cosh, erf, erfc, erfinv, exp, exp2,
  fabs, floor, fmax, fmin, log, log10, log2, pow,
  round, sin, sinh, sqrt, tan, tanh, trunc
Lab: Step 3 Vectorization
Vectorization of Monte Carlo European Options
• Identify the loop to be vectorized; remember: the innermost loop
• Ensure alignment of dynamically allocated memory
– Driver.cpp: malloc(size) -> _mm_malloc(size, align)
– Driver.cpp: free() -> _mm_free()
• Self-inline the simple macro max
– float callValue = max(0.0, Sval*expf(MuByT+VBySqrtT*random[pos])-Xval);
  becomes:
– float callValue = Sval*expf(MuByT+VBySqrtT*random[pos])-Xval;
– callValue = (callValue > 0) ? callValue : 0;
• Add annotations
– #pragma vector aligned
– #pragma simd reduction(+:val) reduction(+:val2)
– #pragma unroll(4)
• Makefile
– Add -xAVX -vec-report6 to your compiler invocation line
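The max() rewrite in the lab can be checked for equivalence in isolation; a small sketch (the function names and test values are mine, the two forms come from the slide):

```cpp
#include <algorithm>

// Original form: a macro/function call inside the hot loop.
inline float payoff_max(float s) { return std::max(0.0f, s); }

// Rewritten form from the lab: compute first, clamp after. The ternary
// select maps naturally onto a vector compare + blend, with no branch.
inline float payoff_select(float s) { return (s > 0.0f) ? s : 0.0f; }
```

The two forms agree for every input, which is what makes the rewrite a safe enabling step for vectorization.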
Intel® Cilk™ Plus for Vectorization
Intel® Cilk™ Plus Technology: Elemental Functions
• Allows you to define data operations using scalar syntax
• The compiler applies the operation to data arrays in parallel, utilizing both SIMD parallelism and core parallelism

The programmer:
1. Writes standard C/C++ scalar syntax
2. Annotates it with __declspec(vector)
3. Uses one of the parallel syntax choices to invoke the function

The Intel Compiler with Cilk Plus technology:
1. Generates vector code with SIMD instructions
2. Invokes the function iteratively, until all elements are processed
3. Executes on a single core, or uses the task scheduler to execute on multiple cores

__declspec (vector)
double BlackScholesCall(double S,
                        double K,
                        double T)
{
    double d1, d2, sqrtT = sqrt(T);
    d1 = (log(S/K)+R*T)/(V*sqrtT)+0.5*V*sqrtT;
    d2 = d1-(V*sqrtT);
    return S*CND(d1) - K*exp(-R*T)*CND(d2);
}

cilk_for (int i = 0; i < NUM_OPTIONS; i++)
    call[i] = BlackScholesCall(SList[i],
                               KList[i],
                               TList[i]);
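A self-contained scalar version of that elemental function can be written in standard C++; with the Intel compiler you would add __declspec(vector) and drive it from cilk_for. The constants R and V and the CND definition below are my own stand-ins, not values from the lab:

```cpp
#include <cmath>

const double R = 0.02;   // hypothetical risk-free rate
const double V = 0.30;   // hypothetical volatility

// Cumulative normal distribution via the error function:
// N(x) = 0.5 * erfc(-x / sqrt(2)).
double CND(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// Scalar syntax: one option in, one price out. An annotated elemental
// version would be vectorized across many options by the compiler.
double BlackScholesCall(double S, double K, double T) {
    double sqrtT = std::sqrt(T);
    double d1 = (std::log(S / K) + R * T) / (V * sqrtT) + 0.5 * V * sqrtT;
    double d2 = d1 - V * sqrtT;
    return S * CND(d1) - K * std::exp(-R * T) * CND(d2);
}
```

The key point of the elemental-function model is that this scalar body is the only code the programmer writes; lane replication is the compiler's job.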
Intel® Cilk™ Plus Array Notation
• C/C++ language extension supported by the Intel® Compiler
• Based on the concept of array-section notation:
  <array>[<low_bound> : <len> : <stride>][<low_bound> : <len> : <stride>]…
• C/C++ operators / function calls:
– d[:] = a[:] + (b[:] * c[:])
– b[:] = exp(a[:]); // Call exp() on each element of a[]
• Reductions combine array section elements to generate a scalar result
– Nine built-in reduction functions supporting basic C data types:
  add, mul, max, max_ind, min, min_ind, all_zero, all_non_zero, any_nonzero
– User-defined reduction functions are supported
– Built-in reductions provide the best performance
• Section examples:
  float a[10]; … = a[:];     // all ten elements, indices 0..9
  float a[10]; … = a[2:6];   // six elements, indices 2..7
  float a[10]; … = a[0:3:2]; // three elements with stride 2: indices 0, 2, 4
  float c[10][10]; … = c[:][5]; // element 5 of each row
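Array notation itself requires the Intel compiler, but its meaning is easy to pin down: each section expression is an elementwise loop. A plain-C++ sketch of the two operator examples above (function names are mine):

```cpp
#include <cmath>
#include <cstddef>

// d[:] = a[:] + (b[:] * c[:]) as an explicit elementwise loop.
void add_mul_sections(float* d, const float* a, const float* b,
                      const float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        d[i] = a[i] + b[i] * c[i];
}

// b[:] = exp(a[:]); call exp() on each element of a[].
void exp_section(float* b, const float* a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        b[i] = std::exp(a[i]);
}
```

Because the notation rules out aliasing between distinct sections, the Intel compiler can vectorize these without the dependence analysis an ordinary loop would need.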
Intel® C/C++ Vector Classes
Vector Classes: Boxed Intrinsic Data Types
• The Intel® C/C++ Compiler provides C++ classes that wrap the vector registers and vector intrinsics
– Class interface for native vector data types such as __m512
– Class constructors use broadcast intrinsic functions
– Overloaded operators for basic arithmetic and bitwise operations: +, -, *, /, &, |, ^
– Transcendental function interfaces: exp(a) wraps _mm512_exp_ps(a)
– Defined reduction operations such as reduce_add(), reduce_and(), reduce_min()
• Classes
– Intel® Xeon® processors with the SSE4.2 ISA: F32vec4, F64vec2
– Intel® Xeon® processors that support AVX: F32vec8, F64vec4
– Intel® Xeon Phi™ coprocessors: F32vec16, F64vec8, I32vec16, Is32vec16, Iu32vec16, I64vec8
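The "boxing" idea can be illustrated without the Intel headers. This is a toy 4-wide class in the same style as F32vec4; the real Intel classes wrap actual vector registers and intrinsics, whereas this sketch merely wraps an array:

```cpp
#include <cstddef>

// Toy stand-in for F32vec4: value-type wrapper with a broadcast
// constructor, overloaded operators, and a reduction, mirroring the
// class interface described above.
struct Vec4f {
    float v[4];
    Vec4f() : v{0, 0, 0, 0} {}
    explicit Vec4f(float x) { for (float& e : v) e = x; }  // broadcast ctor
    friend Vec4f operator+(Vec4f a, const Vec4f& b) {
        for (std::size_t i = 0; i < 4; ++i) a.v[i] += b.v[i];
        return a;
    }
    friend Vec4f operator*(Vec4f a, const Vec4f& b) {
        for (std::size_t i = 0; i < 4; ++i) a.v[i] *= b.v[i];
        return a;
    }
};

// Horizontal reduction, as reduce_add() does in the Intel classes.
inline float reduce_add(const Vec4f& a) {
    return a.v[0] + a.v[1] + a.v[2] + a.v[3];
}
```

The design point: because operators are overloaded, kernel code written against Vec4f reads like scalar code, while each operator expands to one vector instruction in the real classes.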
Generic Computing with Vector Classes
• The Intel Compiler provides vector classes for all SIMD lengths
– Intel® SSE2 and later 128-bit SIMD: F32vec4, F64vec2
– Intel® AVX-based 256-bit SIMD: F32vec8, F64vec4
– Intel® IMCI 512-bit SIMD: F32vec16, F64vec8
• Template method definitions can abstract out the SIMD class and length
– Create a template that takes a vector class and a fundamental type as inputs
  • Instead of F32vec16 foo(F32vec16 a), which works only on the Intel MIC architecture,
  • try the generic SIMDType foo_t<SIMDType, BaseType>(SIMDType a)
– The compiler creates a version of the template for each class the user instantiates
  • int laneN = sizeof(SIMDType)/sizeof(BaseType); // the number of SIMD lanes
  • int alignN = sizeof(SIMDType)/sizeof(char); // minimum SIMD alignment
  • SIMDType Tvec = *(SIMDType*)&Tmem[0]; // read a SIMD-full of data from Tmem
  • *(SIMDType*)&Tmem[0] = Tvec; // write a SIMD-full of data to Tmem, which points to BaseType
• Benefits
– The same code template can create different binaries on different architectures
– The same code template serves single precision and double precision
– Uses vector class constructors/methods for intrinsic function calls
Summary
• Fill all the SIMD lanes using compiler-based vectorization technology