Stepwise Optimization Framework
Shuo Li
Financial Services Group
Software and Services Group
Intel Corporation
Legal Disclaimers
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those
factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products.
• Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the
baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that
correlates with the performance improvements reported.
• SPEC, SPECint, SPECfp, SPECrate, SPECjbb, SPECvirt_sc, and SPECpower_ssj are trademarks of the Standard Performance Evaluation Corporation. See
http://www.spec.org for more information.
• Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
• Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or
model. Any difference in system hardware or software design or configuration may affect actual performance.
• Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages all of its
customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks
are accurate and reflect performance of systems available for purchase.
• Intel® AES-NI requires a computer system with an AES-NI enabled processor, as well as non-Intel software to execute the instructions in the correct
sequence. AES-NI is available on select Intel® processors. For availability, consult your reseller or system manufacturer. For more information, see Intel®
Advanced Encryption Standard Instructions (AES-NI)
• Intel® Hyper-Threading Technology is available on select Intel® Xeon® processors. Requires an Intel® HT Technology-enabled system. Consult your PC
manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors
support HT Technology, visit http://www.intel.com/info/hyperthreading.
• Intel® Turbo Boost Technology requires a platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology
performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system
delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost
• No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with
Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launched
environment (MLE). Intel TXT also requires the system to contain a TPM v1.2. For more information, visit http://www.intel.com/technology/security. In
addition, Intel TXT requires that the original equipment manufacturer provides TPM functionality, which requires a TPM-supported BIOS. TPM functionality
must be initialized and may not be available in all countries.
• Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality,
performance or other benefits will vary depending on hardware and software configurations. Software applications may not be compatible with all operating
systems. Consult your PC manufacturer. For more information, visit http://www.intel.com/go/virtualization.
• Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different
processor sequences. See http://www.intel.com/products/processor_number for details. Intel products are not intended for use in medical, life saving, life
sustaining, critical control or safety systems, or in nuclear facility applications. All dates and products specified are for planning purposes only and are
subject to change without notice
• Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s
current plan of record product roadmaps. Product plans, dates, and specifications are preliminary and subject to change without notice
• Intel, the Intel logo, Xeon and Xeon logo, Xeon Phi and Xeon Phi logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All dates and products specified are for planning purposes only and are subject to change without notice.
Agenda
• Programming Tools and Programming Models
• Stepwise Optimization Framework
  – Step 1: Leverage Optimized Tools and Libraries
  – Step 2: Scalar/Serial Optimization
  – Step 3: Vectorization
  – Step 4: Parallelization
  – Step 5: Scale from Multicores to Manycores
• Case Studies
• Summary
Programming Tools and Programming Models
More Cores. Wider Vectors. Performance Delivered.
Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013
• More cores: from multicore to many-core (50+ cores)
• Wider vectors: 128 bits, 256 bits, 512 bits
• Scaling performance efficiently: serial performance, task and data parallel performance, distributed performance
• Industry-leading performance from advanced compilers
• Comprehensive libraries
• Parallel programming models
• Insightful analysis tools
A Family of Parallel Programming Models
Developer Choice
• Intel® Cilk™ Plus: C/C++ language extensions to simplify parallelism. Open sourced; also an Intel product.
• Intel® Threading Building Blocks: widely used C++ template library for parallelism. Open sourced; also an Intel product.
• Domain-specific libraries: Intel® Integrated Performance Primitives, Intel® Math Kernel Library.
• Established standards: Message Passing Interface (MPI), OpenMP* (offload TR coming “real soon”), Coarray Fortran, OpenCL*.
• Research and development: Intel® Concurrent Collections, Intel® Offload Extensions, Intel® SPMD Parallel Compiler (ispc).
Choice of high-performance parallel programming models, applicable to both multicore and manycore programming.
Consistent Tools & Programming Models
High Performance Computing
Intel tools, libraries and parallel models extend from multicore to many-core and back to optimize, parallelize and vectorize: the same compiler, libraries, parallel models and code target multicore Intel® Xeon® processors, many-core Intel® Xeon Phi™ coprocessors, and clusters of Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors.
Develop & Parallelize Today for Maximum Performance
Intel® Xeon Phi™ Coprocessors: Beyond Acceleration

Cluster models
• MPI ranks only on Intel® Xeon Phi™ coprocessor cores; each rank runs Main(), MPI() and Func(). Single node or cluster. Ranks are homogeneous: standard MPI, standard compilers, standard tools.
• MPI ranks on both processors and coprocessors; each rank runs Main(), MPI() and Func(). Standard MPI, standard compilers, standard tools. Single node or cluster. Ranks are heterogeneous, opening up new possibilities (see the MPI sketch below).

Off-load model
• Serial code (Main(), MPI()) runs on the processor and parallel code (Func()) is moved to the coprocessor for execution. Language Extensions for Offload and the x86 architecture offer significant improvements in compute flexibility.

It’s your Code; It’s your Choice
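A minimal sketch (not from the deck) of the symmetric MPI model described above: the same standard MPI source is compiled once for the host and once for the coprocessor, and ranks from both binaries join the same MPI_COMM_WORLD. The program below only reports where each rank runs; names are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* Each rank reports its host name, which shows whether it landed on
       a Xeon processor node or on a Xeon Phi coprocessor (e.g. mic0). */
    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

Built twice (for example with mpiicc for the host and mpiicc -mmic for the coprocessor; exact flags are an assumption), the two binaries can be launched together so host and coprocessor ranks cooperate in one job.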
Software Development Ecosystem(1) for Intel® Xeon Phi™ Coprocessor

Compilers, run environments
  Open source: gcc (kernel build only, not for applications), python*
  Commercial: Intel® C++ Compiler, Intel® Fortran Compiler, MYO, CAPS* HMPP* 3.2.5 (Beta) compiler, PGI*, PGAS GPI (Fraunhofer ITWM), ISPC
Debugger
  Open source: gdb
  Commercial: Intel Debugger, RogueWave* TotalView* 8.9, Allinea* DDT 3.3
Libraries
  Open source: TBB(2), MPICH2 1.5, FFTW, NetCDF
  Commercial: NAG*, Intel® Math Kernel Library, Intel® MPI Library
  Also: OpenMP* (in Intel compilers), Intel® Cilk™ Plus (in Intel compilers), Coarray Fortran (Intel), Intel® Integrated Performance Primitives, MAGMA, Accelereyes ArrayFire 2.0 (Beta), Boost C++ Libraries 1.47+
Profiling & analysis tools
  Intel® VTune™ Amplifier XE, Intel® Trace Analyzer & Collector, Intel® Inspector XE, TAU – ParaTools 2.21.4
Virtualization
  ScaleMP vSMP Foundation 5.0, Xen 4.1.2+
Cluster, workload management, and manageability tools
  Altair* PBS Professional 11.2, Adaptive* Computing Moab 7.2, Bright Cluster Manager 6.1 (Beta), ET International SWARM (Beta), IBM Platform Computing {LSF 8.3, HPC 3.2 and PCM 3.2}, MPICH2, Univa Grid Engine 8.1.3

(1) These are all announced as of November 2012. Intel has said there are more actively being developed but not yet announced. (2) Commercial support of Intel TBB is available from Intel.
Stepwise Optimization Framework
A collection of methodologies and tools that enables developers to express parallelism for multicore and manycore computing.
Objective: turn an unoptimized program into a scalable, highly parallel application on multicore and manycore architectures.
• Step 1: Leverage Optimized Tools and Libraries
• Step 2: Scalar, Serial Optimization
• Step 3: Vectorization
• Step 4: Parallelization
• Step 5: Scale from Multicore to Manycore
Step 1: Leverage Optimized Tools and Libraries
Objective: minimize the amount of development work; avoid reinventing the wheel.
• Use an optimizing compiler
  – Enable optimization targeted at the architecture: -xAVX
  – Maximize compiler-generated code
  – Use intrinsics only as a last resort
• Compiler optimization levels:
  -O0  No optimization
  -O1  Optimization without code size increase
  -O2  Most common optimizations: vectorization, loop unrolling, function call inlining
  -O3  Advanced optimizations: loop fusion, loop interchange, cache blocking, loop splitting
• Use optimized libraries (see the sketch after this list)
  – Intel® Math Kernel Library
  – Intel® Threading Building Blocks
• Intel® Math Kernel Library domains:
  – Linear Algebra: BLAS, LAPACK, sparse solvers, ScaLAPACK
  – Fast Fourier Transforms: multidimensional, FFTW interfaces, cluster FFT
  – Vector Math: trigonometric, hyperbolic, exponential, logarithmic, power/root, rounding
  – Vector Random Number Generators: congruential, recursive, Wichmann-Hill, Mersenne Twister, Sobol, Niederreiter, RDRAND-based
  – Summary Statistics: kurtosis, variation coefficient, quantiles and order statistics, min/max, variance-covariance, …
  – Data Fitting: splines, interpolation, cell search
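To make the “use optimized libraries” advice concrete, here is a minimal sketch (not from the deck) that replaces a hand-written erf() loop with a single Intel® MKL vector-math call; the array size and names are illustrative.

#include <stdio.h>
#include <mkl_vml.h>

#define N 1024

int main(void)
{
    float in[N], out[N];
    for (int i = 0; i < N; i++)
        in[i] = (float)i / N;

    /* One VML call evaluates erf() over the whole array with
       architecture-tuned, vectorized code instead of a scalar loop. */
    vsErf(N, in, out);

    printf("erf(%f) = %f\n", in[N - 1], out[N - 1]);
    return 0;
}

Compiled with something like icc -O2 -xAVX -mkl step1_erf.c (flags assumed), the library rather than your code carries the burden of vectorization.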
Step 2: Scalar, Serial Optimization
Objective: optimize the core computation logic and understand the scaling potential of your application.
• Choose and stay at the right accuracy (a small sketch follows this list)
  – Intel MKL accuracy modes HA, LA, EP: vmlSetMode(VML_EP);
  – Compiler: -fimf-precision=low|medium|high
• Choose and stay in the right precision
  – Type your constants: const float NUM = 1.0f;
  – Use the right function API: exp() for double precision, expf() for single precision
• Minimize the impact of denormals
  – The cost is much higher on manycore: -fp-model fast=2, -fimf-domain-exclusion=15
• Calculate the compute to data access ratio
  – Use event-based sampling (EBS) from the Intel VTune performance analyzer
  – CPU_CLK_UNHALTED / INSTRUCTIONS_EXECUTED
  – Investigate if CPI per thread is > 4
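A minimal sketch (not on the slide) of the accuracy and denormal controls mentioned above, assuming Intel MKL for vmlSetMode() and an SSE3-capable compiler for the FTZ/DAZ macros.

#include <mkl_vml.h>
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

void configure_numerics(void)
{
    /* Enhanced-performance (lowest) accuracy for MKL vector math;
       VML_HA and VML_LA select the higher-accuracy modes instead. */
    vmlSetMode(VML_EP);

    /* Treat denormal inputs and results as zero so they never reach the
       slow microcoded path; this mirrors the intent of -fp-model fast=2. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}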
Step 3: Vectorization
Objective: fill up all SIMD lanes; fully utilize one core.
• SIMD parallelism requires data alignment
  – Convert the input from AoS to SoA
  – Memory declaration: __attribute__((aligned(64))) float a;
  – Memory allocation: _mm_malloc(size, align)
  – TBB: scalable_aligned_malloc(size, align)
• Branches break SIMD execution
  – Conditional logic has to be masked, at a cost
  – Function calls can be hazardous
• Start with compiler-based autovectorization (see the sketch after this list)
  – Provide hints on alignment, aliasing and data dependencies
• Calculate the VPU usage ratio
  – VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED
  – Investigate if the ratio is < 8 for DP, < 16 for SP
• Vectorization options:
  – Compiler-based autovectorization
  – Intel® Cilk™ Plus array notation
  – Intel® Cilk™ Plus elemental functions
  – C++ vector classes (F32vec16, F64vec8)
  – Vector intrinsics (_mm_add_ps, vaddps)
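A minimal sketch (not from the deck) tying the alignment advice to compiler-based autovectorization; the loop body, names and sizes are illustrative.

#include <math.h>
#include <xmmintrin.h>   /* _mm_malloc / _mm_free */

#define N 4096

void scaled_sqrt(void)
{
    /* 64-byte alignment matches the 512-bit vectors of the coprocessor
       (and is harmless on 256-bit AVX hosts). */
    float *in  = (float *)_mm_malloc(N * sizeof(float), 64);
    float *out = (float *)_mm_malloc(N * sizeof(float), 64);

    for (int i = 0; i < N; i++)
        in[i] = (float)(i + 1);

#pragma simd               /* assert the loop is safe to vectorize  */
#pragma vector aligned     /* promise the compiler aligned accesses */
    for (int i = 0; i < N; i++)
        out[i] = 0.5f * sqrtf(in[i]);

    _mm_free(in);
    _mm_free(out);
}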
Step 4: Parallelization
Objective: keep all the cores and threads busy.
• Partition the work at a high level
• Target coarse granularity
• Manage thread creation overhead
• Minimize thread synchronization
• Affinitize worker threads to processor threads
• Use Intel® Advisor XE – Thread Assistant
• Threading options: Intel® Threading Building Blocks, Intel® Cilk™ Plus, OpenMP*, pthreads* (a minimal OpenMP sketch follows)
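A minimal sketch (not from the deck) of coarse-grained partitioning with OpenMP; process_chunk() is a hypothetical per-thread work function, and thread affinity would come from KMP_AFFINITY at run time as described above.

#include <omp.h>

#define DATASIZE (1 << 20)

extern void process_chunk(int begin, int end);   /* hypothetical worker */

void run_parallel(void)
{
    /* One parallel region, one coarse chunk per thread: thread creation
       happens once and synchronization is limited to the implicit barrier. */
    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int tid      = omp_get_thread_num();
        int chunk    = (DATASIZE + nthreads - 1) / nthreads;
        int begin    = tid * chunk;
        int end      = (begin + chunk < DATASIZE) ? begin + chunk : DATASIZE;
        if (begin < DATASIZE)
            process_chunk(begin, end);
    }
}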
Step 5: Scale from Multicore to Manycore
Objective: take a parallel, vectorized program from tens to hundreds of threads.
• Reduce the memory footprint to the bare minimum
  – Use registers and caches wisely
  – Inline function calls
  – Recalculate rather than store where it is cheaper
• Improve data affinity
  – Allocate memory from the worker threads that use it (see the first-touch sketch below)
• Block the data
  – Improve memory access efficiency
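A minimal sketch (not from the deck) of the “allocate from the worker threads” and data-blocking advice: the threads that will process a block also first-touch it, so its pages land near the cores that use them. Names and sizes are illustrative; scalable_aligned_malloc() is the TBB scalable allocator mentioned earlier.

#include <omp.h>
#include <tbb/scalable_allocator.h>

#define N         (1 << 22)
#define CHUNKSIZE 1024

void first_touch_init(void)
{
    float *data = (float *)scalable_aligned_malloc(N * sizeof(float), 64);

    /* Static scheduling keeps the same thread on the same blocks when the
       computation loop later reuses this blocking structure. */
    #pragma omp parallel for schedule(static)
    for (int base = 0; base < N; base += CHUNKSIZE)
        for (int i = base; i < base + CHUNKSIZE && i < N; i++)
            data[i] = 0.0f;          /* first touch by the owning thread */

    /* ... the real per-block computation reuses the same blocking ... */

    scalable_aligned_free(data);
}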
Case Study
Black-Scholes: Workload Detail

$d_1 = \dfrac{\ln(S/X) + (r + v^2/2)\,T}{v\sqrt{T}}$

$d_2 = \dfrac{\ln(S/X) + (r - v^2/2)\,T}{v\sqrt{T}} = d_1 - v\sqrt{T}$

$c = S\,\mathrm{CND}(d_1) - X\,e^{-rT}\,\mathrm{CND}(d_2)$

$p = X\,e^{-rT}\,\mathrm{CND}(-d_2) - S\,\mathrm{CND}(-d_1)$

$p + S = c + X\,e^{-rT}$  (put-call parity)

$\mathrm{CND}(x) = \dfrac{1}{2} + \dfrac{1}{2}\,\mathrm{ERF}\!\left(\dfrac{x}{\sqrt{2}}\right)$

S: current stock price
X: option strike price
T: time to expiry
R: risk-free interest rate
V: volatility
c: European call price
p: European put price
Baseline code, compiled with GCC:
gcc -o bs_step0 -O2 bs_step0.cpp

float CND(float d)
{
    const float A1 = 0.31938153;
    const float A2 = -0.356563782;
    const float A3 = 1.781477937;
    const float A4 = -1.821255978;
    const float A5 = 1.330274429;
    const float RSQRT2PI = 0.39894228040143267793994605993438;
    float K = 1.0 / (1.0 + 0.2316419 * abs(d));
    float cnd = RSQRT2PI * exp(-0.5 * d * d) *
                (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))));
    if (d > 0)
        cnd = 1.0 - cnd;
    return cnd;
}

void BlackScholesCalc(
    float& callResult,
    float& putResult,
    float S, float X, float T,
    float R,
    float V)
{
    callResult = putResult = 0;
    float sqrtT = sqrt(T);
    float d1 = (logf(S / X) + (R + 0.5 * V * V) * T) / (V * sqrtT);
    float d2 = d1 - V * sqrtT;
    float CNDD1 = CND(d1);
    float CNDD2 = CND(d2);
    float expRT = expf(-R * T);
    callResult += S * CNDD1 - X * expRT * CNDD2;
    putResult += X * expRT * (1.0 - CNDD2) - S * (1.0 - CNDD1);
}

void BlackScholesReference(
    float *CallResult,
    float *PutResult,
    float *StockPrice,
    float *OptionStrike,
    float *OptionYears,
    float Riskfree,
    float Volatility,
    int optN)
{
    for (int i = 0; i < REPETITION; i++)
        for (int j = 0; j < DATASIZE; j++)
            BlackScholesCalc(CallResult[j],
                             PutResult[j],
                             StockPrice[j],
                             OptionStrike[j],
                             OptionYears[j],
                             Riskfree,
                             Volatility);
}

[Performance chart: million options/sec, baseline code]

Configurations: dual-socket server system with two 2.6 GHz Intel® Xeon® processors E5-2670, 32 GB (8 x 4 GB DDR3-1600 MHz). GCC version 4.4.6.
Step 1: Leverage Optimized Tools and Libraries
• Use Intel® Parallel Composer XE 2013:
  icc -o bs_step1 -O2 bs_step1.cpp
• Use the Intel C runtime library libm
  – erf(x) is the error function, directly related to CND(x)
• Performance: from 5.37 to 22.75 million options/sec; improvement 4.23x

float CND(float d)
{
    return HALF + HALF * erff(M_SQRT1_2 * d);   // HALF is the constant 0.5f
}

[Performance chart: million options/sec, baseline code vs. Step 1 (Composer XE 2013)]

Configurations: dual-socket server system with two 2.6 GHz Intel® Xeon® processors E5-2670, 32 GB (8 x 4 GB DDR3-1600 MHz). GCC version 4.4.6.
Step 2: Scalar, Serial Optimization
• Focus on the inner loop
  – Factor out loop-invariant code
  – Take advantage of put-call parity (see the note after the code)
• Performance: from 22.75 to 39.15 million options/sec; improvement 1.72x

[Performance chart: million options/sec for baseline code, Step 1 (Composer XE 2013) and Step 2 (scalar, serial optimization)]
void BlackScholesCalc(
float& callResult,
float& putResult,
float S,
float X,
float T,
float R,
float V )
{
float sqrtT = sqrtf(T);
float VsqrtT = V*sqrtT;
float d1 = logf(S / X)/VsqrtT + RVV * sqrtT;   // RVV = (R + 0.5f*V*V) / V, precomputed since R and V are constant
float d2 = d1 - VsqrtT;
float CNDD1 = CND(d1);
float CNDD2 = CND(d2);
float XexpRT = X*expf(- R * T);
callResult = S * CNDD1 - XexpRT * CNDD2;
putResult = callResult + XexpRT - S;
}
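For reference (not spelled out on the slide), the put-call parity identity behind the last line is

$p = c + X\,e^{-rT} - S$

so once the call price is known, the put costs only an add and a subtract instead of two more CND evaluations.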
Step 3: Vectorization
• Compiler-based autovectorization
  – Mark the inner loop with #pragma simd
• Aligned memory allocation
• Inline function calls
• Performance: from 39.15 to 220.51 million options/sec; improvement 5.63x

[Performance chart: million options/sec across the optimization steps]

{
    int msize = sizeof(float) * DATASIZE;
    CallResult   = (float *)_mm_malloc(msize, 64);
    PutResult    = (float *)_mm_malloc(msize, 64);
    StockPrice   = (float *)_mm_malloc(msize, 64);
    OptionStrike = (float *)_mm_malloc(msize, 64);
    OptionYears  = (float *)_mm_malloc(msize, 64);

    BlackScholesReference( … );

    _mm_free(CallResult);
    _mm_free(PutResult);
    _mm_free(StockPrice);
    _mm_free(OptionStrike);
    _mm_free(OptionYears);
}

#pragma simd
for (int j = 0; j < DATASIZE; j++)
{
    float T = OptionYears[j];
    float X = OptionStrike[j];
    float S = StockPrice[j];
    float sqrtT = sqrtf(T);
    float VsqrtT = VOLATILITY * sqrtT;
    float d1 = logf(S / X) / VsqrtT + RVV * sqrtT;
    float d2 = d1 - VsqrtT;
    float CNDD1 = HALF + HALF * erff(M_SQRT1_2 * d1);
    float CNDD2 = HALF + HALF * erff(M_SQRT1_2 * d2);
    float XexpRT = X * expf(-RISKFREE * T);
    float callResult = S * CNDD1 - XexpRT * CNDD2;
    CallResult[j] = callResult;
    PutResult[j] = callResult + XexpRT - S;
}
Step 4: Parallelization
• Use OpenMP* from Intel Composer XE 2013
• Parallelize the outer loop; distribute the data to each thread
• Add -openmp to the compiler invocation line
• Performance: from 220.51 to 3,849.49 million options/sec; improvement 17.46x

[Performance chart: million options/sec across the optimization steps]

kmp_set_defaults("KMP_AFFINITY=scatter,granularity=thread");
#pragma omp parallel
for (int i = 0; i < REPETITION; i++)
#pragma omp for
#pragma simd
for (int j = 0; j < DATASIZE; j++)
{
    float T = OptionYears[j];
    float X = OptionStrike[j];
    float S = StockPrice[j];
    float sqrtT = sqrtf(T);
    float VsqrtT = VOLATILITY * sqrtT;
    float d1 = logf(S / X) / VsqrtT + RVV * sqrtT;
    float d2 = d1 - VsqrtT;
    float CNDD1 = HALF + HALF * erff(M_SQRT1_2 * d1);
    float CNDD2 = HALF + HALF * erff(M_SQRT1_2 * d2);
    float XexpRT = X * expf(-RISKFREE * T);
    float callResult = S * CNDD1 - XexpRT * CNDD2;
    CallResult[j] = callResult;
    PutResult[j] = callResult + XexpRT - S;
}
Step 5: Scale from Multicore to Manycore
• NUMA-friendly memory allocation
  – scalable_aligned_malloc(size, align);
  – scalable_aligned_free();
• Affinitize the OpenMP worker threads
  – KMP_AFFINITY="compact,granularity=fine"
• Data blocking
• Streaming (nontemporal) writes
• Optimize exp/log (the folded constants are sketched below)
  – exp(x) = exp2(x * M_LOG2E)
  – ln(x) = log2(x) * M_LN2

Configurations: Intel Xeon Phi SE10, 61 cores at 1.1 GHz, 8 GB GDDR5 at 5.5 GT/s.
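The kernel below relies on precomputed constants whose definitions are not shown on the slide; a hedged reconstruction, with illustrative values for VOLATILITY and RISKFREE, is:

#include <math.h>

#define VOLATILITY 0.30f            /* illustrative value */
#define RISKFREE   0.02f            /* illustrative value */

/* exp(-R*T) = exp2(-R*T*log2(e)) and ln(S/X) = log2(S/X)/log2(e), so
   log2(e) can be folded into constants used with exp2f()/log2f(). */
static const float VLOG2E = VOLATILITY * (float)M_LOG2E;    /* V * log2(e)     */
static const float RLOG2E = -RISKFREE  * (float)M_LOG2E;    /* -R * log2(e)    */
static const float RVV    = (RISKFREE + 0.5f * VOLATILITY * VOLATILITY)
                            / VOLATILITY;                   /* (R + V*V/2) / V */
static const float HALF   = 0.5f;

With these definitions the d1, XexpRT and CND lines in the kernel match the formulas from the workload-detail slide.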
for (int chunkBase = 0; chunkBase < OptPerThread; chunkBase += CHUNKSIZE)
{
#pragma simd vectorlength(CHUNKSIZE)
#pragma vector aligned
#pragma vector nontemporal (CallResult, PutResult)
    for (int opt = chunkBase; opt < (chunkBase + CHUNKSIZE); opt++)
    {
        float CNDD1;
        float CNDD2;
        float CallVal = 0.0f, PutVal = 0.0f;
        float T = OptionYears[opt];
        float X = OptionStrike[opt];
        float S = StockPrice[opt];
        float sqrtT = sqrtf(T);
        float d1 = log2f(S / X) / (VLOG2E * sqrtT) + RVV * sqrtT;
        float d2 = d1 - VOLATILITY * sqrtT;
        CNDD1 = HALF + HALF * erff(M_SQRT1_2 * d1);
        CNDD2 = HALF + HALF * erff(M_SQRT1_2 * d2);
        float XexpRT = X * exp2f(RLOG2E * T);
        CallVal = S * CNDD1 - XexpRT * CNDD2;
        PutVal = CallVal + XexpRT - S;
        CallResult[opt] = CallVal;
        PutResult[opt] = PutVal;
    }
}
Build Native Intel® Xeon Phi™ Application
• Multicore:
  icpc -o bs_step4 -xAVX -openmp -ltbbmalloc bs_step4.c
• Manycore:
  icpc -o bs_step5.mic -mmic -openmp -ltbbmalloc bs_step5.c
  sudo scp bs_step5.mic mic0:/tmp
• Connect to the Intel Xeon Phi coprocessor:
  sudo ssh mic0
  root> cd /tmp
• Environment variables:
  export LD_LIBRARY_PATH=.
  export OMP_NUM_THREADS=240
  export KMP_AFFINITY=compact
• Invoke the program:
  root> ./bs_step5.mic

[Performance chart, million options/sec by step: 5.377, 22.75, 39.15, 220.51, 3,849.49, 20,259.84]

• Intel® Xeon® processor: 3,849.49 million options/sec
• Intel® Xeon Phi™ coprocessor: 20,259.84 million options/sec
Summary
• The Intel® Xeon Phi™ coprocessor 5100 is the first manycore product based on Intel Architecture. You can order it now!
• Explore the similarity between the multicore and manycore products from Intel.
• Forward-scale your application using Intel® Parallel Studio XE 2013 and the Stepwise Optimization Framework.
Option Pricing Using the Intel® Xeon Phi™ Coprocessor: an Offload Example

Demo Scenario 1:
• Baseline code
• Binomial and Monte Carlo
• Intel multicore platform
• GCC 4.4.6

Demo Scenario 2:
• Same workload
• Optimized code
• Intel multicore platform
• Intel Parallel Studio XE 2013

Demo Scenario 3:
• Same optimized code
• Binomial on Intel multicore
• Monte Carlo on Intel manycore
• Pragma-based offload syntax
template <class SIMDType, class Basetype>
void BinomialTemplate(
    Basetype *h_CallResult,
    Basetype *S,
    Basetype *X,
    Basetype *T,
    Basetype Riskfree,
    Basetype Volatility,
    int optN,
    int timestepN)
{
    int SIMDLEN = sizeof(SIMDType) / sizeof(Basetype);
    int SIMDALIGN = sizeof(SIMDType) / sizeof(char);
    printf("Binomial Option pricing: %d options in %d time steps...\n", optN, timestepN);
    int opt;
#pragma omp parallel for
    for (opt = 0; opt < optN; opt++) {
        __declspec(align(16)) Basetype Call[timestepN + 1];
        const Basetype Sx = S[opt];
        const Basetype Xx = X[opt];
        const Basetype Tx = T[opt];
        const Basetype dt = Tx / static_cast<Basetype>(timestepN);
        const Basetype vDt = Volatility * sqrtf(dt);
        const Basetype rDt = Riskfree * dt;
        const Basetype If = expf(rDt);
        const Basetype Df = expf(-rDt);
        const Basetype u = expf(vDt);
        const Basetype d = expf(-vDt);
        const Basetype pu = (If - d) / (u - d);
        const Basetype pd = 1.0f - pu;
        const Basetype puByDf = pu * Df;
        const Basetype pdByDf = pd * Df;
        for (int i = 0; i <= timestepN; i++)
        {
            Basetype d = Sx * expf(vDt * (2.0 * i - timestepN)) - Xx;
            Call[i] = (d > 0) ? d : 0;
        }
        for (int i = timestepN; i > 0; i--)
            for (int j = 0; j <= i - 1; j++)
                Call[j] = puByDf * Call[j + 1] + pdByDf * Call[j];
        h_CallResult[opt] = (Basetype)Call[0];
    }
}
Generic Programming and Offloading to Intel® Xeon Phi™ Coprocessor

#pragma omp section
#pragma offload if (use_coprocessor) target(mic:0) in(S, X, T:length(OPT_N)) \
    out(callValueMICExpected, callValueMICConfidence:length(OPT_N))
{
    montecarlo(S, X, T, callValueMICExpected, callValueMICConfidence);
}

__declspec(target(mic))
void montecarlo(float *S,
                float *X,
                float *T,
                float *callValueMICExpected,
                float *callValueMICConfidence)
{
#ifdef __MIC__
    printf("MonteCarlo Options Pricing running on Intel Xeon Phi Coprocessor.\n");
    omp_set_num_threads(240);
    MonteCarloTemplate<F32vec16, float>(callValueMICExpected,
                                        callValueMICConfidence,
                                        S, X, T, R, V, OPT_N, PATH_N);
#else
    printf("MonteCarlo Options Pricing running on Intel Xeon Processor.\n");
    fflush(stdout);
    MonteCarloTemplate<F32vec8, float>(callValueMICExpected,
                                       callValueMICConfidence,
                                       S, X, T, R, V, OPT_N, PATH_N);
#endif
}
Performance Summary of the Demo

[Chart: wall-clock time (0 to 1200) for the GCC-compiled application, the parallel Xeon application (time saved by using Intel Parallel Studio XE), and the parallel Xeon and Xeon Phi application (time saved by using the Intel Xeon Phi coprocessor)]

• Intel® Parallel Composer XE 2013 and the Stepwise Optimization Framework removed 95% of the “wall clock time” of the GCC code.
• Running the binomial option pricing on the Intel® Xeon® processor and offloading the Monte Carlo method to the Intel® Xeon Phi™ coprocessor saved additional wall-clock time.
Conclusions
• The Intel® Xeon Phi™ coprocessor is real
  – You can order it now for delivery Jan 28
• It is a coprocessor, not just an accelerator
  – Program it with open standards: C, C++, Fortran, OpenCL, OpenMP, TBB, Cilk
• The same optimizations help the Intel® Xeon® processor as well as the Intel® Xeon Phi™ coprocessor (tuning effort isn’t wasted!)
Resources
• Intel Developer Zone: http://software.intel.com/mic-developer (Intel and 3rd-party content)
  – Community/discussion
  – Training
  – Case studies (just starting)
Structured Parallel Programming using Intel® Threading Building Blocks and Intel® Cilk™ Plus
• Teaches structured parallel programming
• Designed for programmers, not computer architects
• Teaches the best methods (known as patterns)
www.parallelbook.com
Resources
• Parallel Programming Community
• Intel® Many Integrated Core (MIC) Architecture Forum
Q&A
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN
INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS
ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.
• A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal
injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL
INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND
EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES
ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY
WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE
DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
• Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the
absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition
and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here
is subject to change without notice. Do not finalize a design with this information.
• The products described in this document may contain design defects or errors known as errata which may cause the product to deviate
from published specifications. Current characterized errata are available on request.
• Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel
representative to obtain Intel's current plan of record product roadmaps.
• Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not
across different processor families. Go to: http://www.intel.com/products/processor_number.
• Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
• Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by
calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
• Sandy Bridge, Ivy Bridge, Haswell and other code names featured are used internally within Intel to identify products that are in
development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to
use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at
the sole risk of the user
• Intel, Xeon, Xeon Phi, VTune, Cilk, Sponsors of Tomorrow and the Intel logo are trademarks of Intel Corporation in the United States
and other countries.
• *Other names and brands may be claimed as the property of others.
• Copyright ©2012 Intel Corporation.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel
microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with
Intel microprocessors. Certain optimizations not specific to Intel
microarchitecture are reserved for Intel microprocessors. Please refer to the
applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer
Intel® 64 architecture requires a system with a 64-bit enabled processor, chipset, BIOS and
software. Performance will vary depending on the specific hardware and software you use. Consult your
PC manufacturer for more information. For more information, visit http://www.intel.com/info/em64t
Intel® Trusted Execution Technology (Intel® TXT): No computer system can provide absolute security
under all conditions. Intel® TXT requires a computer with Intel® Virtualization Technology, an Intel TXT
enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT compatible measured
launched environment (MLE). Intel TXT also requires the system to contain a TPM v1.2. For more
information, visit http://www.intel.com/technology/security
Intel® Virtualization Technology (Intel® VT) requires a computer system with an enabled Intel®
processor, BIOS, and virtual machine monitor (VMM). Functionality, performance or other benefits will
vary depending on hardware and software configurations. Software applications may not be compatible
with all operating systems. Consult your PC manufacturer. For more information, visit
http://www.intel.com/go/virtualization
Intel® Turbo Boost Technology requires a system with Intel Turbo Boost Technology. Intel Turbo Boost
Technology and Intel Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult
your PC manufacturer. Performance varies depending on hardware, software, and system
configuration. For more information, visit http://www.intel.com/go/turbo
Built-In Security: No computer system can provide absolute security under all conditions. Built-in
security features available on select Intel® Core™ processors may require additional software, hardware,
services and/or an Internet connection. Results may vary depending upon configuration. Consult your PC
manufacturer for more details.
Enhanced Intel SpeedStep® Technology - See the Processor Spec Finder at http://ark.intel.com or
contact your Intel representative for more information.
Intel® Hyper-Threading Technology (Intel® HT Technology) is available on select Intel® Core™
processors. Requires an Intel® HT Technology-enabled system. Consult your PC
manufacturer. Performance will vary depending on the specific hardware and software used. For more
information including details on which processors support Intel HT Technology, visit
http://www.intel.com/info/hyperthreading.
Legal Disclaimer
• Other Software Code Disclaimer
Permission is hereby granted, free of charge, to any person obtaining a copy of this software
and associated documentation files (the "Software"), to deal in the Software without restriction,
including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice (including the next paragraph) shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE
OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Risk Factors
The above statements and any others in this document that refer to plans and expectations for the second quarter, the year and
the future are forward-looking statements that involve a number of risks and uncertainties. Words such as “anticipates,” “expects,”
“intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations identify forward-looking statements.
Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements.
Many factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause
actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following
to be the important factors that could cause actual results to differ materially from the company’s expectations. Demand could be
different from Intel's expectations due to factors including changes in business and economic conditions, including supply
constraints and other disruptions affecting customers; customer acceptance of Intel’s and competitors’ products; changes in
customer order patterns including order cancellations; and changes in the level of inventory at customers. Uncertainty in global
economic and financial conditions poses a risk that consumers and businesses may defer purchases in response to negative
financial events, which could negatively affect product demand and other related matters. Intel operates in intensely competitive
industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product
demand that is highly variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of Intel
product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including
product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; and Intel’s
ability to respond quickly to technological developments and to incorporate new features into its products. Intel is in the process of
transitioning to its next generation of products on 22nm process technology, and there could be execution and timing issues
associated with these changes, including products defects and errata and lower than anticipated manufacturing yields. The gross
margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation,
including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the
timing and execution of the manufacturing ramp and associated costs; start-up costs; excess or obsolete inventory; changes in unit
costs; defects or disruptions in the supply of materials or resources; product manufacturing quality/yields; and impairments of long-lived assets, including manufacturing, assembly/test and intangible assets. The majority of Intel’s non-marketable equity
investment portfolio balance is concentrated in companies in the flash memory market segment, and declines in this market
segment or changes in management’s plans with respect to Intel’s investments in this market segment could result in significant
impairment charges, impacting restructuring charges as well as gains/losses on equity investments and interest and other. Intel's
results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its
customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions,
health concerns and fluctuations in currency exchange rates. Expenses, particularly certain marketing and compensation expenses,
as well as restructuring and asset impairment charges, vary depending on the level of demand for Intel's products and the level of
revenue and profits. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel's results could be
affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or
regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues, such as the
litigation and regulatory matters described in Intel's SEC reports. An unfavorable ruling could include monetary damages or an
injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting
Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. A detailed
discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most
recent Form 10-Q, Form 10-K and earnings release.
Rev. 5/4/12
Debugging on Intel Xeon Phi Coprocessors
Shuo Li / Financial Service Engineering
Overview