Leveraging Optimized Tools and
Libraries
Shuo Li
Financial Services Engineering
Software and Services Group
Intel Corporation
Agenda
• Lab Step 1 Baseline
• Intel® Parallel Studio XE 2013
• Lab Step 1 Using Intel Compiler
• Intel® MKL
• Lab Step 1 Using Intel Compiler and MKL
• Summary
Lab Step 1 Baseline
Monte Carlo European Option Pricing

Monte Carlo method? A statistical computing method pioneered by Nicholas Metropolis.
Monte Carlo in finance: Phelim Boyle introduced the Monte Carlo method to quantitative finance.

• Simple and repetitive algorithms
• Central Limit Theorem

1. Sample a random path for S in a risk-neutral world
2. Calculate the payoff from the derivative
3. Repeat steps 1 and 2 to get many sample values of the payoff from the derivative in a risk-neutral world
4. Calculate the mean of the sample payoffs
5. Discount the expected payoff at the risk-free rate to get an estimate of the value of the option
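In symbols, steps 1 through 5 amount to the following estimator, written with the same quantities used in the baseline code on the next slide (r is RISKFREE, σ is VOLATILITY, X is the strike, N is RAND_N):

$$S_T = S_0 \exp\!\left[\left(r - \tfrac{1}{2}\sigma^2\right)T + \sigma\sqrt{T}\,Z\right], \qquad Z \sim \mathcal{N}(0,1)$$

$$\hat{C} = e^{-rT}\,\frac{1}{N}\sum_{i=1}^{N}\max\!\left(S_T^{(i)} - X,\; 0\right), \qquad \text{95\% CI half-width} = e^{-rT}\,\frac{1.96\,s}{\sqrt{N}}$$

where $s$ is the sample standard deviation of the simulated payoffs.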
Initial Implementation with GCC
• Use GCC 4.4.6
• C/C++ TR1 random number generator (Mersenne Twister)
• Program files:
  Driver.cpp           Main program file
  MonteCarlo.h         Parameter definitions
  MonteCarloStepn.cpp  Monte Carlo calculations
  Makefile             Build file

typedef std::tr1::mt19937                      ENG;   // Mersenne Twister
typedef std::tr1::normal_distribution<float>   DIST;
typedef std::tr1::variate_generator<ENG,DIST>  GEN;

ENG  eng;
DIST dist(0,1);
GEN  gen(eng,dist);
for(int opt = 0; opt < OPT_N; opt++)
{
    float VBySqrtT = VOLATILITY * sqrt(T[opt]);
    float MuByT    = (RISKFREE - 0.5 * VOLATILITY * VOLATILITY) * T[opt];
    float Sval = S[opt];
    float Xval = X[opt];
    float val = 0.0, val2 = 0.0;
    for(int pos = 0; pos < RAND_N; pos++)
    {
        float callValue = max(0.0, Sval * exp(MuByT + VBySqrtT * gen()) - Xval);
        val  += callValue;
        val2 += callValue * callValue;
    }
    float exprt = exp(-RISKFREE * T[opt]);
    CallResult[opt] = exprt * val / (float)RAND_N;
    float stdDev = sqrt(((float)RAND_N * val2 - val * val) /
                        ((float)RAND_N * (float)(RAND_N - 1)));
    CallConfidence[opt] = (float)(exprt * 1.96 * stdDev / sqrtf((float)RAND_N));
}
Your Mission: Make It Faster and Better
• Make it fast on the Intel® Xeon® processor, and even faster on the Intel® Xeon Phi™ coprocessor
• Take full advantage of the hardware resources
• Tools: Intel® Parallel Studio XE 2013 SP1
  – Intel® C/C++ Compiler
  – Intel® MKL
• Methodology: Stepwise Optimization Framework
Let's get started by typing "make".
Intel® Parallel Studio XE 2013
• Helping developers efficiently produce fast, scalable, and reliable applications

More Cores. Wider Vectors. Performance Delivered.
Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013

• More cores: scaling performance efficiently from multicore to many-core (50+ cores)
• Wider vectors: 128-bit, 256-bit, and 512-bit SIMD, spanning serial, task- and data-parallel, and distributed performance
• Industry-leading performance from advanced compilers
• Comprehensive libraries
• Parallel programming models
• Insightful analysis tools
What's New?
Intel® Parallel Studio XE 2013 / Intel® Cluster Studio XE 2013 (Intel® Compilers & Libraries)
• Performance leadership:
  – 3rd Generation Intel® Core™ processors (code name "Ivy Bridge") and future Intel® processors (code name "Haswell")
  – Intel® Xeon Phi™ coprocessors
  – Improved C++ and Fortran performance
• New product capabilities:
  – Latest OS: Windows* 8 Desktop, Linux*
  – IDE: Visual Studio 2008, 2010, 2012 and the GNU tool chain
  – Standards: C99, selected C++11 features, almost complete Fortran 2003 support, selected features from Fortran 2008, and MPI 2.2
Support for Latest Intel Processors and Coprocessors

                                 Intel® Ivy Bridge          Intel® Haswell             Intel® Xeon Phi™
                                 microarchitecture          microarchitecture          coprocessor
Intel® C++ and Fortran Compiler  ✔ AVX                      ✔ AVX2, FMA3               ✔ IMCI
Intel® TBB library               ✔                          ✔                          ✔
Intel® MKL library               ✔ AVX                      ✔ AVX2, FMA3               ✔
Intel® MPI library               ✔                          ✔                          ✔
Intel® VTune™ Amplifier XE†      ✔ Hardware events          ✔ Hardware events          ✔ Hardware events
Intel® Inspector XE              ✔ Memory & thread checks   ✔ Memory & thread checks   ✔ Memory & thread checks††
Performance-Oriented Compiler Suites
Intel® Compilers, Performance Libraries, Debugging Tools
On Windows, Linux, and Mac OS X

Intel® C++ Composer XE 2013
• Intel® C++ Compiler XE 13.0 with Intel® Cilk™ Plus
• Intel® TBB
• Intel® MKL
• Intel® IPP
• Intel® Xeon Phi™ product family support (Linux)

Intel® Fortran Composer XE 2013
• Intel® Fortran Compiler XE 13.0
• Intel® MKL
• Compatibility with Compaq Visual Fortran*
• Fortran 2003, 2008 support
• Intel® Xeon Phi™ product family support (Linux)

Intel® Composer XE 2013
• Combines Intel® C++ Composer XE and Intel® Fortran Composer XE
• For Fortran developers who also want Intel C++
• Windows (requires Visual Studio) and Linux only

Windows: Intel C++/Visual* C++ compatibility & integration into Microsoft* Visual Studio*
Linux: Intel C++/gcc* compatibility & integration into Eclipse* CDT
Mac OS X: Intel C++/gcc compatibility & integration into the Xcode* environment
All: Intel Fortran performance leadership, compatible with Compaq* Visual Fortran*
All: Leadership performance on Intel and compatible architectures
All: One year of Intel® Premier Support, renewable annually
Performance, Compatibility, Support
Superior C++ Compiler Performance
More performance:
• Just recompile
• Uses Intel® AVX and Intel® AVX2 instructions
• Intel® Xeon Phi™ product family support (Linux): compiler and debugger
• Intel® Cilk™ Plus: tasking and vectorization
Lab Step 1: Using Intel Compiler
Build Monte Carlo European Options using the Intel C/C++ Compiler
• The Intel Compiler is fully compatible with GCC
• Intel Parallel Studio XE 2013 is installed on your notebook; just type icpc -V to test it
• Source the environment variables:
  . /opt/intel/composerxe/pkg_bin/compilervars.sh intel64
• Reissue the make command with CXX=icpc:
  make CXX=icpc
• Rerun MonteCarlo built by Intel® C/C++ Composer XE:
  ./MonteCarlo
Intel® MKL
Intel® MKL Supports Intel® Xeon Phi™ Coprocessors
• Intel® MKL 11.0 supports the Intel® Xeon Phi™ coprocessors.
• Heterogeneous computing: takes advantage of both the multicore host and the many-core coprocessors.
• Optimized for wider (512-bit) SIMD instructions and threaded for many cores.
• All Intel MKL functions are supported, but optimized at different levels.
Pairing highly parallel software with highly parallel hardware.
Highly Optimized Functions
• As of MKL 11.0 Update 2 (the latest):
  – BLAS Level 3, and much of Level 1 & 2
  – Sparse BLAS: ?CSRMV, ?CSRMM
  – Some important LAPACK routines (LU, QR, Cholesky)
  – Fast Fourier transforms
  – Vector Math Library
  – Random number generators in the Vector Statistical Library
• Broader functionality to be optimized in future update releases.
Usage Models on Intel® Xeon Phi™ Coprocessors
• Automatic Offload
  • No code changes required
  • Automatically uses both host and target
  • Transparent data transfer and execution management
• Compiler Assisted Offload
  • Explicit control of data transfer and remote execution using compiler offload pragmas/directives
  • Can be used together with Automatic Offload
• Native Execution
  • Uses the coprocessors as independent nodes
  • Input data and binaries are copied to targets in advance
Lab Step 1: Using Intel Compiler and MKL
Using Intel® MKL Random Number Generation
• Include the MKL header file: #include <mkl_vsl.h>
• Declare a buffer to receive the random numbers:
  float random[RAND_N];
• Define a random stream descriptor:
  VSLStreamStatePtr Randomstream;
• Create and initialize the random stream:
  vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEED);
• Receive the random numbers in the buffer:
  vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, RAND_N, random, 0.0, 1.0);
• Add -mkl to your link options, then remake and rerun MonteCarlo (a full sketch of the updated loop follows below)
• Record the performance number in the Excel worksheet
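Putting these calls together, here is a minimal sketch of the updated pricing routine. It assumes the lab's parameter macros (OPT_N, RAND_N, RISKFREE, VOLATILITY, RANDSEED) come from MonteCarlo.h; illustrative fallback values are supplied only so the sketch compiles on its own, and the function name MonteCarloMKL and its signature are likewise illustrative. The structure mirrors the baseline loop, with the per-sample gen() call replaced by one batched vsRngGaussian call per option.

// Sketch only: in the lab these parameters come from MonteCarlo.h;
// the fallback values below are illustrative, not the lab's settings.
#include <mkl_vsl.h>
#include <cmath>
#include <algorithm>

#ifndef OPT_N
#define OPT_N      512
#define RAND_N     262144
#define RISKFREE   0.02f
#define VOLATILITY 0.30f
#define RANDSEED   777
#endif

void MonteCarloMKL(float *CallResult, float *CallConfidence,
                   const float *S, const float *X, const float *T)
{
    float *random = new float[RAND_N];          // buffer that receives the variates
    VSLStreamStatePtr Randomstream;
    vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEED);

    for (int opt = 0; opt < OPT_N; opt++)
    {
        // one batch of N(0,1) samples per option, generated by MKL VSL
        vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF, Randomstream,
                      RAND_N, random, 0.0f, 1.0f);

        const float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
        const float MuByT    = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt];
        float val = 0.0f, val2 = 0.0f;
        for (int pos = 0; pos < RAND_N; pos++)
        {
            float callValue = std::max(0.0f,
                S[opt] * expf(MuByT + VBySqrtT * random[pos]) - X[opt]);
            val  += callValue;
            val2 += callValue * callValue;
        }
        const float exprt = expf(-RISKFREE * T[opt]);
        CallResult[opt] = exprt * val / (float)RAND_N;
        const float stdDev = sqrtf(((float)RAND_N * val2 - val * val) /
                                   ((float)RAND_N * (float)(RAND_N - 1)));
        CallConfidence[opt] = exprt * 1.96f * stdDev / sqrtf((float)RAND_N);
    }

    vslDeleteStream(&Randomstream);
    delete[] random;
}

Rebuild with make CXX=icpc (with -mkl on the link line) and rerun ./MonteCarlo as before.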
Summary
• Use the Intel® Compiler for high performance.
• Use Intel® MKL to accelerate Monte Carlo.
Backup
Automatic Offload (AO)
• Offloading is automatic and transparent.
• Can take advantage of multiple coprocessors.
• By default, Intel MKL decides:
  • When to offload
  • Work division between host and targets
• Users enjoy host and target parallelism automatically.
• Users can still specify the work division between host and target (for BLAS only).
How to Use Automatic Offload
• Using Automatic Offload is easy:
  Set an environment variable:  MKL_MIC_ENABLE=1
  or call a function:           mkl_mic_enable()
• What if there is no coprocessor in the system?
  • The code runs on the host as usual, without penalty!
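As an illustration, a minimal sketch of enabling AO from code before a large SGEMM; the matrix size and the cblas_sgemm call are illustrative, chosen only so the problem is big enough to exceed the AO thresholds listed on the next slide.

// Sketch: enable Automatic Offload, then issue a large SGEMM that MKL may
// transparently split between the host and any Xeon Phi coprocessors present.
#include <mkl.h>
#include <vector>

int main()
{
    const int n = 4096;                               // illustrative; large enough for AO
    std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);

    mkl_mic_enable();                                 // same effect as MKL_MIC_ENABLE=1

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A.data(), n, B.data(), n, 0.0f, C.data(), n);
    return 0;
}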
Automatic Offload Enabled Functions
• A selected set of MKL functions are AO enabled.
  • Only functions with sufficient computation to offset the data transfer overhead are subject to AO.
• In 11.0.2, AO-enabled functions include:
  • Level-3 BLAS: ?GEMM, ?TRSM, ?TRMM, ?SYMM
  • LAPACK 3 amigos: LU, QR, Cholesky
• Offloading happens only when matrix sizes are right:
  • ?GEMM: M, N > 2048, K > 256
  • ?SYMM: M, N > 2048
  • ?TRSM/?TRMM: M, N > 3072
  • LU: M, N > 8192
Work Division Control in Automatic Offload

Example                                             Notes
mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5)    Offload 50% of the computation only to the 1st card.
MKL_MIC_0_WORKDIVISION=0.5                          Offload 50% of the computation only to the 1st card.

Work division settings have no effect for LAPACK functions.
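For reference, a minimal sketch combining the programmatic controls above; the 50% split is illustrative, and the all-lowercase spelling mkl_mic_set_workdivision is the support-function name as declared through mkl.h.

// Sketch: enable AO and send half of the eligible (BLAS) work to card 0.
#include <mkl.h>

void set_ao_split()
{
    mkl_mic_enable();                                    // turn Automatic Offload on
    mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5);    // 50% of the work to the 1st card
}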
Compiler Assisted Offload (CAO)
• Offloading is explicitly controlled by compiler pragmas or directives.
• All MKL functions can be offloaded in CAO.
  • In comparison, only a subset of MKL is subject to AO.
• Can leverage the full potential of the compiler's offloading facility.
• More flexibility in data transfer and remote execution management.
  • A big advantage is data persistence: reusing transferred data for multiple operations (see the sketch below).
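A minimal sketch of the data-persistence idea using the compiler's offload clauses; the function name, the array A, and its length are illustrative rather than taken from the lab.

// Sketch: transfer A once, keep it resident on the coprocessor (free_if(0)),
// and reuse it in later offload regions without re-transferring it (nocopy, alloc_if(0)).
void persistent_offload(float *A, long n)
{
    #pragma offload target(mic:0) in(A : length(n) alloc_if(1) free_if(0))
    {
        // first operation on A runs on the coprocessor; A stays allocated there
    }

    #pragma offload target(mic:0) nocopy(A : length(n) alloc_if(0) free_if(0))
    {
        // later operations reuse the copy of A already resident on the card
    }

    #pragma offload target(mic:0) nocopy(A : length(n) alloc_if(0) free_if(1))
    {
        // final use; the coprocessor buffer is freed afterwards
    }
}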
How to Use Compiler Assisted Offload
• The same way you would offload any function call to the coprocessor.
• An example in C:

#pragma offload target(mic) \
    in(transa, transb, N, alpha, beta) \
    in(A:length(matrix_elements)) \
    in(B:length(matrix_elements)) \
    in(C:length(matrix_elements)) \
    out(C:length(matrix_elements) alloc_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
          &beta, C, &N);
}
How to Use Compiler Assisted Offload
• An example in Fortran:

!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
!DEC$ OMP OFFLOAD TARGET( MIC ) &
!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
!DEC$ IN( A: LENGTH( NCOLA * LDA )), &
!DEC$ IN( B: LENGTH( NCOLB * LDB )), &
!DEC$ INOUT( C: LENGTH( N * LDC ))
!$OMP PARALLEL SECTIONS
!$OMP SECTION
      CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, &
                  A, LDA, B, LDB, BETA, C, LDC )
!$OMP END PARALLEL SECTIONS
Using AO and CAO in the Same Program
• Users can use AO for some MKL calls and CAO for others in the same program.
  • Only supported by Intel compilers.
  • Work division must be set explicitly for AO; otherwise, all MKL AO calls are executed on the host.
Native Execution
• Use the coprocessor as an independent compute node.
  • Programs can be built to run only on the coprocessor by using the -mmic build option.
• MKL function calls inside an offloaded code region execute natively.
• Better performance if the input data is already available on the coprocessor and the output is not immediately needed on the host side.
Considerations for Using Intel® MKL on Intel® Xeon Phi™ Coprocessors
High-level parallelism is critical to maximizing performance.
• BLAS (Level 3) and LAPACK with large problem sizes get the most benefit.
• Scale beyond hundreds of threads, vectorize, and maintain good data locality.
Minimize data transfer overhead when offloading.
• Offset the data transfer overhead with enough computation.
• Exploit data persistence: CAO can help!
You can always run on the host if offloading does not offer better performance.
Value of Suites
Suite-Only Features
• Advisor XE: parallelism advice
• C++ Performance Guide: performance wizard
• Pointer Checker: reduces memory corruption
• Code Complexity Analysis: find code likely to be less reliable
• Static Analysis (improved!): find errors and harden your security
What's New in Libraries?
Intel® MKL
• Digital random number generator (DRNG) for improved vector statistics calculations
• Automatically utilizes Intel® Xeon Phi™ coprocessors and balances compute loads between CPUs and coprocessors
Intel® IPP
• Enhanced image resize performance primitives
• Improved IPP footprint size
Intel® TBB
• Improved usability and reliability of the Flow Graph feature
• Additional C++11 support

"Intel® TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table—crowd simulation software."
Michaël Rouillé, CTO, Gol
Ready to Use Libraries to Increase Performance