Parallelization
Shuo Li
Financial Services Engineering
Software and Service Group
Intel Corporation
Agenda
• Parallelism on Intel® Architecture
• Challenges in Parallelization
• Options for Parallelization
• Summary
Parallelism on Intel® Architecture

(Images do not reflect actual die sizes. Actual production die may differ from images.)

Processor                                        Core(s)  Threads
Intel® Xeon® processor, 64-bit                   1        2
Intel Xeon processor 5100 series                 2        2
Intel Xeon processor 5500 series                 4        8
Intel Xeon processor 5600 series                 6        12
Intel Xeon processor E5 Product Family           8        16
Intel Xeon processor, code name Ivy Bridge       10       20
Intel Xeon processor, code name Haswell          To be announced
Intel® Xeon Phi™ Coprocessor                     61       244
Intel® Xeon Phi™ Coprocessor extends the established CPU architecture and programming paradigm to highly parallel applications.
Options for Parallelism on Intel® Architecture
(Models ordered from ease of use/maintainability down to more control)

• Intel® TBB – C++ template library of parallel algorithms and containers; load balancing via work stealing
• Intel® Cilk™ Plus – Keyword extension of C/C++; serial equivalence via the compiler; load balancing via work stealing
• OpenMP* – Well-known industry standard; best suited when resource utilization is known at design time
• pthreads* – Time-tested industry standard for Unix-like systems; common denominator for other high-level threading libraries
• What is available on the Intel® host processor is also available on the Intel® target coprocessor
• Many other libraries (boost, zthreads) have been ported to the coprocessor
• Choose the threading model your problem dictates
Options for Parallelism – pthreads*
• POSIX* standard thread API with a 20-year history
• Foundation for other high-level threading libraries
• Exists independently on the host and on Intel® MIC
• No extensions needed to go from the host to Intel® MIC
• Advantage: the programmer has explicit control
  – From workload partitioning to thread creation, synchronization, load balancing, affinity settings, etc.
• Disadvantage: the programmer has too much control, which costs
  – Code longevity
  – Maintainability
  – Scalability
Black-Scholes Using pthreads*
pthread_attr_init(&attr);
clock_gettime(0, &t0);
for (int i = 0; i < nThreads; i++) {
    // Map thread i to a logical CPU: 4 hardware threads per core.
    int t = 4 * (i / SMT) + (i % SMT);
    set_thread_affinity_attr(t, &attr);
    pthread_create(&threads[i], &attr, bs_thread, (void *) i);
}
for (i = 0; i < nThreads; i++) {
    int ret;
    pthread_join(threads[i], (void **) &ret);
}
clock_gettime(0, &t1);

__forceinline
void BlkSchlsEqEuroNoDiv_C(fptype *OptionPrice, fptype *OptionPrice2,
                           fptype *sptprice, fptype *strike, fptype *time)
{
    fptype sqrtT = SQRT(*time);
    fptype d1 = LOG2(*sptprice / *strike) / (Vlog2E * sqrtT) + RVV * sqrtT;
    fptype d2 = d1 - VOLATILITY * sqrtT;
    fptype NofXd1, NofXd2;
    CNDF_C(&NofXd1, &d1);
    CNDF_C(&NofXd2, &d2);
    fptype expRT = EXP2(ZR * (*time));
    *OptionPrice  = ((*sptprice) * NofXd1) - ((*strike) * expRT * NofXd2);
    *OptionPrice2 = *OptionPrice + expRT - (*sptprice);
}

void *bs_thread(void *arg1)
{
    int tid = (int) arg1;
    int start = tid * (numOptions / nThreads);
    int end   = start + (numOptions / nThreads);
    for (int j = 0; j < numRuns; j++) {
        #pragma ivdep
        #pragma vector aligned
        for (int i = start; i < end; i++)
            BlkSchlsEqEuroNoDiv_C(&(gprice[i]), &(gprice2[i]),
                                  &(sptprice[i]), &(strike[i]), &(otime[i]));
    }
    barrier(tid);
    return NULL;
}
Thread Affinity using pthreads*
• Partition the workload to avoid load imbalance
  – Understand static vs. dynamic workload partitioning
• Use the pthread affinity APIs: define, initialize, set, destroy
  – Set CPU affinity with pthread_setaffinity_np() (a minimal helper sketch follows the diagram below)
  – Know the thread enumeration and avoid core 0
  – Core 0 boots the coprocessor, runs the job scheduler, and services interrupts
[Diagram: Intel® Xeon Phi™ thread enumeration – logical CPUs 1–4 map to core 0, 5–8 to core 1, and so on, while logical CPUs 241–243 and 0 map to the last core (core 60).]
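Below is a minimal sketch of an affinity helper like the set_thread_affinity_attr() used in the earlier Black-Scholes example. The helper is not defined on these slides; this version assumes Linux/glibc and its pthread_attr_setaffinity_np() extension.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin threads created with this attribute object to one logical CPU.
// Choose 'cpu' so that it avoids the logical CPUs that map to core 0.
void set_thread_affinity_attr(int cpu, pthread_attr_t *attr)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    pthread_attr_setaffinity_np(attr, sizeof(cpu_set_t), &cpuset);
}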
Options for Parallelism – OpenMP*
• Compiler directive/pragma-based threading constructs
  – Plus utility library functions and environment variables
• Specify blocks of code that execute in parallel:

#pragma omp parallel sections
{
    #pragma omp section
    task1();
    #pragma omp section
    task2();
    #pragma omp section
    task3();
}

• Fork-join parallelism:
  – The master thread spawns a team of worker threads as needed
  – Parallelism grows incrementally
[Diagram: a master thread forking into parallel regions and joining again]
OpenMP* Pragmas and Extensions
• OpenMP* pragmas in C/C++:
  #pragma omp construct [clause [clause]…]
• Large, robust specification that includes
  – Parallel sections and tasks
  – Parallel loops
  – Synchronization points (critical sections, barriers)
  – Atomic and ordered updates
  – Serial sections within the parallel code

#pragma omp parallel sections
{
    #pragma omp section
    {
        BinomialTemplate<F32vec8, float>(callResult, S, X, T, R, V, N, NUM_STEPS);
    }
    #pragma omp section
    #pragma offload target(mic:0) in(S, X, T : length(N)) \
                                  out(MICExpected, MICConfidence : length(N))
    {
        montecarlo(S, X, T, MICExpected, MICConfidence);
    }
}

• Extension to support offloading – OpenMP* 4.0 RC2
  – Use #pragma omp target, or #pragma offload from Intel LEO
  – Either syntax works; there is no performance difference
OpenMP* Worksharing Construct
Sequential code:

for (i = 0; i < N; i++) a[i] = a[i] + b[i];

OpenMP* parallel region:

#pragma omp parallel
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id + 1) * N / Nthrds;
    for (i = istart; i < iend; i++)
        a[i] = a[i] + b[i];
}

OpenMP* worksharing construct:

#pragma omp parallel
#pragma omp for
for (i = 0; i < N; i++) a[i] = a[i] + b[i];
OpenMP*: Shared, Private and Reduction Variables
• Default rules
  – Variables defined outside the parallel region are shared
  – Variables defined inside the parallel region are private
• Overriding the defaults
  – The private(var) clause creates a local copy of var for each thread
  – Loop indices in a parallel for are private by default
• The reduction(op:list) clause is a special case of shared
  – Variables in "list" must be shared in the enclosing parallel region
  – A local copy of each reduction variable is created and initialized according to op (0 for "+")
  – The compiler finds reduction expressions containing op and uses them to update the local copy
  – Local copies are reduced to a single value and combined with the original global value
#pragma omp parallel reduction(+ : sum_delta) reduction(+ : sum_ref)
{
    float local_sum_delta = 0.0f;
    for (int i = 0; i < OptPerThread; i++)
    {
        ref   = callReference;
        delta = fabs(callReference - CallResult[i]);
        local_sum_delta += delta;
        sum_ref += fabs(ref);
    }
    sum_delta += local_sum_delta;  // one reduction update per thread
}
OpenMP* Performance and Scalability Issues
• Manage thread creation cost
  – Create threads as early as possible and maximize the work given to worker threads
  – IA threads take some time to create, but once they are up, they last until the end

#pragma omp parallel for
for (int k = 0; k < RAND_N; k++)
    h_Random[k] = cdfnorminv((k + 1.0) / (RAND_N + 1.0));

• Take advantage of memory locality; use the NUMA memory manager
  – Allocate memory on the thread that will access it later
  – Try not to allocate the memory the worker threads use in the main thread

#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++)
{
    CallResultList[opt]     = 0;
    CallConfidenceList[opt] = 0;
}

float *CallResult = (float *) scalable_aligned_malloc(mem_size, SIMDALIGN);
float *PutResult  = (float *) scalable_aligned_malloc(mem_size, SIMDALIGN);

• Ensure your OpenMP* program works serially and compiles without OpenMP*
  – Protect OpenMP* API calls with _OPENMP
  – Make sure the serial version works before enabling OpenMP* (e.g., compiling with -openmp)

#pragma omp parallel
{
#ifdef _OPENMP
    int threadID = omp_get_thread_num();
#else
    int threadID = 0;
#endif
}

#ifdef _OPENMP
int ThreadNum = omp_get_max_threads();
omp_set_num_threads(ThreadNum);
#else
int ThreadNum = 1;
#endif

• Minimize thread synchronization
  – Use local variables to reduce the need to access global variables
OpenMP* Offload Environment Variables
• Set/get the number of coprocessor threads from the host
  – Note that omp_get_max_threads_target() returns 4*(ncore-1)
  – Use omp_set_num_threads_target() and omp_get_num_threads_target()
  – Protect these calls under #ifdef __INTEL_OFFLOAD (see the sketch below)
• Access coprocessor environment variables from the host processor
  – First define MIC_ENV_PREFIX=MIC
  – Issue export MIC_OMP_NUM_THREADS=240 on the host
  – OpenMP then sets the coprocessor maximum thread count to 240
  – Host OpenMP threads still take their cue from OMP_NUM_THREADS
• The initial stack size on the device defaults to 12 MB
  – Use MIC_STACKSIZE to override the default size for the main thread on the coprocessor
  – Use MIC_OMP_STACKSIZE to change the default stack size for worker threads
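A minimal host-side sketch of the calls above. The TARGET_MIC target type and the trailing target-number argument follow Intel's offload extensions, but treat the exact signatures as an assumption to check against your compiler's documentation.

#include <omp.h>
#ifdef __INTEL_OFFLOAD
#include <offload.h>
#endif

void configure_coprocessor_threads(void)
{
#ifdef __INTEL_OFFLOAD
    /* Query the maximum thread count on coprocessor 0: 4*(ncore-1). */
    int max_threads = omp_get_max_threads_target(TARGET_MIC, 0);
    /* Use all of them for offloaded parallel regions. */
    omp_set_num_threads_target(TARGET_MIC, 0, max_threads);
#endif
}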
Step 4: Parallelization
• Add #pragma omp parallel for to the outer loop (see the sketch below)
• Add -openmp to the C/C++ compiler invocation options (CCFLAGS)
• Rerun the program: ./MonteCarlo
• Record the performance again

export KMP_AFFINITY="compact,granularity=fine"
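A sketch of what the Step 4 change looks like, assuming a typical Monte Carlo driver loop over independent options; MonteCarloKernel and the array names here are hypothetical stand-ins for the lab source.

/* Each option is priced independently, so the outer loop over
   options is safe to parallelize. */
#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++)
{
    MonteCarloKernel(&CallResultList[opt], &CallConfidenceList[opt],
                     S[opt], X[opt], T[opt]);
}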
Other Options for Parallelism – Intel® Cilk™ Plus
• C/C++ extension for fine-grained task parallelism
• Three keywords (sketched below)
  _Cilk_spawn
    • The function call may run in parallel with the caller – up to the runtime
  _Cilk_sync
    • The caller waits for all children to complete
  _Cilk_for
    • Iterations are structured into a work queue
    • Busy cores do not execute the loop
    • Idle cores steal work items from the queue
    • Loops must be countable; granularity is N/2, N/4, N/8, … for a trip count of N
• Intended use:
  – When iterations are not balanced, or
  – When the overall load is not known at design time
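A minimal sketch of the three keywords in use; the fib example is illustrative, not from the slides. The _Cilk_* spellings are built into compilers with Cilk Plus support, so no header is needed for them.

int fib(int n)
{
    if (n < 2) return n;
    int x = _Cilk_spawn fib(n - 1);  // child may run in parallel with caller
    int y = fib(n - 2);
    _Cilk_sync;                      // wait for the spawned child
    return x + y;
}

void vector_add(float *a, const float *b, const float *c, int n)
{
    _Cilk_for (int i = 0; i < n; i++)  // iterations become stealable work items
        a[i] = b[i] + c[i];
}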
Offload Using Intel® Cilk™ Plus
• Intel® C/C++ Compiler extension with new offloading keywords
• Provides the appearance of shared memory using virtual shared-memory technology

• Offloading a function call
  Example: x = _Cilk_shared _Cilk_offload func(y);
  Description: func can execute on Intel MIC
• Offloading asynchronously
  Example: x = _Cilk_spawn _Cilk_offload func(y);
  Description: non-blocking offload
• Data available on both sides
  Example: _Cilk_shared int x = 0;
  Description: allocated in the shared memory area; can be synchronized
• Function available on both sides
  Example: int _Cilk_shared f(int x) { return x+1; }
  Description: the function can execute on either side
• Offload a parallel for loop (requires Cilk on Intel MIC)
  Example: _Cilk_offload _Cilk_for (i = 0; i < N; i++) { a[i] = b[i] + c[i]; }
  Description: the loop executes in parallel on Intel MIC; it is implicitly outlined as a function call (break inside the loop is disallowed)
• Offload array expressions
  Example: _Cilk_offload a[:] = b[:] <op> c[:];
           _Cilk_offload a[:] = elemental_func(b[:]);
  Description: array operations execute in parallel on Intel MIC
Black-Scholes – Using Intel® C/C++ Compiler with Cilk™ Plus Technology
Serial version:

double option_price_call_black_scholes(
    double S,      // spot (underlying) price
    double K,      // strike (exercise) price
    double r,      // interest rate
    double sigma,  // volatility
    double time)   // time to maturity
{
    double time_sqrt = sqrt(time);
    double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
    double d2 = d1 - (sigma*time_sqrt);
    return S*N(d1) - K*exp(-r*time)*N(d2);
}

// invoke calculations for call-options
for (int i = 0; i < NUM_OPTIONS; i++) {
    call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
}

With Cilk™ Plus – an elemental function and a parallel loop:

__declspec (vector)
double option_price_call_black_scholes(
    double S,      // spot (underlying) price
    double K,      // strike (exercise) price
    double r,      // interest rate
    double sigma,  // volatility
    double time)   // time to maturity
{
    double time_sqrt = sqrt(time);
    double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
    double d2 = d1 - (sigma*time_sqrt);
    return S*N(d1) - K*exp(-r*time)*N(d2);
}

// invoke calculations for call-options
_Cilk_for (int i = 0; i < NUM_OPTIONS; i++) {
    call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
}
Elemental functions utilize both core and vector parallelism
Black-Scholes – Using Intel® C/C++ Compiler Keyword Extension for Offload
Host-only version (elemental function):

__declspec (vector)
double option_price_call_black_scholes(
    double S,      // spot (underlying) price
    double K,      // strike (exercise) price
    double r,      // interest rate
    double sigma,  // volatility
    double time)   // time to maturity
{
    double time_sqrt = sqrt(time);
    double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
    double d2 = d1 - (sigma*time_sqrt);
    return S*N(d1) - K*exp(-r*time)*N(d2);
}

With the offload keywords – the function is shared, and the first loop runs on the coprocessor:

_Cilk_shared __declspec (vector)
double option_price_call_black_scholes(
    double S,      // spot (underlying) price
    double K,      // strike (exercise) price
    double r,      // interest rate
    double sigma,  // volatility
    double time)   // time to maturity
{
    double time_sqrt = sqrt(time);
    double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
    double d2 = d1 - (sigma*time_sqrt);
    return S*N(d1) - K*exp(-r*time)*N(d2);
}

// invoke calculations for call-options: first invocation on MIC, second on Xeon
_Cilk_offload _Cilk_for (int i = 0; i < NUM_OPTIONS; i++) {
    call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
}
_Cilk_for (int i = 0; i < NUM_OPTIONS; i++) {
    call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
}
Running Black-Scholes on the Intel® Xeon® Processor and the Intel® Xeon Phi™ Coprocessor Concurrently
_Cilk_shared __declspec (vector)
double option_price_call_black_scholes(
    double S,      // spot (underlying) price
    double K,      // strike (exercise) price
    double r,      // interest rate
    double sigma,  // volatility
    double time)   // time to maturity
{
    double time_sqrt = sqrt(time);
    double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
    double d2 = d1 - (sigma*time_sqrt);
    return S*N(d1) - K*exp(-r*time)*N(d2);
}

_Cilk_shared void wrapper()
{
    _Cilk_for (int i = 0; i < NUM_OPTIONS; i++) {
        call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
    }
}

// invoke calculations for call-options: first invocation on MIC, second on Xeon
…
_Cilk_spawn _Cilk_offload wrapper();
wrapper();
_Cilk_sync;
…
Another Option for Parallelism – Intel® Threading Building Blocks (TBB)
• C++ classes and templates that implement task-based parallelism
  – As opposed to "threads"
  – Makes use of "work stealing" to distribute work evenly across threads and ensure good cache behavior
• Provides a wide range of template classes to implement efficient parallel algorithms (a usage sketch follows the component list below)
  – Generic parallel patterns
  – Concurrent containers
  – Synchronization primitives
  – Memory allocation
  – Task scheduling
  – Thread local storage
  – Etc.

Generic Parallel Algorithms: parallel_for(range); parallel_reduce; parallel_for_each(begin, end); parallel_do; parallel_invoke; pipeline, parallel_pipeline; parallel_sort; parallel_scan
Task scheduler: task_group; task_structured_group; task_scheduler_init; task_scheduler_observer
Concurrent Containers: concurrent_hash_map; concurrent_queue; concurrent_bounded_queue; concurrent_vector; concurrent_unordered_map
Threads: tbb_thread, thread
Thread Local Storage: enumerable_thread_specific; combinable
Synchronization Primitives: atomic; mutex; recursive_mutex; spin_mutex; spin_rw_mutex; queuing_mutex; queuing_rw_mutex; reader_writer_lock; critical_section; condition_variable; lock_guard; unique_lock; null_mutex; null_rw_mutex
Memory Allocation: tbb_allocator; cache_aligned_allocator; scalable_allocator; zero_allocator
Miscellaneous: tick_count
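A minimal usage sketch (not from the slides) of the most common entry point, tbb::parallel_for, whose ranges are distributed across workers by work stealing:

#include <cstddef>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void vector_add(float *a, const float *b, std::size_t n)
{
    // The range [0, n) is recursively split; idle workers steal subranges.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
        [=](const tbb::blocked_range<std::size_t> &r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                a[i] += b[i];
        });
}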
Options for Parallelism – Comparison
                                   pthreads*   OpenMP*   Intel® Cilk™ Plus   Intel® TBB
Code rewrite required to use       Lots        Little    Little              Moderate
Serial code source compatibility   No          Yes       Likely              No
Compiler dependency                No          No        Yes                 No
Supports Fortran                   Yes         Yes       No                  No
Supports C                         Yes         Yes       Yes                 No
Supports C++                       Yes         Yes       Yes                 Yes