
PARALLEL PROGRAMMING WITH
OPENMP
Ing. Andrea Marongiu
[email protected]
Programming model: OpenMP

• De-facto standard for the shared memory programming model
• A collection of compiler directives, library routines and environment variables
• Easy to specify parallel execution within a serial code
• Requires special support in the compiler
• Generates calls to threading libraries (e.g. pthreads)
• Focus on loop-level parallel execution
• Popular in high-end embedded systems
Fork/Join Parallelism

[Diagram: execution timeline of a sequential program vs. a parallel program]

• Initially only the master thread is active
• The master thread executes the sequential code
• Fork: the master thread creates or awakens additional threads to execute parallel code
• Join: at the end of the parallel code, the created threads are suspended upon barrier synchronization
Pragmas

• Pragma: a compiler directive in C or C++
• Stands for "pragmatic information"
• A way for the programmer to communicate with the compiler
• The compiler is free to ignore pragmas: the original sequential semantics are not altered
• Syntax:
  #pragma omp <rest of pragma>
Components of OpenMP

Directives
• Parallel regions: #pragma omp parallel
• Work sharing: #pragma omp for, #pragma omp sections
• Synchronization: #pragma omp barrier, #pragma omp critical, #pragma omp atomic

Clauses
• Data scope attributes: private, shared, reduction
• Loop scheduling: static, dynamic

Runtime Library
• Thread forking/joining: omp_parallel_start(), omp_parallel_end()
• Loop scheduling
• Thread IDs: omp_get_thread_num(), omp_get_num_threads()
Outlining parallelism
The parallel directive

• Fundamental construct to outline parallel computation within a sequential program
• Code within its scope is replicated among threads
• Defers the implementation of parallel execution to the runtime (machine-specific, e.g. pthread_create)

A sequential program..

int main()
{
  #pragma omp parallel
  {
    printf ("\nHello world!");
  }
}

..is easily parallelized with the parallel pragma. The compiler transforms it into:

int main()
{
  omp_parallel_start(&parfun, …);
  parfun();
  omp_parallel_end();
}

int parfun(…)
{
  printf ("\nHello world!");
}

How the transformation works:
• Code originally contained within the scope of the pragma is outlined to a new function within the compiler
• The #pragma construct in the main function is replaced with function calls to the runtime library
• First we call the runtime to fork new threads, and pass them a pointer to the function to execute in parallel
• Then the master itself calls the parallel function
• Finally we call the runtime to synchronize threads with a barrier and suspend them
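For reference, a minimal compilable version of the Hello World example, assuming gcc's OpenMP support; the omp.h include, the thread-ID calls and the build command are additions for illustration, not part of the original slides:

/* Build with: gcc -fopenmp hello.c -o hello */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* every thread in the team executes this block */
        printf("\nHello world from thread %d of %d!",
               omp_get_thread_num(), omp_get_num_threads());
    }
    printf("\n");
    return 0;
}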
Data scope attributes
A slightly more complex example

int main()
{
  int id;
  int a = 5;
  #pragma omp parallel
  {
    id = omp_get_thread_num();
    if (id == 0)
      printf ("Master: a = %d.", a*2);
    else
      printf ("Slave: a = %d.", a);
  }
}

• omp_get_thread_num() calls the runtime to get the thread ID: every thread sees a different value
• Master and slave threads access the same variable a

The same example with data scope attribute clauses:

int main()
{
  int id;
  int a = 5;
  #pragma omp parallel shared (a) private (id)
  {
    id = omp_get_thread_num();
    if (id == 0)
      printf ("Master: a = %d.", a*2);
    else
      printf ("Slave: a = %d.", a);
  }
}

• shared (a): insert code to retrieve the address of the shared object from within each parallel thread
• private (id): allow symbol privatization; each thread contains a private copy of this variable
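A compilable version of the example above, with the includes and return added for completeness (these details are not shown on the slides):

/* Build with: gcc -fopenmp scope.c -o scope */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int id;        /* privatized below: each thread gets its own copy */
    int a = 5;     /* shared: every thread reads the same variable    */

    #pragma omp parallel shared(a) private(id)
    {
        id = omp_get_thread_num();
        if (id == 0)
            printf("Master: a = %d.\n", a * 2);
        else
            printf("Slave: a = %d.\n", a);
    }
    return 0;
}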
Sharing work among threads
The for directive

• The parallel pragma instructs every thread to execute all of the code inside the block
• If we encounter a for loop that we want to divide among threads, we use the for pragma:
  #pragma omp for
#pragma omp for

int main()
{
  #pragma omp parallel for
  for (i=0; i<10; i++)
    a[i] = i;
}

The compiler transforms this into:

int main()
{
  omp_parallel_start(&parfun, …);
  parfun();
  omp_parallel_end();
}

int parfun(…)
{
  int LB = …;
  int UB = …;
  for (i=LB; i<UB; i++)
    a[i] = i;
}
The schedule clause
Static Loop Partitioning

E.g. 12 iterations (N), 4 threads (Nthr):

#pragma omp for schedule(static)
for (i=0; i<12; i++)
  a[i] = i;

Data chunk: C = ceil(N / Nthr) = 3 iterations per thread

Thread ID (TID)                        0    1    2    3
Lower bound  LB = C * TID              0    3    6    9
Upper bound  UB = min{C*(TID+1), N}    3    6    9   12

Useful for:
• Simple, regular loops
• Iterations with equal duration
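The bounds in the table above can be reproduced with a few lines of C; this is only a sketch of the chunking arithmetic, not of the actual OpenMP runtime:

/* Chunking arithmetic behind schedule(static): N iterations, Nthr threads */
#include <stdio.h>

int main(void)
{
    int N = 12, Nthr = 4;
    int C = (N + Nthr - 1) / Nthr;            /* C = ceil(N / Nthr) = 3 */

    for (int tid = 0; tid < Nthr; tid++) {
        int LB = C * tid;                     /* lower bound for this thread */
        int UB = C * (tid + 1);               /* upper bound ...             */
        if (UB > N) UB = N;                   /* ... clamped to N            */
        printf("TID %d: iterations [%d, %d)\n", tid, LB, UB);
    }
    return 0;
}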
The schedule clause
Static Loop Partitioning: unbalanced workloads

#pragma omp for schedule(static)
for (i=0; i<12; i++)
{
  int start = rand();
  int count = 0;
  while (start++ < 256)
    count++;
  a[count] = foo();
}

[Figure: iteration space 1..12. With static partitioning, iterations of very different duration are assigned to each thread in fixed chunks, resulting in UNBALANCED workloads]

The schedule clause
Dynamic Loop Partitioning

The same loop with schedule(dynamic):

#pragma omp for schedule(dynamic)
for (i=0; i<12; i++)
{
  int start = rand();
  int count = 0;
  while (start++ < 256)
    count++;
  a[count] = foo();
}

The runtime environment maintains a work queue; each thread asks for the next available chunk of iterations:

int parfun(…)
{
  int LB, UB;
  GOMP_loop_dynamic_next(&LB, &UB);
  for (i=LB; i<UB; i++) {…}
}

[Figure: iteration space 1..12. Iterations are dequeued on demand by whichever thread is free, resulting in BALANCED workloads]
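A compilable sketch of dynamic scheduling (not the slide's exact code): each iteration performs a different amount of work, and we record which thread executed it to observe the on-demand distribution. The work loop, the array who and the chunk size of 1 are assumptions for illustration:

/* Build with: gcc -fopenmp dyn.c -o dyn */
#include <stdio.h>
#include <omp.h>

#define N 12

int main(void)
{
    int who[N];

    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < N; i++) {
        volatile double x = 0.0;
        for (long k = 0; k < (long)(i + 1) * 1000000; k++)
            x += 1.0;                  /* iteration i does roughly (i+1) units of work */
        who[i] = omp_get_thread_num(); /* remember which thread ran it */
    }

    for (int i = 0; i < N; i++)
        printf("iteration %2d -> thread %d\n", i, who[i]);
    return 0;
}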
Sharing work among threads
The sections directive

• The for pragma allows us to exploit data parallelism in loops
• OpenMP also provides a directive to exploit task parallelism:
  #pragma omp sections
Task Parallelism Example

int main()
{
  v = alpha();
  w = beta ();
  y = delta ();
  x = gamma (v, w);
  z = epsilon (x, y);
  printf ("%f\n", z);
}

A first decomposition: alpha and beta are independent, and then delta and gamma are independent:

int main()
{
  #pragma omp parallel sections
  {
    #pragma omp section
    v = alpha();
    #pragma omp section
    w = beta ();
  }
  #pragma omp parallel sections
  {
    #pragma omp section
    y = delta ();
    #pragma omp section
    x = gamma (v, w);
  }
  z = epsilon (x, y);
  printf ("%f\n", z);
}

An alternative decomposition, grouping alpha, beta and delta first:

int main()
{
  #pragma omp parallel sections
  {
    #pragma omp section
    v = alpha();
    #pragma omp section
    w = beta ();
    #pragma omp section
    y = delta ();
  }
  #pragma omp parallel sections
  {
    #pragma omp section
    x = gamma (v, w);
    #pragma omp section
    z = epsilon (x, y);
  }
  printf ("%f\n", z);
}
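A self-contained, compilable sketch of the first decomposition; the stub implementations of alpha(), beta(), gamma_(), delta() and epsilon() are assumptions added for illustration (gamma is renamed gamma_ only to avoid clashing with the C library's gamma()):

/* Build with: gcc -fopenmp sections.c -o sections */
#include <stdio.h>

double alpha(void)                 { return 1.0; }
double beta(void)                  { return 2.0; }
double gamma_(double v, double w)  { return v + w; }
double delta(void)                 { return 3.0; }
double epsilon(double x, double y) { return x * y; }

int main(void)
{
    double v, w, x, y, z;

    /* alpha() and beta() are independent: run them in parallel sections */
    #pragma omp parallel sections
    {
        #pragma omp section
        v = alpha();
        #pragma omp section
        w = beta();
    }

    /* delta() and gamma_(v, w) are independent of each other */
    #pragma omp parallel sections
    {
        #pragma omp section
        y = delta();
        #pragma omp section
        x = gamma_(v, w);
    }

    z = epsilon(x, y);   /* depends on both x and y: executed sequentially */
    printf("%f\n", z);
    return 0;
}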
#pragma omp barrier

• The most important synchronization mechanism in shared memory fork/join parallel programming
• All threads participating in a parallel region wait until everybody has finished before computation flows on
• This prevents later stages of the program from working with inconsistent shared data
• It is implied at the end of parallel constructs, as well as for and sections (unless a nowait clause is specified)
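A minimal sketch (not from the slides) showing an explicit barrier: every thread fills its own slot of a shared array, and the barrier guarantees the whole array is written before thread 0 reads it. The array size of 64 is an assumed upper bound on the number of threads:

/* Build with: gcc -fopenmp barrier.c -o barrier */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data[64] = {0};   /* assumes at most 64 threads in the team */

    #pragma omp parallel shared(data)
    {
        int id  = omp_get_thread_num();
        int nth = omp_get_num_threads();

        data[id] = id * id;          /* stage 1: each thread writes its slot */

        #pragma omp barrier          /* wait until every thread has written  */

        if (id == 0) {               /* stage 2: now safe to read all slots  */
            int sum = 0;
            for (int i = 0; i < nth; i++)
                sum += data[i];
            printf("sum = %d\n", sum);
        }
    }
    return 0;
}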
#pragma omp critical

• Critical section: a portion of code that only one thread at a time may execute
• We denote a critical section by putting the pragma
  #pragma omp critical
  in front of a block of C code
π-finding code example

double area, pi, x;
int i, n;

#pragma omp parallel for private(x) shared(area)
for (i=0; i<n; i++) {
  x = (i + 0.5)/n;
  area += 4.0/(1.0 + x*x);
}
pi = area/n;

Race condition

• We must ensure atomic updates of the shared variable area to avoid a race condition, in which one thread may "race ahead" of another and ignore its changes
Race condition (Cont'd)

• Thread A reads "11.667" into a local register
• Thread B reads "11.667" into a local register
• Thread A updates area with "11.667 + 3.765"
• Thread B ignores the write from thread A and updates area with "11.667 + 3.563"

(events ordered in time, top to bottom)
π-finding code example

double area, pi, x;
int i, n;

#pragma omp parallel for private(x) shared(area)
for (i=0; i<n; i++) {
  x = (i + 0.5)/n;
  #pragma omp critical
  area += 4.0/(1.0 + x*x);
}
pi = area/n;

#pragma omp critical protects the code within its scope by acquiring a lock before entering the critical section and releasing it after execution.
Correctness, not performance!

• As a matter of fact, using locks makes execution sequential
• To limit this effect we should use fine-grained locking (i.e. make critical sections as small as possible)
• The simple statement that computes the value of area in the previous example is translated into many simpler instructions within the compiler!
• The programmer is not aware of the real granularity of the critical section
[Figure: dump of the compiler's intermediate representation of the critical section: a call into the runtime to acquire the lock, the lock-protected operations (the critical section itself), and a call into the runtime to release the lock]
[Figure: execution timeline of the π-finding loop with the critical section. The loop bodies run in parallel, but each update of area is serialized, and threads spend time waiting for the lock]
Correctness, not performance!

• A programming pattern such as area += 4.0/(1.0 + x*x); in which we:
  • fetch the value of an operand
  • add a value to it
  • store the updated value
  is called a reduction, and is commonly supported by parallel programming APIs
• OpenMP takes care of storing partial results in private variables and of combining the partial results after the loop
Correctness, not performance!

double area, pi, x;
int i, n;

#pragma omp parallel for private(x) reduction(+:area)
for (i=0; i<n; i++) {
  x = (i + 0.5)/n;
  area += 4.0/(1.0 + x*x);
}
pi = area/n;

The reduction clause instructs the compiler to create private copies of the area variable for every thread. At the end of the loop the partial sums are combined into the shared area variable.
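A complete, compilable version of the π-finding program with the reduction clause; the value of n, the initialization of area and the printout are assumptions added to make it runnable:

/* Build with: gcc -fopenmp pi.c -o pi */
#include <stdio.h>

int main(void)
{
    double area = 0.0, pi, x;
    int i, n = 1000000;

    #pragma omp parallel for private(x) reduction(+:area)
    for (i = 0; i < n; i++) {
        x = (i + 0.5) / n;
        area += 4.0 / (1.0 + x * x);
    }
    pi = area / n;

    printf("pi ~= %f\n", pi);
    return 0;
}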
Summary

Customizing OpenMP for Efficient Exploitation of the Memory Hierarchy

• Memory latency is well recognized as a severe performance bottleneck
• MPSoCs feature a complex memory hierarchy, with multiple cache levels, private and shared on-chip and off-chip memories
• Using the memory hierarchy efficiently is of the utmost importance to exploit the computational power of MPSoCs, but..
Customizing OpenMP for Efficient Exploitation of the Memory Hierarchy

• ..it is a difficult task, requiring a deep understanding of the application and its memory access pattern
• The OpenMP standard doesn't provide any facilities to deal with data placement and partitioning
• Customization of the programming interface would bring the advantages of OpenMP to the MPSoC world
Extending OpenMP to support Data Distribution

• We need a custom directive that enables specific code analysis and transformation
• When static code analysis can't tell how to distribute data, we must rely on profiling
• The runtime is responsible for exploiting this information to efficiently map arrays to memories
Extending OpenMP to support Data Distribution

• The entire process is driven by the custom #pragma omp distributed directive
• Originally stack-allocated arrays are transformed into pointers to allow for their explicit placement throughout the memory hierarchy within the program

Original program:

{
  int A[m];
  float B[n];
  #pragma omp distributed (A, B)
  …
}

Transformed program:

{
  int *A;
  float *B;
  A = distributed_malloc (m);
  B = distributed_malloc (n);
  …
}
Extending OpenMP to support Data Distribution

• The transformed program invokes the runtime to retrieve profile information, which drives data placement
• When no profile information is found, distributed_malloc returns a pointer to the shared memory
Data partitioning

• The OpenMP model is focused on loop parallelism
• In this parallelization scheme, multiple threads may access different sections (discontiguous addresses) of shared arrays
• Data partitioning is the process of tiling data arrays and placing the tiles in memory such that a maximum number of accesses are satisfied from local memory
• The most obvious implementation of this concept is the data cache, but..
  • inter-array discontiguity often causes cache conflicts
  • embedded systems impose constraints on energy, predictability and real-time behaviour that often make caches unsuitable
A simple example

#pragma omp parallel for
for (i = 0; i < 4; i++)
  for (j = 0; j < 6; j++)
    A[ i ][ j ] = 1.0;

[Figure: four CPUs, each with a private scratchpad (SPM), connected through an interconnect to a shared memory; the 4x6 iteration space and the data space of A(i,j)]

• The iteration space is partitioned between the processors
• The data space overlaps with the iteration space: the array is accessed with the loop induction variables, so each processor accesses a different tile
• The compiler can therefore split the matrix into four smaller arrays and allocate them onto the scratchpads
• No access to remote memories through the bus, since data is allocated locally
Another example

Different locations within the array hist are accessed by many processors:

#pragma omp parallel for
for (i = 0; i < 4; i++)
  for (j = 0; j < 6; j++)
    hist[A[i][j]]++;

[Figure: same platform; A is a 4x6 array of arbitrary values (row 0: 3 7 7 5 4 4; row 1: 3 2 2 3 0 0; row 2: 1 1 0 2 3 5; row 3: 1 1 4 5 5 4), so the element of hist updated at each iteration does not depend on the iteration indices]
Another example

• In this case static code analysis can't tell anything about the array access pattern
• How to decide the most efficient partitioning?
  • Split the array in as many tiles as there are processors
  • Use access count information to map each tile to the processor that accesses it most
Another example

Now processors need to access remote scratchpads, since they work on multiple tiles!

Profiled access counts for the four tiles of hist:

                TILE 1   TILE 2   TILE 3   TILE 4
PROC1              1        1        3        2
PROC2              2        4        0        0
PROC3              3        2        1        0
PROC4              2        0        4        1

[Figure: same platform; the tiles of hist are distributed across the scratchpads according to these counts, while A resides in shared memory]
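A small self-contained sketch of this mapping policy: each tile goes to the processor with the highest access count. The counts are the ones from the table above; the argmax per tile is the only logic shown, not the actual allocation step:

/* Map each tile to the processor that accesses it most often */
#include <stdio.h>

#define NTILES 4
#define NPROCS 4

int main(void)
{
    /* access_count[tile][proc], taken from the table above */
    int access_count[NTILES][NPROCS] = {
        {1, 2, 3, 2},   /* TILE 1 */
        {1, 4, 2, 0},   /* TILE 2 */
        {3, 0, 1, 4},   /* TILE 3 */
        {2, 0, 0, 1},   /* TILE 4 */
    };

    for (int t = 0; t < NTILES; t++) {
        int best = 0;
        for (int p = 1; p < NPROCS; p++)
            if (access_count[t][p] > access_count[t][best])
                best = p;
        /* tile t would be allocated in the scratchpad of processor `best` */
        printf("TILE %d -> PROC%d\n", t + 1, best + 1);
    }
    return 0;
}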
Problem with data partitioning

• If there is no overlap between the iteration space and the data space, it may happen that multiple processors need to access different tiles
• In this case data partitioning introduces addressing difficulties, because the data tiles can become discontiguous in physical memory
• How do we generate efficient code to access data when performing loop and data partitioning?
• We can further extend the OpenMP programming interface to deal with that!
• The programmer only has to specify the intention of partitioning an array throughout the memory hierarchy, and the compiler does the necessary instrumentation
Code Instrumentation

In general, the steps for addressing an array element using tiling are:
• Compute the offset w.r.t. the base address
• Identify the tile to which this element belongs
• Re-compute the index relative to the current tile
• Load the tile base address from a metadata array

This metadata array is populated during the memory allocation step of the tool-flow. It relies on access count information to figure out the most efficient mapping of array tiles to memories. A sketch of the four addressing steps is given below.
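A minimal, self-contained sketch (assumed layout, not the actual compiler output) of the four addressing steps, for one array of 24 ints split into 4 tiles of 6 elements. The tiles[] table plays the role of the metadata array holding each tile's base address; here the tiles are plain static buffers, whereas in the real tool-flow they would be placed in different memories:

/* Addressing a tiled array through a metadata table of tile base addresses */
#include <stdio.h>

#define NTILES    4
#define TILE_SIZE 6

static int tile_mem[NTILES][TILE_SIZE];        /* stand-ins for SPM regions */
static int *tiles[NTILES] = {
    tile_mem[0], tile_mem[1], tile_mem[2], tile_mem[3]
};

/* Address element `elem` (logical index into the original flat array). */
static int *tiled_addr(int elem)
{
    int offset = elem;                  /* 1. offset w.r.t. the base address   */
    int tile   = offset / TILE_SIZE;    /* 2. tile this element belongs to     */
    int index  = offset % TILE_SIZE;    /* 3. index relative to that tile      */
    int *base  = tiles[tile];           /* 4. tile base address from metadata  */
    return &base[index];
}

int main(void)
{
    /* equivalent of A[i][j] = i*6 + j for a 4x6 array, going through tiles */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 6; j++)
            *tiled_addr(i * 6 + j) = i * 6 + j;

    printf("A[2][3] = %d\n", *tiled_addr(2 * 6 + 3));   /* prints 15 */
    return 0;
}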
Extending OpenMP to support data partitioning

The instrumentation process is driven by the custom tiled clause, which can be coupled with every parallel and work-sharing construct.

Original program:

#pragma omp parallel tiled(A)
{
  …
  /* Access memory */
  A[i][j] = foo();
  …
}

Instrumented program:

{
  /* Compute offset, tile and
     index for distributed array */
  int offset = …;
  int tile = …;
  int index = …;
  /* Read tile base address */
  int *base = tiles[dvar][tile];
  /* Access memory */
  base[index] = foo();
  …
}