
COM503 Parallel Computer Architecture & Programming
Lecture 5. OpenMP Intro.
Prof. Taeweon Suh
Computer Science Education
Korea University
References
• Many of the lecture slides are based on the following web
material, with some modifications
 https://computing.llnl.gov/tutorials/openMP/
• Official OpenMP page
 http://openmp.org/
 It contains up-to-date information on OpenMP
 It also includes tutorials, example codes from books, etc.
OpenMP
• What does OpenMP stand for?
 Short version: Open Multi-Processing
 Long version: Open specifications for Multi-Processing via collaborative work
between interested parties from the hardware and software industry, government,
and academia
• Standardized:
 Jointly defined and endorsed by a group of major computer hardware and software
vendors
• Comprised of three primary API components:
 Compiler Directives (#pragma)
 Runtime Library Routines
 Environment Variables
• Portable:
 The API is specified for C/C++ and Fortran
 Implementations exist for most major platforms, including Linux and Windows
History
• In the early 90's, shared-memory machine vendors
supplied similar, directive-based, Fortran programming
extensions
 The user would augment a serial Fortran program with directives
specifying which loops were to be parallelized
 The compiler would be responsible for automatically parallelizing
such loops across the SMP processors
• The OpenMP standard specification started in 1997
 Led by the OpenMP Architecture Review Board (ARB)
 The ARB members included Compaq/Digital, HP, Intel, IBM,
Kuck & Associates, Inc. (KAI), Silicon Graphics, Sun
Microsystems, and the U.S. Department of Energy
Release History
• Our textbook covers the OpenMP 2.5 specification
 Starting with OpenMP 2.5, the specification has been released jointly for C/C++ and Fortran
 The most recent release shown on the timeline is OpenMP 4.0 (July 2013)
Shared Memory Machines
• OpenMP is designed for shared memory machines
(UMA or NUMA)
Fork-Join Model
• Execution begins as a single process (the master thread)
 The master thread executes sequentially until the first parallel region construct is
encountered
• FORK: the master thread creates a team of parallel threads
 The statements in the program that are enclosed by the parallel region
construct are then executed in parallel among the various team threads
• JOIN: when the team threads complete the statements in the parallel
region, they synchronize and terminate, leaving only the master thread
(see the code sketch below)
Source: https://computing.llnl.gov/tutorials/openMP/#Introduction
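A minimal sketch of the fork-join pattern (not from the slides; the runtime calls are standard OpenMP, and the thread count printed depends on the system and environment):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("Before the parallel region: master thread only\n");

    #pragma omp parallel              /* FORK: a team of threads is created */
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                 /* JOIN: implicit barrier, team ends  */

    printf("After the parallel region: master thread only\n");
    return 0;
}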
Other Parallel Programming Models?
• MPI: Message Passing Interface
 Developed for distributed-memory architectures, where multiple
processes execute independently and communicate data
• Most widely used in the high-end technical computing community,
where clusters are common
 Most vendors of shared memory systems also provide MPI
implementations
 Most MPI implementations consist of a specific set of APIs
callable from C, C++, Fortran, or Java
 MPI implementations
• MPICH
 Freely available, portable implementation of MPI
 Free software, available for most flavors of Unix and Windows
• OpenMPI
Other Parallel Programming Models?
• Pthreads: POSIX (Portable Operating System Interface) Threads
 Shared-memory programming model
 Defined as a set of C and C++ programming types and procedure calls
• A collection of routines for creating, managing, and coordinating a
collection of threads
• Programming with Pthreads is much more complex than with OpenMP
Serial Code Example
• Serial version of dot product program
 The dot product is an algebraic operation that takes two equal-length
sequences of numbers (usually coordinate vectors) and returns a single
number obtained by multiplying corresponding entries and adding up those
products
#include <stdio.h>

int main(argc, argv)
int argc;
char * argv[];
{
    double sum;
    double a[256], b[256];
    int i, n;

    n = 256;
    for (i = 0; i < n; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    sum = 0.0;
    for (i = 0; i < n; i++) {
        sum = sum + a[i] * b[i];
    }
    printf("sum = %9.2lf\n", sum);
}
MPI Example
• MPI
 To compile, ‘mpicc dot_product_mpi.c -o dot_product_mpi’
 To run, ‘mpirun -np 4 -machinefile machine_file dot_product_mpi’

#include <stdio.h>
#include <mpi.h>

int main(argc, argv)
int argc;
char * argv[];
{
    double sum, sum_local;
    double a[256], b[256];
    int i, n;
    int numprocs, myid, my_first, my_last;

    n = 256;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    printf("#procs = %d\n", numprocs);

    my_first = myid * n / numprocs;
    my_last  = (myid + 1) * n / numprocs;

    for (i = 0; i < n; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    sum_local = 0.0;
    for (i = my_first; i < my_last; i++) {
        sum_local = sum_local + a[i] * b[i];
    }

    MPI_Allreduce(&sum_local, &sum, 1, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);

    if (myid == 0) printf("sum = %9.2lf\n", sum);

    MPI_Finalize();
    return 0;
}
Pthreads Example
• Pthreads
 To compile, ‘gcc dot_product_pthread.c -o dot_product_pthread -pthread’

#include <stdio.h>
#include <pthread.h>

#define NUMTHRDS 4

double sum = 0;
double a[256], b[256];
int n = 256;

pthread_t thds[NUMTHRDS];
pthread_mutex_t mutexsum;

void *dotprod(void *arg);

int main(argc, argv)
int argc;
char * argv[];
{
    pthread_attr_t attr;
    void *status;
    int i;

    for (i = 0; i < n; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    pthread_mutex_init(&mutexsum, NULL);
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    for (i = 0; i < NUMTHRDS; i++) {
        pthread_create(&thds[i], &attr, dotprod, (void *)i);
    }
    pthread_attr_destroy(&attr);

    for (i = 0; i < NUMTHRDS; i++) {
        pthread_join(thds[i], &status);
    }

    printf("sum = %9.2lf\n", sum);
    pthread_mutex_destroy(&mutexsum);
    pthread_exit(NULL);
}
Pthreads Example
• Pthreads
 To compile, ‘gcc dot_product_pthread.c -o dot_product_pthread -pthread’

void *dotprod(void *arg)
{
    int myid, i, my_first, my_last;
    double sum_local;

    myid = (int) arg;
    my_first = myid * n / NUMTHRDS;
    my_last  = (myid + 1) * n / NUMTHRDS;

    sum_local = 0.0;
    for (i = my_first; i < my_last; i++) {
        sum_local = sum_local + a[i] * b[i];
    }

    pthread_mutex_lock(&mutexsum);
    sum = sum + sum_local;
    pthread_mutex_unlock(&mutexsum);

    pthread_exit((void *) 0);
}
Another Pthread Example
• Pthreads
 To compile, ‘gcc pthread_creation.c -o pthread_creation -pthread’

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *print_message_function( void *ptr );

int main()
{
    pthread_t thread1, thread2;
    char *message1 = "Thread 1";
    char *message2 = "Thread 2";
    int iret1, iret2;

    /* Create independent threads each of which will execute function */
    iret1 = pthread_create( &thread1, NULL, print_message_function, (void*) message1);
    iret2 = pthread_create( &thread2, NULL, print_message_function, (void*) message2);

    /* Wait till threads are complete before main continues. Unless we */
    /* wait we run the risk of executing an exit which will terminate  */
    /* the process and all threads before the threads have completed. */
    pthread_join( thread1, NULL);
    pthread_join( thread2, NULL);

    printf("Thread 1 returns: %d\n", iret1);
    printf("Thread 2 returns: %d\n", iret2);
    exit(0);
}

void *print_message_function( void *ptr )
{
    char *message;
    message = (char *) ptr;
    printf("%s \n", message);
    return NULL;
}
OpenMP Example
• OpenMP
 To compile, ‘gcc dot_product_omp.c -o dot_product_omp -fopenmp’

#include <stdio.h>
#include <omp.h>

int main(argc, argv)
int argc;
char * argv[];
{
    double sum;
    double a[256], b[256];
    int i, n;

    n = 256;
    for (i = 0; i < n; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum = sum + a[i] * b[i];
    }
    printf("sum = %9.2lf\n", sum);
}
OpenMP
• Compiler Directive Based
 Parallelism is specified through the use of compiler directives in source code
• Nested Parallelism Support
 Parallel regions may be placed inside other parallel regions
• Dynamic Threads
 The number of threads used to execute parallel regions can be altered dynamically
• Memory Model
 OpenMP provides a "relaxed-consistency" memory model. In other words, threads can
cache their data and are not required to maintain exact consistency with real memory
all the time.
 When it is critical that all threads view a shared variable identically, the
programmer is responsible for ensuring that the variable is flushed by all
threads as needed (see the flush sketch below).
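A minimal flush sketch (not from the slides): a producer/consumer hand-off through shared variables, following the classic OpenMP flush example; the variable names are illustrative.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data = 0, flag = 0;

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;                     /* produce the value               */
            #pragma omp flush(data, flag)
            flag = 1;                      /* signal that data is ready       */
            #pragma omp flush(flag)
        } else {
            while (1) {                    /* busy-wait for the signal        */
                #pragma omp flush(flag)
                if (flag) break;
            }
            #pragma omp flush(data)        /* make the producer's data visible */
            printf("data = %d\n", data);
        }
    }
    return 0;
}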
OpenMP Parallel Construct
• A parallel region is a block of code executed by multiple threads. This is the fundamental
OpenMP parallel construct.

#pragma omp parallel [clause[[,] clause]…]
structured block

• This construct is used to specify the block that should be executed in parallel
 A team of threads is created to execute the associated parallel region
 Each thread in the team is assigned a unique thread number (0 to #threads-1)
 The master is a member of that team and has thread number 0
 Starting from the beginning of this parallel region, the code is duplicated and all threads will
execute that code
 It does not distribute the work of the region among the threads in a team unless the
programmer uses the appropriate syntax (a work-sharing construct) to specify this action
• There is an implied barrier at the end of a parallel region. Only the master thread
continues execution past this point.
• Code not enclosed by a parallel construct is executed serially
Example
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(argc, argv)
int argc;
char * argv[];
{
    #pragma omp parallel
    {
        printf("Parallel region is executed by thread ID %d\n", omp_get_thread_num());

        if (omp_get_thread_num() == 2) {
            printf(" Thread %d does things differently\n", omp_get_thread_num());
        }
    }
}
How Many Threads?
• The number of threads in a parallel region is determined by the
following factors, in order of precedence:
1. Evaluation of the IF clause
 The IF clause is supported on the parallel construct only
 #pragma omp parallel if (n > 5)
2. num_threads clause with a parallel construct
 The num_threads clause is supported on the parallel construct only
 #pragma omp parallel num_threads(8)
3. omp_set_num_threads() library function
 omp_set_num_threads(8)
4. OMP_NUM_THREADS environment variable
 In bash, use ‘export OMP_NUM_THREADS=4’
5. Implementation default - usually the number of CPUs on a node,
even though it could be dynamic
Example
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NUM_THREADS 8

int main(argc, argv)
int argc;
char * argv[];
{
    int n = 6;

    omp_set_num_threads(NUM_THREADS);

    //#pragma omp parallel
    #pragma omp parallel if (n > 5) num_threads(n)
    {
        printf("Parallel region is executed by thread ID %d\n", omp_get_thread_num());

        if (omp_get_thread_num() == 2) {
            printf(" Thread %d does things differently\n", omp_get_thread_num());
        }
    }
}
Work-Sharing Constructs
• Work-sharing constructs are used to distribute computation
among the threads in a team
 #pragma omp for
 #pragma omp sections
 #pragma omp single
• By default, threads wait at a barrier at the end of a work-sharing
region until the last thread has completed its share of the work
 However, the programmer can suppress this with the nowait
clause (see the sketch below)
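A minimal nowait sketch (not from the slides; the array sizes and names are illustrative). The nowait clause removes the implicit barrier after the first loop, so threads that finish early may start the second loop immediately; this is safe here because the two loops are independent.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i, a[64], b[64];

    #pragma omp parallel
    {
        #pragma omp for nowait      /* no barrier after this loop        */
        for (i = 0; i < 64; i++)
            a[i] = i;

        #pragma omp for             /* implicit barrier after this loop  */
        for (i = 0; i < 64; i++)
            b[i] = 2 * i;
    }

    printf("a[10] = %d, b[10] = %d\n", a[10], b[10]);
    return 0;
}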
Loop Construct
• The loop construct causes the iterations of the immediately following
loop to be distributed among the threads of the team and executed in parallel

#pragma omp for [clause[[,] clause]…]
for loop

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

int main(argc, argv)
int argc;
char * argv[];
{
    int n = 8;
    int i;

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel shared(n) private(i)
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            printf(" Thread %d executes loop iteration %d\n", omp_get_thread_num(), i);
    }
}
Section Construct
• The sections construct is the easiest way to have different
threads execute different kinds of work

#pragma omp sections [clause[[,] clause]…]
{
[#pragma omp section]
structured block
[#pragma omp section]
structured block
}

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NUM_THREADS 4

void funcA();
void funcB();

int main(argc, argv)
int argc;
char * argv[];
{
    int n = 8;
    int i;

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel sections
    {
        #pragma omp section
        funcA();

        #pragma omp section
        funcB();
    }
}

void funcA()
{
    printf("In funcA: this section is executed by thread %d\n", omp_get_thread_num());
}

void funcB()
{
    printf("In funcB: this section is executed by thread %d\n", omp_get_thread_num());
}
Section Construct
• At run time, the specified code blocks are executed
by the threads in the team
 Each thread executes one code block at a time
 Each code block will be executed exactly once
 If there are fewer threads than code blocks, some threads
execute multiple code blocks
 If there are fewer code blocks than threads, the remaining
threads will be idle
 Assignment of code blocks to threads is implementation-dependent
 Depending on the type of work performed in the various
code blocks and the number of threads used, this
construct might lead to a load-balancing problem
Single Construct
• The single construct specifies that the block should be executed
by one thread only

#pragma omp single [clause[[,] clause]…]
structured block

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4
#define N 8

int main(argc, argv)
int argc;
char * argv[];
{
    int i, a, b[N];

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel shared(a, b) private(i)
    {
        #pragma omp single
        {
            a = 10;
            printf("Single construct executed by thread %d\n", omp_get_thread_num());
        }

        #pragma omp for
        for (i = 0; i < N; i++) b[i] = a;
    }

    printf("After the parallel region\n");
    for (i = 0; i < N; i++)
        printf("b[%d] = %d\n", i, b[i]);
}
Single Construct
• Only one thread executes the block with the single
construct
 The other threads wait at an implicit barrier until the thread
executing the single code block has completed
• What if the single construct were omitted in the
previous example?
 Memory consistency issue?
 Performance issue?
 Would a barrier then be required before the #pragma omp for?
(see the sketch below)
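A minimal sketch of one answer (not from the slides): the previous example with the single construct removed. Every thread now stores the same value into a (benign in practice, though still a concurrent write), and an explicit barrier is needed so that no thread reads a in the loop before it has been written.

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4
#define N 8

int main(void)
{
    int i, a, b[N];

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel shared(a, b) private(i)
    {
        a = 10;                 /* now executed (redundantly) by every thread */

        #pragma omp barrier     /* make sure a is written before it is read   */

        #pragma omp for
        for (i = 0; i < N; i++) b[i] = a;
    }

    printf("After the parallel region\n");
    for (i = 0; i < N; i++)
        printf("b[%d] = %d\n", i, b[i]);
}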
Misc
• A useful Linux command
 top (display Linux tasks) provides a dynamic real-time view of a running system
• Try pressing 1, z, and H after running the top command
Misc
• Useful Linux commands
 ps -eLf
• Displays thread IDs for OpenMP and Pthreads
 top
• Displays process IDs, which can be used to monitor the processes
created by MPI
Misc
• top does not show threads
• ps -eLf
 Displays thread IDs for OpenMP and Pthreads
Backup Slides
Goal of OpenMP
• Standardization:
 Provide a standard among a variety of shared memory
architectures/platforms
• Lean and Mean:
 Establish a simple and limited set of directives for programming shared
memory machines.
 Significant parallelism can be implemented by using just 3 or 4 directives.
• Ease of Use:
 Provide the capability to incrementally parallelize a serial program, unlike
message-passing libraries, which typically require an all-or-nothing approach
 Provide the capability to implement both coarse-grain and fine-grain
parallelism
• Portability:
 Supports Fortran (77, 90, and 95), C, and C++
 Public forum for API and membership