
Computational Methods in Physics
PHYS 3437
Dr Rob Thacker
Dept of Astronomy & Physics (MM-301C)
[email protected]
Today’s Lecture

- Introduction to parallel programming
  - Concepts – what are parallel computers, what is parallel programming?
  - Why do you need to use parallel programming?
- When parallelism will be beneficial
  - Amdahl’s Law
- Very brief introduction to OpenMP

Why bother to teach this in an undergrad Physics course?

- Because parallel computing is now ubiquitous
  - Most laptops are parallel computers, for example
  - Dual/quad core chips are already standard, and in the future we can look forward to 8/16/32/64 cores per chip!
    - Actually Sun Microsystems already sells a chip with 8 cores
    - I predict that by 2012 you will be buying chips with 16 cores
  - If we want to use all this capacity, then we will need to run codes that can use more than one CPU core at a time
    - Such codes are said to be parallel
- Because not many people have these skills!
  - Exposure to these concepts will help you significantly if you want to go to grad school in an area that uses computational methods extensively
- If you are interested, an excellent essay on how computing is changing can be found here: http://view.eecs.berkeley.edu/wiki/Main_Page
Some caveats

- In two lectures we cannot cover very much on parallel computing
- We will concentrate on the simplest kind of parallel programming
  - Exposes some of the inherent problems
  - Still gives you useful increased performance
- The type of programming we’ll be looking at is often limited in terms of the maximum speed-up possible, but factors of 10 are pretty common
  - Remember, making a code run 10 times faster turns a week into a day!
Why can’t the compiler just make my code parallel for me?

- In some situations it can, but most of the time it can’t
  - You really are smarter than a compiler is!
  - There are many situations where a compiler will not be able to make something parallel but you can
  - Compilers that can attempt to parallelize code are called “autoparallelizing”
- Some people have suggested writing parallel languages that only allow the types of code that can be easily parallelized
  - These have proven to not be very popular and are too restrictive
- At present, the most popular way of parallel programming is to add additional commands to your original code
  - These commands are sometimes called pragmas or directives
Recap: von Neumann architecture

- First practical stored-program architecture
  - Developed while working on the EDVAC design
  - Machine instructions are encoded in binary & stored – key insight!
- Still in use today
- Speed is limited by the bandwidth of data between memory and processing unit
  - The “von Neumann” bottleneck

[Diagram: INPUT and OUTPUT connect to the CPU (control unit + processing unit), which exchanges data with MEMORY holding both the program and the data.]
Shared memory computers

- Program these computers using “OpenMP” extensions to C, FORTRAN

[Diagram: four CPUs attached to a single shared MEMORY via a common memory bus.]

Traditional shared memory design – all processors share a memory bus.
All of the processors see the same memory locations. This means that
programming these computers is reasonably straightforward.
Sometimes called “SMP”s, for symmetric multi-processor.
Distributed memory computers

- Program these computers using MPI or PVM extensions to C, FORTRAN

[Diagram: four CPUs, each with its own MEMORY, connected to one another by a NETWORK.]

Really a collection of computers linked together via a network.
Each processor has its own memory and must communicate with other
processors over the network to get information from other memory
locations. This is really quite difficult at times.
This is the architecture of “computer clusters” (you could actually
have each “CPU” here be a shared memory computer).
Parallel execution

- What do we mean by being able to do things in parallel?
- Suppose the input data of an operation is divided into a series of independent parts
- Processing of the parts is carried out independently
- A simple example is operations on vectors/arrays where we loop over array indices

For an array A(i), the loop can be split into independent pieces:

      Task 1            Task 2            Task 3
      do i=1,10         do i=11,20        do i=21,30
        a(i)=a(i)*2.      a(i)=a(i)*2.      a(i)=a(i)*2.
      end do            end do            end do
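
As a sketch (not from the slides), the same doubling loop written with the OpenMP directive introduced later in this lecture, so the compiler splits the iterations across threads instead of us writing three tasks by hand:

C     Sketch only: the doubling loop parallelized with OpenMP.
C     Compile with an OpenMP flag (e.g. -fopenmp for gfortran).
      program double_array
      implicit none
      integer i
      real a(30)
      do i = 1, 30
         a(i) = real(i)
      end do
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(a)
      do i = 1, 30
         a(i) = a(i)*2.
      end do
C$OMP END PARALLEL DO
      write(*,*) a
      end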
Some subtleties

- However, you can’t always do this
- Consider

      do i=2,n
        a(i)=a(i-1)
      end do

- This kind of loop has what we call a dependence
- If you update a value of a(i) before a(i-1) has been updated then you will get the wrong answer compared to running on a single processor
- We’ll talk a little more about this later, but it does mean that not every loop can be “parallelized”
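
To make the dependence concrete, here is a minimal assumed example (not from the slides): run serially, the loop copies a(1) forward through the whole array, so each iteration needs the result of the one before it, and splitting the iterations across processors would break that ordering.

C     Assumed illustration: the dependent loop run serially.
      program dependence
      implicit none
      integer i, n
      parameter (n = 8)
      real a(n)
      do i = 1, n
         a(i) = real(i)
      end do
      do i = 2, n
         a(i) = a(i-1)
      end do
C     every element now holds a(1); out-of-order updates would not
      write(*,*) a
      end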
Issues to be aware of

- Parallel computing is not about being “cool” and doing lots and lots of “flops”
  - Flops = floating point operations per second
- We want solutions to problems in a reasonable amount of time
  - Sometimes that means doing a lot of calculations – e.g. consider what we found about the number of collisions for molecules in air
- Gains from algorithmic improvements will often swamp hardware improvements
  - Don’t be brain-limited – if there is a better algorithm, use it

Algorithmic Improvements in n-body simulations

Improvements in the speed of algorithms are proportionally better than the speed increase of computers over the same time interval.
Identifying Performance Desires

                           Positive Precondition              Negative Precondition
Frequency of use           Daily                              Monthly, yearly
Code evolution timescale   Hundreds of executions             Changes each run
                           between changes
Performance Characteristics

                           Positive Precondition              Negative Precondition
Execution time             Days, hours                        Minutes
Level of synchronization   None, infrequent (every minute)    Frequent (many per second)
Data and Algorithm

                           Positive Precondition              Negative Precondition
Algorithmic complexity*    Simple                             Complex
Data structures            Regular, static                    Irregular, dynamic

*approximately the number of stages
Requirements

                           Positive Precondition              Negative Precondition
                           Must significantly increase        Need a factor of 2 increase;
                           resolution/length of integration   current resolution meets needs
How much speed-up can we achieve?

- Some parts of a code cannot be run in parallel
  - For example the loop over a(i)=a(i-1) from earlier
- Any code that cannot be executed in parallel is said to be serial or sequential
- Let’s suppose that, in terms of the total execution time of a program, a fraction fs has to be run in serial, while a fraction fp can be run in parallel on n cpus
- Equivalently, the time spent in each fraction will be ts and tp, so the total time on 1 cpu is t1cpu = ts + tp
- If we can run the parallel fraction on n cpus then it will take a time tp/n
- The total time will then be tncpu = ts + tp/n
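
A quick worked example with illustrative numbers (not from the slides): if ts = 1 hour and tp = 9 hours, then t1cpu = 10 hours, but on n = 10 cpus tncpu = 1 + 9/10 = 1.9 hours. That is a speed-up of about 5.3 rather than 10, because the serial hour cannot be shortened.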
Amdahl’s Law

- How much speed-up (Sn = t1cpu/tncpu) is feasible?
- Amdahl’s Law is the most significant limit. Given our previous results and n processors, the maximum speed-up is given by:

      Sn = t1cpu/tncpu = (ts + tp)/(ts + tp/n) = n/((n-1)fs + 1)

- Only if the serial fraction fs (= ts/(ts+tp)) is zero is perfect speed-up possible (at least in theory)
Amdahl’s Law

[Plot: speed-up Sn versus Ncpu from 1 to 120 for fs = 0.1, 0.01 and 0.001. The scaling is similar for the different fs at small Ncpu, but the curves flatten as Ncpu grows – you have to achieve excellent parallelism (very small fs) to keep gaining speed-up at ~100 cpus.]
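
For reference, a small sketch (assumed example, serial fractions chosen to match the plot) that evaluates the formula at n = 100 cpus:

C     Sketch: evaluate Sn = n/((n-1)*fs + 1) for the plotted fs values.
      program amdahl
      implicit none
      integer j, n
      real fs(3), sn
      data fs /0.1, 0.01, 0.001/
      n = 100
      do j = 1, 3
         sn = real(n)/(real(n-1)*fs(j) + 1.0)
         write(*,*) 'fs =', fs(j), '  S_100 =', sn
      end do
      end

Even with fs = 0.01, 100 cpus give a speed-up of only about 50; with fs = 0.1 it is below 10.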
What is OpenMP?

- OpenMP is a “pragma” based “application programmer interface” (API) that provides a simple extension to C/C++ and FORTRAN
  - Pragma is just a fancy word for “instructions”
- It is exclusively designed for shared memory programming
- Ultimately, OpenMP is a very simple interface to something called threads based programming
  - What actually happens when you break up a loop into pieces is that a number of threads of execution are created that can run the loop pieces in parallel
Threads based execution

- Serial execution, interspersed with parallel

[Diagram: a single master thread runs each serial section; at each parallel section extra threads are forked to share the work, then join back to the master thread.]

In practice many compilers block execution of the extra threads during serial sections; this saves the overhead of the `fork-join’ operation.
Some background to threads programming

- There is actually an entire set of commands in C to allow you to create threads
  - The most common thread standard is called POSIX
- You could, if you wanted, program with these commands
- However, OpenMP provides a simple interface to a lot of the functionality provided by threads
  - If it is simple, and does what you need, why bother going to the effort of using threads programming?
Components of OpenMP

- (Compiler) directives – the pragmas in your code
- Runtime library routines
- Environment variables (set at the Unix prompt)
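
A minimal sketch (assumed example, not from the slides) that touches all three components: a directive, two standard runtime library routines (omp_get_thread_num, omp_get_num_threads), and the OMP_NUM_THREADS environment variable set at the Unix prompt.

C     Sketch: each thread reports its id and the team size.
      program components
      implicit none
      integer omp_get_thread_num, omp_get_num_threads
      external omp_get_thread_num, omp_get_num_threads
C$OMP PARALLEL
      write(*,*) 'thread', omp_get_thread_num(),
     &           ' of', omp_get_num_threads()
C$OMP END PARALLEL
      end

Setting, for example, OMP_NUM_THREADS=4 at the prompt before running controls how many threads are created.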
OpenMP: Where did it come from?

- Prior to 1997, vendors all had their own proprietary shared memory programming commands
- Programs were not portable from one SMP to another
- Researchers were calling for some kind of portability
- The ANSI X3H5 (1994) proposal tried to formalize a shared memory standard – but ultimately failed
- OpenMP (1997) worked because the vendors got behind it and there was new growth in the shared memory market place
- Very hard for researchers to get new languages supported now – must have backing from computer vendors!
Bottomline

- For OpenMP & shared memory programming in general, one only has to worry about parallelism of work
  - This is because all the processors in a shared-memory computer can see all the same memory locations
- On distributed-memory computers one has to worry both about parallelism of the work and also the placement of data
  - Is the value I need in the memory of another processor?
- Data movement is what makes distributed-memory codes (usually written in something called MPI) so much longer – it can be highly non-trivial
  - Although it can be easy – it depends on the algorithm
First Steps

- Loop level parallelism is the simplest and easiest way to use OpenMP
  - Take each do loop and make it parallel (if possible)
- It allows you to slowly build up parallelism within your application
- However, not all loops are immediately parallelizable due to dependencies
Loop Level Parallelism

- Consider the single precision vector add-multiply operation Y = aX + Y (“SAXPY”)

C/C++ (serial):

      for (i=1;i<=n;++i) {
        Y[i]+=a*X[i];
      }

FORTRAN (serial):

      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

C/C++ (OpenMP):

      #pragma omp parallel for \
        private(i) shared(X,Y,n,a)
      for (i=1;i<=n;++i) {
        Y[i]+=a*X[i];
      }

FORTRAN (OpenMP):

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do
In more detail

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

- C$OMP is the comment pragma (sentinel) for FORTRAN; the ampersand is necessary for continuation lines
- PARALLEL DO denotes that this is a region of code for parallel execution
- DEFAULT(NONE) is good programming practice: you must declare the nature of all variables
- Thread SHARED variables: all threads can access these variables, but must not update individual memory locations simultaneously
- Thread PRIVATE variables: each thread must have their own copy of this variable (in this case i is the only private variable)
A quick note

- To be fully lexically correct you may want to include a C$OMP END PARALLEL DO
- In f90 programs use !$OMP as a sentinel
- Notice that the sentinels mean that the OpenMP commands look like comments
  - A compiler that has OpenMP compatibility turned on will recognize the directives after the sentinel
  - This means you can still compile the code on computers that don’t have OpenMP
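
As a sketch of both points (assumed free-form f90 layout, not from the slides), the SAXPY loop with the !$OMP sentinel and the optional END PARALLEL DO:

! Sketch: free-form Fortran 90 version of the SAXPY directive.
program saxpy_f90
   implicit none
   integer, parameter :: n = 1000
   integer :: i
   real :: a, X(n), Y(n)
   a = 2.0
   X = 1.0
   Y = 1.0
! n is a named constant, so it needs no data-sharing clause under DEFAULT(NONE)
!$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(i) SHARED(X,Y,a)
   do i = 1, n
      Y(i) = a*X(i) + Y(i)
   end do
!$OMP END PARALLEL DO
   write(*,*) 'Y(1) =', Y(1)
end program saxpy_f90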
How the compiler handles OpenMP

- When you compile an OpenMP code you need to add “flags” to the compile line, e.g.

      f77 -openmp -o myprogram myprogram.f

  - Unfortunately different compilers have different commands for turning on OpenMP support; the above will work on Sun machines
- When the compiler flag is turned on, you force the compiler to link in all of the additional libraries (and so on) necessary to run the threads
  - This is all transparent to you though
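
For example (an assumption about your local setup, not from the slides), with the GNU compilers the equivalent line would be:

      gfortran -fopenmp -o myprogram myprogram.f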
Requirements for parallel loops

- To divide up the work the compiler needs to know the number of iterations to be executed – the trip count must be computable
  - DO WHILE is not parallelizable using these directives
    - There is actually a way of parallelizing DO WHILE using a different set of OpenMP commands, but we don’t have time to cover that
  - The loop can only have one exit point – therefore BREAK or GOTOs are not allowed
- They must also not exhibit any of the dependencies we mentioned
  - We’ll review this more in the next lecture
  - Actually a good test for dependencies is running the loop from n to 1, rather than 1 to n. If you get a different answer, that suggests there are dependencies (see the sketch after this list)
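
A minimal sketch of that test (assumed example, not from the slides), applied to the dependent loop from earlier – the forward and reversed runs disagree, which flags the dependence:

C     Sketch: compare the loop run 2..n with the same loop run n..2.
      program reversetest
      implicit none
      integer i, n
      parameter (n = 5)
      real a(n), b(n)
      do i = 1, n
         a(i) = real(i)
         b(i) = real(i)
      end do
      do i = 2, n
         a(i) = a(i-1)
      end do
      do i = n, 2, -1
         b(i) = b(i-1)
      end do
      write(*,*) 'forward: ', a
      write(*,*) 'reversed:', b
C     forward gives 1 1 1 1 1, reversed gives 1 1 2 3 4
      end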
Performance limitations

- Each time you start and end a parallel loop there is an overhead associated with the threads
- These overheads must always be added to the time taken to calculate the loop itself
- Therefore there is a limit on the smallest loop size that will achieve speed-up
- In practice, we need roughly 5000 floating point operations in a loop for it to be worth parallelizing
  - A good rule of thumb is that any thread should have at least 1000 floating point operations
- Thus small loops are simply not worth the bother!
Summary

- Shared memory parallel computers can be programmed using the OpenMP extensions to C, FORTRAN
  - Distributed memory computers require a different parallel language
- The easiest way to use OpenMP is to make loops parallel by dividing work up among threads
  - The compiler handles most of the difficult parts of coding
  - However, not all loops are immediately parallelizable
  - Dependencies may prevent parallelization
- Loops are made to run in parallel by adding directives (“pragmas”) to your code
  - These directives appear to be comments to ordinary compilers
Next Lecture

- More details on dependencies and how we can deal with them