
FIT5174 Parallel & Distributed Systems
Lecture 7
Parallel Computer System Architectures
Acknowledgement
These slides are based on slides and material by:
Carlo Kopp
Parallel Computing
• Parallel computing is a form of computation in which many instructions are carried out simultaneously.
• It operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently (i.e. at the same time).
[Figure: serial computing versus parallel computing]
• There are several different forms of parallel computing: bit-level parallelism, instruction-level parallelism, data parallelism, and task parallelism.
Parallel Computing
• Contemporary computer applications require the processing of large amounts of data in sophisticated ways. Examples include:
 parallel databases, data mining
 oil exploration
 web search engines, web based business services
 computer-aided diagnosis in medicine
 management of national and multi-national corporations
 advanced graphics and virtual reality, particularly in the entertainment industry
 networked video and multi-media technologies
 collaborative work environments
• Ultimately, parallel computing is an attempt to minimise the time required to compute a problem, despite the performance limitations of individual CPUs / cores.
Parallel Computing Terminology
• There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy.
• Flynn's taxonomy distinguishes multi-processor computer architectures according to two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple.
• The 4 possible classifications according to Flynn:
 SISD: Single Instruction, Single Data
 SIMD: Single Instruction, Multiple Data
 MISD: Multiple Instruction, Single Data
 MIMD: Multiple Instruction, Multiple Data
Concepts and Terminology
• At the executable machine code level, programs are seen by the processor or core as a series of machine instructions, in some machine-specific binary code;
• The common format of any instruction is that of an “operation code” or “opcode” and some “operands”, which are arguments the processor/core can understand;
• Typically, operands are held in registers in the processor/core which store several bytes of data, or memory addresses pointing to locations in the machine’s main memory;
• In a “conventional” or “general purpose” processor/core a single instruction combines one opcode with two or three operands, e.g.
ADD R1, R2, R3 – add contents of R1 and R2, put result into R3
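To make the opcode/operand format concrete, here is a minimal C sketch of a hypothetical instruction word (an illustration, not any real instruction set): an 8-bit opcode and three 8-bit register numbers packed into 32 bits, then decoded and executed.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical format: | opcode | rs1 | rs2 | rd |, one byte each. */
enum { OP_ADD = 0x01 };

static uint32_t encode(uint8_t op, uint8_t rs1, uint8_t rs2, uint8_t rd) {
    return ((uint32_t)op << 24) | ((uint32_t)rs1 << 16) |
           ((uint32_t)rs2 << 8) | rd;
}

int main(void) {
    int32_t regs[16] = {0};
    regs[1] = 2; regs[2] = 3;

    uint32_t insn = encode(OP_ADD, 1, 2, 3);    /* ADD R1, R2, R3 */

    /* Decode: split the word back into opcode and operands. */
    uint8_t op  = insn >> 24;
    uint8_t rs1 = (insn >> 16) & 0xFF;
    uint8_t rs2 = (insn >> 8)  & 0xFF;
    uint8_t rd  = insn & 0xFF;

    if (op == OP_ADD)                /* execute: R3 = R1 + R2 */
        regs[rd] = regs[rs1] + regs[rs2];

    printf("R3 = %d\n", regs[3]);    /* prints R3 = 5 */
    return 0;
}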
Flynn’s Classification
[Figure: Flynn’s taxonomy, the four classes SISD, SIMD, MISD and MIMD arranged along the Instruction and Data dimensions]
Flynn’s Classification - SISD
Single Instruction, Single Data (SISD):
 A serial (non-parallel or “conventional”) computer
 Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
 Single data: only one data stream is being used as input during any one clock cycle
 Deterministic execution
 This is the oldest and, until recently, the most prevalent form of computer
 Examples: most PCs, single CPU workstations and mainframes
Flynn’s Classification - SIMD
Single Instruction, Multiple Data (SIMD):
• A type of parallel computer
• Single instruction: All processing units execute the same instruction at any given clock cycle
• Multiple data: Each processing unit can operate on a different data element
• This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.
• Best suited for specialized problems characterized by a high degree of regularity, such as image processing, matrix algebra etc.
• Synchronous (lockstep) and deterministic execution
• Two varieties: Processor Arrays and Vector Pipelines
• Examples (a modern instruction-level sketch follows this list):
 Processor Arrays: Connection Machine CM-2, MasPar MP-1, MP-2
 Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
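Today the same single-instruction, multiple-data idea survives inside ordinary CPUs as vector instructions. A minimal sketch using x86 SSE intrinsics (assumes an SSE-capable x86 CPU and compiler): each _mm_add_ps is one instruction that adds four pairs of floats in lockstep.

#include <immintrin.h>   /* x86 SSE intrinsics */
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    /* Single instruction, multiple data: one add instruction per
     * four array elements. */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);      /* 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}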
[Figure: SIMD architecture, a single instruction stream broadcast to an array of processing units, each operating on its own data element]
Flynn’s Classification - MISD
Multiple Instruction, Single Data (MISD):
• A single data stream is fed into multiple processing units.
• Each processing unit operates on the data independently via independent instruction streams.
• Few actual examples of this class of parallel computer have ever existed. One was the experimental Carnegie-Mellon computer.
• Some conceivable uses might be:
 multiple frequency filters operating on a single signal stream
 multiple cryptography algorithms attempting to crack a single coded message.
Flynn’s Classification - MIMD
Multiple Instruction, Multiple Data (MIMD):
• Currently, the most common type of parallel computer. Most modern computers fall into this category.
• Multiple Instruction: every processor may be executing a different instruction stream
• Multiple Data: every processor may be working with a different data stream
• Execution can be synchronous or asynchronous, deterministic or non-deterministic
• Examples: most current supercomputers, networked parallel computer "grids" and multi-processor SMP computers, including some types of PCs. A minimal thread-based sketch follows below.
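As an illustration (a sketch of mine, not from the slides): POSIX threads on a multi-core machine give exactly this picture, with each thread running its own instruction stream on its own data.

#include <pthread.h>
#include <stdio.h>

/* Two different instruction streams operating on different data. */
static void *sum_stream(void *arg) {
    int *data = arg, s = 0;
    for (int i = 0; i < 4; i++) s += data[i];
    printf("sum = %d\n", s);
    return NULL;
}

static void *product_stream(void *arg) {
    int *data = arg, p = 1;
    for (int i = 0; i < 4; i++) p *= data[i];
    printf("product = %d\n", p);
    return NULL;
}

int main(void) {
    int a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    pthread_t t1, t2;

    /* MIMD: different code (instructions) and different arrays (data),
     * executing concurrently. Build with: cc -pthread */
    pthread_create(&t1, NULL, sum_stream, a);
    pthread_create(&t2, NULL, product_stream, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}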
Parallel Computer Memory Architectures
• Broadly divided into three categories
  – Shared memory
  – Distributed memory
  – Hybrid
Shared Memory
• Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space.
• Multiple processors can operate independently but share the same memory resources.
• Changes in a memory location effected by one processor are visible to all other processors.
• Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA;
• Uniform Memory Access vs Non-Uniform Memory Access models. A minimal shared-memory sketch follows below.
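A minimal sketch (my illustration, not from the slides: C with OpenMP, compiled with e.g. gcc -fopenmp): every thread reads and writes the same array through one global address space, so no explicit data movement is required.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* All threads see the same array 'a': one global address space. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    /* Writes made by one thread above are visible to all threads here;
     * the reduction clause synchronizes the updates to 'sum'. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.1f, threads available = %d\n",
           sum, omp_get_max_threads());
    return 0;
}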
Parallel Computer - Shared Memory
[Figure: shared memory architecture, multiple processors accessing a common global memory]
Parallel Computer - Distributed Memory
Distributed Memory
• Distributed memory systems require a communication network to connect inter-processor memory.
• Processors have their own local memory. There is no concept of global address space across all processors.
• Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of “cache coherency” does not apply.
• When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
• The network “fabric” used for data transfers varies widely, though it can be as simple as Ethernet, or as complex as a specialised bus or switching device. A minimal message-passing sketch follows below.
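A minimal message-passing sketch (assuming MPI, the usual library for distributed memory; built with mpicc and run with mpirun -np 2): one process's memory is invisible to the other until it is explicitly sent.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicit communication: rank 0's local memory cannot be
         * read by rank 1, so the data must travel as a message. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}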
[Figure: distributed memory architecture, processors with local memories connected by an interconnection network]
Parallel Computer - Hybrid Memory
Hybrid: The largest and fastest computers in the world today employ both shared and distributed memory architectures.
• The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global.
• The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory, not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another.
• Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.
• Advantages and Disadvantages: whatever is common to both shared and distributed memory architectures.
[Figure: hybrid memory architecture, shared-memory SMP nodes connected by a network]
Parallel Programming Models
Overview
• There are several parallel programming models in common use:
– Shared Memory
– Threads
– Message Passing
– Data Parallel
– Hybrid
• Parallel programming models exist as an abstraction above
hardware and memory architectures.
• Although it might not seem apparent, these models are NOT
specific to a particular type of machine or memory architecture. In
fact, any of these models can (theoretically) be implemented on any
underlying hardware.
Parallel Computing Performance
• General speed-up formula:
  Speedup = Sequential execution time / Parallel execution time
• Execution time components:
  – Inherently sequential computations: σ(n)
  – Potentially parallel computations: φ(n)
  – Communication operations: κ(n,p)
• Combining these gives the speed-up bound:
  ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))
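These formulas are easy to explore numerically. A minimal C sketch (the cost functions σ, φ and κ below are made-up illustrative assumptions, not from the slides):

#include <stdio.h>

/* Illustrative cost models only: a fixed sequential part, a
 * parallelizable part proportional to n, and a communication cost
 * that grows with log2(p). */
static double sigma(double n)           { return 1000.0; }
static double phi(double n)             { return n; }
static double kappa(double n, double p) {
    double logp = 0.0;
    while (p > 1.0) { logp += 1.0; p /= 2.0; }
    return 10.0 * logp;
}

/* psi(n,p) <= (sigma + phi) / (sigma + phi/p + kappa) */
static double speedup(double n, double p) {
    return (sigma(n) + phi(n)) / (sigma(n) + phi(n) / p + kappa(n, p));
}

int main(void) {
    double n = 100000.0;
    for (int p = 1; p <= 64; p *= 2)
        printf("p = %2d  speedup <= %6.2f  efficiency <= %5.2f\n",
               p, speedup(n, p), speedup(n, p) / p);
    return 0;
}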
Speed-up Formula
[Figure: computation time φ(n)/p and communication time κ(n,p) plotted against the number of processors; their sum φ(n)/p + κ(n,p) determines the achievable speed-up]
Amdahl’s Law of Speed-up
• It states that a small portion of the program which cannot be parallelized will limit the overall speed-up available from parallelization.
• Any large mathematical or engineering problem will typically consist of several parallelizable parts and several non-parallelizable (sequential) parts. This relationship is given by the equation:
  S = 1 / (1 − P)
  where S is the maximum speed-up of the program (as a factor of its original sequential runtime), and P is the fraction that is parallelizable.
Interesting Amdahl Observation
• If the sequential portion of a program is 10% of the
runtime, we can get no more than a 10 x speed-up,
regardless of how many processors are added.
• This puts an upper limit on the usefulness of adding
more parallel execution units.
Amdahl’s Law
[Figure: Amdahl’s law, speed-up versus number of processors for various parallel fractions P]
Parallel Efficiency
• Efficiency:
  Efficiency = Speedup / Processors
• 0 ≤ ε(n,p) ≤ 1
• Amdahl’s law:
  ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))
  ε(n,p) ≤ (σ(n) + φ(n)) / (p σ(n) + φ(n) + p κ(n,p))
  Ignoring communication, ψ(n,p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)
• Let f = σ(n)/(σ(n) + φ(n)); i.e., f is the fraction of the code which is inherently sequential. Then:
  ψ ≤ 1 / (f + (1 − f)/p)
Examples
• 95% of a program’s execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?
  ψ ≤ 1 / (0.05 + (1 − 0.05)/8) ≈ 5.9
• 20% of a program’s execution time is spent within inherently sequential code. What is the limit to the speedup achievable by a parallel version of the program?
  lim(p→∞) 1 / (0.2 + (1 − 0.2)/p) = 1 / 0.2 = 5
Amdahl’s Law Limitations
• Ignores κ(n,p), so overestimates speedup
• Assumes f constant, so underestimates speedup achievable
[Figure: speedup versus processors for n = 100, n = 1,000 and n = 10,000, illustrating the Amdahl effect]
Amdahl Effect
• Typically κ(n,p) has lower complexity than φ(n)/p
• As n increases, φ(n)/p dominates κ(n,p)
• As n increases, speedup increases
• As n increases, sequential fraction f decreases.
Gustafson’s Law
• Gustafson's Law (also known as Gustafson-Barsis' law, 1988) states that any sufficiently large problem can be efficiently parallelized.
• Gustafson's Law is closely related to Amdahl's law, which gives a limit to the degree to which a program can be sped up due to parallelization.
  S(P) = P − α × (P − 1)
  where P is the number of processors, S is the speedup, and α is the non-parallelizable fraction of the process.
• Gustafson's law addresses the shortcomings of Amdahl's law, which cannot scale to match availability of computing power as the machine size increases. A sketch contrasting the two laws follows below.
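A minimal sketch (an illustration, using a made-up 10% serial fraction) contrasting the two bounds:

#include <stdio.h>

/* Amdahl (fixed problem size):  S = 1 / (f + (1 - f)/p) */
static double amdahl(double f, double p)    { return 1.0 / (f + (1.0 - f) / p); }

/* Gustafson (fixed time, scaled problem):  S(P) = P - alpha*(P - 1) */
static double gustafson(double a, double p) { return p - a * (p - 1.0); }

int main(void) {
    double f = 0.1;   /* 10% serial fraction (illustrative) */
    for (int p = 1; p <= 1024; p *= 4)
        printf("p = %4d  Amdahl <= %6.2f   Gustafson = %7.2f\n",
               p, amdahl(f, p), gustafson(f, p));
    /* Amdahl saturates near 1/f = 10; Gustafson keeps growing because
     * the problem size is assumed to grow with the machine. */
    return 0;
}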
Gustafson’s Law
Also,
• It removes the fixed problem size or fixed computation load on the parallel processors: instead, it proposes a fixed-time concept which leads to scaled speed-up.
• Amdahl's law is based on a fixed workload or fixed problem size. It implies that the sequential part of a program does not change with respect to machine size (i.e., the number of processors), while the parallel part is evenly distributed over n processors.
Performance Summary
• Performance terms
– Speedup
– Efficiency
• What prevents linear speedup?
– Serial operations
– Communication operations
– Process start-up
– Imbalanced workloads
– Architectural limitations
• Analyzing parallel performance
– Amdahl’s Law
– Gustafson-Barsis’ Law
Parallel Programming Examples
• This example demonstrates calculations on 2-dimensional array elements, with the computation on each array element being independent from other array elements.
• The serial program calculates one element at a time in sequential order.
• Serial code could be of the form:
do j = 1, n
    do i = 1, m
        a(i,j) = fcn(i,j)
    end do
end do
• The calculation of elements is independent of one another - leads to an embarrassingly parallel situation.
• The problem should be computationally intensive.
Parallel Programming - 2D Example
• Array elements are distributed so that each processor owns a portion of an array (subarray).
• Independent calculation of array elements ensures there is no need for communication between tasks.
• The distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1) through the subarrays. Unit stride maximizes cache/memory usage.
• After the array is distributed, each task executes the portion of the loop corresponding to the data it owns. For example:
do j = mystart, myend
    do i = 1, m
        a(i,j) = fcn(i,j)
    end do
end do
• Notice that only the outer loop variables are different from the serial solution. An equivalent shared-memory sketch follows below.
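A minimal C sketch of the same decomposition (fcn here is a stand-in for the slides' unspecified per-element function; OpenMP hands each thread a contiguous block of outer-loop iterations, playing the role of mystart..myend):

#include <omp.h>
#include <stdio.h>

#define M 4
#define N 8

/* Stand-in for the slides' fcn(i,j): any computation that depends
 * only on the element's own indices. */
static double fcn(int i, int j) { return 10.0 * i + j; }

int main(void) {
    double a[M][N];

    /* Each thread independently computes its own block of columns;
     * no communication is needed because elements are independent. */
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            a[i][j] = fcn(i, j);

    printf("a[3][7] = %.1f\n", a[3][7]);   /* 37.0 */
    return 0;
}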
Pseudo-code
find out if I am MASTER or WORKER
if I am MASTER
    initialize the array
    send each WORKER info on part of array it owns
    send each WORKER its portion of initial array
    receive from each WORKER results
else if I am WORKER
    receive from MASTER info on part of array I own
    receive from MASTER my portion of initial array
    # calculate my portion of array
    do j = my_first_column, my_last_column
        do i = 1,n
            a(i,j) = fcn(i,j)
        end do
    end do
    send MASTER results
endif
Pi Calculation : Serial solution
• The value of PI can be calculated in a number of ways. Consider the following method of approximating PI:
  – Inscribe a circle in a square
  – Randomly generate points in the square
  – Determine the number of points in the square that are also in the circle
  – Let r be the number of points in the circle divided by the number of points in the square
  – PI ~ 4 r
  – Note that the more points generated, the better the approximation
Pi Calculation : Serial solution
• Serial pseudo code for this procedure:
npoints = 10000
circle_count = 0
do j = 1,npoints
    generate 2 random numbers between 0 and 1
    xcoordinate = random1 ; ycoordinate = random2
    if (xcoordinate, ycoordinate) inside circle
    then circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints
• Note that most of the time in running this program would be spent executing the loop
• Leads to an embarrassingly parallel solution
  – Computationally intensive
  – Minimal communication
  – Minimal I/O
A runnable C version of the serial algorithm follows below.
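A minimal runnable C version of the serial pseudo code (rand() is used only for brevity; a production code would use a better generator):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int npoints = 1000000;
    int circle_count = 0;

    srand(12345);   /* fixed seed so the run is repeatable */
    for (int j = 0; j < npoints; j++) {
        /* generate 2 random numbers between 0 and 1 */
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        /* point inside the (quarter) unit circle? */
        if (x * x + y * y <= 1.0)
            circle_count++;
    }

    printf("PI ~ %f\n", 4.0 * circle_count / npoints);
    return 0;
}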
Pi Calculation : Parallel Solution
• Parallel strategy: break the loop into portions that can be executed by the tasks.
• For the task of approximating Pi:
  – Each task executes its portion of the loop a number of times.
  – Each task can do its work without requiring any information from the other tasks (there are no data dependencies).
  – Uses the SPMD** model. One task acts as master and collects the results.
• Pseudo code solution on the next slide; the lines added for parallelism are the task count, the per-task loop bound, and the MASTER/WORKER exchange of counts.
[**SPMD: (Single Process, Multiple Data) or (Single Program, Multiple Data). Tasks are split up and run simultaneously on multiple processors with different input in order to obtain results faster. SPMD is the most common style of parallel programming. It is a subcategory of MIMD in Flynn’s Taxonomy.]
Pi Calculation : Parallel Solution
pseudocode
npoints = 10000
circle_count = 0
p = number of tasks
num = npoints/p
find out if I am MASTER or WORKER
do j = 1,num
    generate 2 random numbers between 0 and 1
    xcoordinate = random1 ; ycoordinate = random2
    if (xcoordinate, ycoordinate) inside circle
    then circle_count = circle_count + 1
end do
if I am MASTER
    receive from WORKERS their circle_counts
    compute PI (use MASTER and WORKER calculations)
else if I am WORKER
    send to MASTER circle_count
endif
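A minimal MPI rendering of the same SPMD scheme (an illustration; MPI_Reduce replaces the explicit MASTER/WORKER message exchange):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, p;
    long npoints = 1000000, local_count = 0, circle_count = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    srand(12345 + rank);         /* a different random stream per task */
    long num = npoints / p;      /* each task runs its share of the loop */
    for (long j = 0; j < num; j++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            local_count++;
    }

    /* Combine every task's count at the master (rank 0). */
    MPI_Reduce(&local_count, &circle_count, 1, MPI_LONG, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("PI ~ %f\n", 4.0 * circle_count / (double)(num * p));

    MPI_Finalize();
    return 0;
}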
1-D Wave Equation Parallel Solution
• Implement as an SPMD model
• The entire amplitude array is partitioned and distributed as
sub-arrays to all tasks. Each task owns a portion of the total
array.
• Load balancing: all points require equal work, so the points
should be divided equally
• A block decomposition would have the work partitioned into
the number of tasks as chunks, allowing each task to own
mostly contiguous data points.
1-D Wave Equation Parallel Solution
• Communication need only occur on data borders. The larger the block size, the less the communication.
• The equation to be solved is the one-dimensional wave equation:
  A(i, t+1) = (2.0 * A(i, t)) - A(i, t-1) + (c * (A(i-1, t) - (2.0 * A(i, t)) + A(i+1, t)))
  where c is a constant
• Note that amplitude will depend on previous timesteps (t, t-1) and neighboring points (i-1, i+1). Data dependence will mean that a parallel solution will involve communications.
1-D Wave Equation Parallel Solution
find out number of tasks and task identities
# Identify left and right neighbors
left_neighbor = mytaskid - 1 ; right_neighbor = mytaskid + 1
if mytaskid = first then left_neighbor = last
if mytaskid = last then right_neighbor = first
find out if I am MASTER or WORKER
if I am MASTER
    initialize array ; send each WORKER starting info and subarray
else if I am WORKER
    receive starting info and subarray from MASTER
endif
# Update values for each point along string
# In this example the master participates in calculations
do t = 1, nsteps
    send left endpoint to left neighbor ; receive left endpoint from right neighbor
    send right endpoint to right neighbor ; receive right endpoint from left neighbor
    # Update points along line
    do i = 1, npoints
        newval(i) = (2.0 * values(i)) - oldval(i) + (sqtau * (values(i-1) - (2.0 * values(i)) + values(i+1)))
    end do
end do
# Collect results and write to file
if I am MASTER
    receive results from each WORKER ; write results to file
else if I am WORKER
    send results to MASTER
endif
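A minimal serial C sketch of the update step (the initial pluck and fixed endpoints are illustrative assumptions; in the parallel version the i-1 and i+1 border values are exactly what neighbouring tasks exchange each timestep). Build with: cc wave.c -lm

#include <stdio.h>
#include <math.h>

#define NPOINTS 100
#define NSTEPS  200

int main(void) {
    double oldval[NPOINTS], values[NPOINTS], newval[NPOINTS];
    const double sqtau = 0.3;               /* the constant c */
    const double PI = 3.14159265358979;

    /* Initial pluck: one period of a sine wave, at rest at t = 0. */
    for (int i = 0; i < NPOINTS; i++)
        oldval[i] = values[i] = sin(2.0 * PI * i / (NPOINTS - 1));

    for (int t = 0; t < NSTEPS; t++) {
        /* Interior update: each point needs its i-1 and i+1 neighbours
         * from the previous timestep. */
        for (int i = 1; i < NPOINTS - 1; i++)
            newval[i] = 2.0 * values[i] - oldval[i]
                      + sqtau * (values[i-1] - 2.0 * values[i] + values[i+1]);
        newval[0] = newval[NPOINTS-1] = 0.0;  /* fixed string ends */

        for (int i = 0; i < NPOINTS; i++) {   /* rotate time levels */
            oldval[i] = values[i];
            values[i] = newval[i];
        }
    }

    printf("midpoint amplitude after %d steps: %f\n",
           NSTEPS, values[NPOINTS/2]);
    return 0;
}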