Parallel Architectures and Performance Analysis
Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The
University of Akron.
Parallel computer: multiple-processor system
supporting parallel programming.
Three principal types of architecture:
Vector computers, in particular processor arrays
Shared memory multiprocessors: specially designed and manufactured systems
Distributed memory multicomputers: message-passing systems readily formed from a cluster of workstations
Vector computer: instruction set includes
operations on vectors as well as scalars
Two ways to implement vector computers
Pipelined vector processor (e.g. Cray): streams data
through pipelined arithmetic units
Processor array: many identical, synchronized
arithmetic processing elements
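To make the distinction concrete, consider the kind of loop both designs accelerate. The SAXPY kernel below is an added illustration, not from the original slides:

    /* SAXPY (y = a*x + y): the canonical vector operation.
       A pipelined vector processor streams successive x[i], y[i]
       pairs through one arithmetic pipeline; a processor array
       instead assigns each element to its own processing element. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }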
Natural way to extend the single-processor model
Have multiple processors connected to multiple
memory modules such that each processor can access
any memory module
So-called shared memory configuration:
[Diagrams: the shared memory configuration and Type 1, the centralized (symmetrical, UMA) multiprocessor]
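A minimal sketch of programming such a shared memory configuration, assuming OpenMP as the programming model (the slides do not name one):

    #include <stdio.h>

    int main(void) {
        double sum = 0.0;
        /* All threads read and write the same physical memory;
           the reduction clause guards the shared accumulator. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 1; i <= 1000000; i++)
            sum += 1.0 / i;
        printf("harmonic sum = %f\n", sum);
        return 0;
    }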
Type 2: Distributed Multiprocessor
Distribute primary memory among processors
Increase aggregate memory bandwidth and lower
average memory access time
Allow greater number of processors
Also called non-uniform memory access (NUMA)
multiprocessor
[Diagram: Type 2, the distributed (NUMA) multiprocessor organization]
Multicomputer: complete computers connected through an interconnection network
Distributed memory multiple-CPU computer
Same address on different processors refers to different
physical memory locations
Processors interact through message passing
Commercial multicomputers
Commodity clusters
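A hedged sketch of this message-passing interaction, using MPI (an assumption; any message-passing library would do). Note that &value names different physical memory on each process, so data moves only by explicit messages:

    #include <mpi.h>
    #include <stdio.h>

    /* Run with at least two processes, e.g. mpirun -np 2 ./a.out */
    int main(int argc, char *argv[]) {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* explicit message */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }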
[Diagrams: commodity cluster and asymmetrical vs. symmetrical multicomputer organizations]
Michael Flynn (1966) created a
classification for computer architectures
based upon a variety of characteristics,
specifically instruction streams and data
streams.
Also important are the number of processors, the number of
programs that can be executed, and the memory structure.
[Diagram: SISD organization — a control unit sends control signals to an arithmetic processor; a single instruction stream and a single data stream connect to memory, with results written back]
[Diagram: SIMD organization — one control unit broadcasts a control signal to processing elements PE 1 … PE n, each operating on its own data stream 1 … n]
[Diagram: MISD organization — control units 1 … n issue separate instruction streams 1 … n to processing elements 1 … n, which all operate on a single data stream]
[Diagram: two processes of four stages (S1–S4) each, executed serially and then in pipelined fashion]
Serial execution of two processes with 4 stages each: time to execute T = 8t, where t is the time to execute one stage.
Pipelined execution of the same two processes: T = 5t.
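In general (an added note, not from the slide): with k pipeline stages and n processes,
\[ T_{\text{serial}} = n\,k\,t \qquad\text{versus}\qquad T_{\text{pipelined}} = (k + n - 1)\,t, \]
so k = 4 and n = 2 give 8t versus (4 + 2 - 1)t = 5t, matching the diagram.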
[Diagram: MIMD organization — control units 1 … n issue separate instruction streams 1 … n to processing elements 1 … n, each processing its own data stream 1 … n]
Multiple Program Multiple Data (MPMD) Structure
Within the MIMD classification, with which we are
concerned, each processor has its own program to execute.
Single Program Multiple Data (SPMD) Structure
A single source program is written, and each processor
executes its own copy of this program independently,
not in lockstep with the others.
The source program can be constructed so that parts of
the program are executed by certain computers and
not others depending upon the identity of the
computer.
Software equivalent of SIMD; can perform SIMD
calculations on MIMD hardware.
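A minimal SPMD sketch in C with MPI (the library choice is an assumption; the slides name no specific system). Every processor runs the same program and branches on its own identity:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);               /* every process executes this same program */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* the "identity of the computer" */
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0)
            printf("coordinator: %d processes total\n", size);  /* executed by one computer */
        else
            printf("worker %d reporting\n", rank);              /* executed by the others */
        MPI_Finalize();
        return 0;
    }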
Architectures
Vector computers
Shared memory multiprocessors: tightly coupled
Centralized/symmetrical multiprocessor (SMP): UMA
Distributed multiprocessor: NUMA
Distributed memory/message-passing multicomputers:
loosely coupled
Asymmetrical vs. symmetrical
Flynn’s Taxonomy
SISD, SIMD, MISD, MIMD (MPMD, SPMD)
A sequential algorithm can be evaluated in terms of its
execution time, which can be expressed as a function
of the size of its input.
The execution time of a parallel algorithm depends not
only on the input size of the problem but also on the
architecture of a parallel computer and the number of
available processing elements.
The speedup factor is a measure that captures the
relative benefit of solving a computational problem in
parallel.
The speedup factor of a parallel computation utilizing
p processors is defined as the ratio
\[ S(p) = \frac{t_s}{t_p}, \]
where t_s is the execution time using one processor and
t_p is the execution time using p processors. In other
words, S(p) is defined as the ratio of the sequential
processing time to the parallel processing time.
Speedup factor can also be cast in terms of
computational steps:
\[ S(p) = \frac{\text{number of computational steps using one processor}}{\text{number of parallel computational steps using } p \text{ processors}} \]
Maximum speedup is (usually) p with p processors
(linear speedup).
Given a problem of size n on p processors, let
σ(n) denote the inherently sequential computations,
φ(n) the potentially parallel computations, and
κ(n,p) the communication operations.
Then the speedup satisfies
\[ \psi(n,p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)} \]
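A small numeric sketch of this bound; the function mirrors the formula above, and the workload numbers are invented for illustration:

    #include <stdio.h>

    /* psi(n,p) <= (sigma + phi) / (sigma + phi/p + kappa) */
    double speedup_bound(double sigma, double phi, double kappa, int p) {
        return (sigma + phi) / (sigma + phi / p + kappa);
    }

    int main(void) {
        /* Hypothetical workload: 10 time units sequential, 990
           parallelizable, 5 units of communication on 16 processors. */
        printf("psi <= %.2f\n", speedup_bound(10.0, 990.0, 5.0, 16));
        return 0;
    }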
[Graph: as the number of processors grows, computation time falls while communication time rises, "elbowing out" further gains in speedup]
The efficiency of a parallel computation is defined as a
ratio between the speedup factor and the number of
processing elements in a parallel system:
\[ E = \frac{\text{execution time using one processor}}{p \times \text{execution time using } p \text{ processors}} = \frac{T_s}{p\,T_p} = \frac{S(p)}{p} \]
Efficiency is a measure of the fraction of time for
which a processing element is usefully employed in a
computation.
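A quick worked example (numbers invented): if T_s = 100 s and T_p = 20 s on p = 8 processors, then S(8) = 100/20 = 5 and E = 5/8 = 0.625, i.e. each processor does useful work about 62.5% of the time.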
Since E = S(p)/p, by what we did earlier
\[ E \le \frac{\sigma(n) + \varphi(n)}{p\,\sigma(n) + \varphi(n) + p\,\kappa(n,p)} \]
Since all terms are positive, E > 0.
Furthermore, since the denominator is larger than the
numerator, E < 1. Hence 0 < E < 1.
As before,
\[ \psi(n,p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)} \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p} \]
since the communication time κ(n,p) must be positive, so
dropping it can only enlarge the bound. Let f = σ(n) / (σ(n) + φ(n))
represent the inherently sequential portion of the
computation; then
\[ S(p) \le \frac{1}{f + (1 - f)/p}, \]
which is Amdahl's Law.
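A hedged numeric sketch of Amdahl's Law (the sequential fraction f = 0.05 is invented). Note the bound approaches 1/f = 20 no matter how many processors are added:

    #include <stdio.h>

    /* Amdahl's Law: S(p) <= 1 / (f + (1 - f)/p) */
    double amdahl(double f, int p) {
        return 1.0 / (f + (1.0 - f) / p);
    }

    int main(void) {
        int procs[] = {2, 4, 8, 16, 1024};
        for (int i = 0; i < 5; i++)
            printf("p = %4d  ->  S <= %.2f\n", procs[i], amdahl(0.05, procs[i]));
        return 0;
    }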
Limitations
Ignores communication time
Overestimates speedup achievable
Amdahl Effect
Typically κ(n,p) has lower complexity than φ(n)/p
So as the problem size n increases, φ(n)/p dominates κ(n,p)
Thus as n increases, speedup increases
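A sketch of the Amdahl effect with invented complexity functions (σ(n) = n, φ(n) = n², κ(n,p) = 100n; purely illustrative). For fixed p = 16, the speedup bound climbs toward p as n grows:

    #include <stdio.h>

    double psi(double sigma, double phi, double kappa, int p) {
        return (sigma + phi) / (sigma + phi / p + kappa);
    }

    int main(void) {
        int p = 16;
        for (int n = 100; n <= 100000; n *= 10)
            /* phi(n)/p = n*n/16 eventually swamps kappa = 100n */
            printf("n = %6d  ->  psi <= %.2f\n",
                   n, psi(n, (double)n * n, 100.0 * n, p));
        return 0;
    }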
As before,
\[ \psi(n,p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p} \]
Let s represent the fraction of time spent in the parallel
computation performing inherently sequential
operations:
\[ s = \frac{\sigma(n)}{\sigma(n) + \varphi(n)/p} \]
Then
\[ \psi(n,p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p} = s + p\,(1 - s) = p + (1 - p)\,s, \]
which is the Gustafson-Barsis Law for scaled speedup.
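A matching numeric sketch of the Gustafson-Barsis bound (s = 0.05 is invented):

    #include <stdio.h>

    /* Gustafson-Barsis: scaled speedup <= p + (1 - p)*s, with s the
       sequential fraction of the parallel execution time. */
    double gustafson(double s, int p) {
        return p + (1 - p) * s;
    }

    int main(void) {
        printf("p = 64: scaled speedup <= %.2f\n", gustafson(0.05, 64));
        return 0;
    }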
Begin with parallel execution time instead of
sequential time
Estimate sequential execution time to solve same
problem
Problem size is an increasing function of p
Predicts scaled speedup
Both Amdahl’s Law and Gustafson-Barsis’ Law ignore
communication time
Both overestimate speedup or scaled speedup
achievable
[Photos: Gene Amdahl and John L. Gustafson]
Performance terms: speedup, efficiency
Model of speedup: serial, parallel and communication
components
What prevents linear speedup?
Serial and communication operations
Process start-up
Imbalanced workloads
Architectural limitations
Analyzing parallel performance
Amdahl’s Law
Gustafson-Barsis’ Law
Based on original material from
The University of Akron: Tim O’Neil, Kathy Liszka
Hiram College: Irena Lomonosov
The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
Oregon State University: Michael Quinn
Revision history: last updated 7/28/2011.