
Lecture 5, Part 1
Performance Laws: Speedup and Scalability

Sequential Execution Time

Execution time (Te) = Seconds / Program
                    = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

500 MHz CPU: 500 x 10^6 clock cycles/sec, i.e. 2 ns per cycle.

My program consists of the following operations:
- addition: 500 x 10^6 instructions (1 cycle/inst)
- mul: 180 x 10^6 (3 cycles/inst)
- div: 120 x 10^6 (2 cycles/inst)
- data move: 300 x 10^6 (1 cycle/inst)

Expected execution time:
2 ns x (500x1 + 180x3 + 120x2 + 300x1) x 10^6 = 2 ns x 1580 x 10^6 cycles = 3.16 s

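The arithmetic above can be checked with a short script. This is a minimal sketch, not part of the slides; the clock rate and instruction mix are the ones given in the example.

# Sketch: CPU time = total cycles x seconds per cycle (values from the example above)
CLOCK_HZ = 500e6  # 500 MHz -> 2 ns per cycle

# (instruction count, cycles per instruction) for each operation class
instruction_mix = {
    "add":       (500e6, 1),
    "mul":       (180e6, 3),
    "div":       (120e6, 2),
    "data move": (300e6, 1),
}

total_cycles = sum(count * cpi for count, cpi in instruction_mix.values())
exec_time = total_cycles / CLOCK_HZ  # cycles x (seconds/cycle)

print(f"total cycles   = {total_cycles:.3e}")  # 1.580e+09
print(f"execution time = {exec_time:.2f} s")   # 3.16 s
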
Basic Performance Metrics

- MIPS = (instructions/second) x 10^-6
- MFLOPS = (floating-point ops/second) x 10^-6
- CPI: average cycles per instruction
- Throughput: number of results per second
- Workload (W): number of operations required to complete the program
- Speed: W / Te
- Speedup (S) = Te / Timproved
- Efficiency (using P processors) = Speedup / P

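A minimal sketch (not from the slides) applying these definitions to the instruction mix from the earlier example:

# Sketch: CPI and MIPS for the 500 MHz example program.
CLOCK_HZ = 500e6
counts = {"add": 500e6, "mul": 180e6, "div": 120e6, "data move": 300e6}
cpis   = {"add": 1,     "mul": 3,     "div": 2,     "data move": 1}

instructions = sum(counts.values())                         # 1.10e9 instructions
cycles       = sum(counts[op] * cpis[op] for op in counts)  # 1.58e9 cycles
exec_time    = cycles / CLOCK_HZ                            # 3.16 s

cpi  = cycles / instructions              # average cycles per instruction, about 1.44
mips = (instructions / exec_time) * 1e-6  # about 348 MIPS

print(f"CPI  = {cpi:.2f}")
print(f"MIPS = {mips:.1f}")
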
Part I: Speedup
Speedup

General concepts of speedup in parallel computing:
- How much faster does an application run on a parallel computer?
- What benefits derive from the use of parallelism?

General agreement:
- speedup = serial time / parallel time

Speedup

- Fixed-Workload Speedup (Amdahl's Law)
- Fixed-Time Speedup (Gustafson's Law)
- Fixed-Memory Speedup (Sun and Ni)

Amdahl's Law

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected (ExTime: execution time).

Speedup due to enhancement E:
Speedup(E) = ExTime without E / ExTime with E

The enhanced fraction F now takes F/S of the original time (it shrinks by 1/S); the remaining 1 - F is unchanged.

Amdahl's Law

ExTime_new = ExTime_old x [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Speedup_overall = ExTime_old / ExTime_new
                = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

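A minimal sketch (not part of the slides) of Amdahl's formula as a function; the 10% floating-point example on the next slides is used as a check.

# Sketch: Amdahl's Law as a function.
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of the work is sped up by a given factor."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Example from the next slides: 10% of instructions are FP, FP runs 2x faster.
print(f"{amdahl_speedup(0.10, 2.0):.3f}")  # 1.053
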
Amdahl's Law

Floating-point instructions are improved to run 2x faster, but only 10% of the actual instructions are FP.

ExTime_new = ?
Speedup_overall = ?

Amdahl's Law

Floating-point instructions are improved to run 2x faster, but only 10% of the actual instructions are FP.

ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old
Speedup_overall = 1 / 0.95 = 1.053

Amdahl's Law

- Some applications need real-time response.
- Amdahl's law gives the upper bound of the achievable speedup for a given problem size.
- Limit on the achievable speedup:
  - W: total workload; a fraction α of W must be executed sequentially
  - Sp = W / (αW + (1 - α)(W/P))
       = P / (1 + (P - 1)α) → 1/α as P → ∞

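A minimal sketch (not part of the slides) evaluating the fixed-workload bound; the sequential fraction α = 0.1 is an assumed value chosen for illustration.

# Sketch: fixed-workload (Amdahl) speedup Sp = P / (1 + (P - 1) * alpha),
# which approaches 1/alpha as P grows.
def fixed_workload_speedup(p: int, alpha: float) -> float:
    return p / (1.0 + (p - 1) * alpha)

alpha = 0.1  # assumed sequential fraction, for illustration
for p in (1, 4, 16, 64, 256, 1024):
    print(f"P = {p:5d}  Sp = {fixed_workload_speedup(p, alpha):6.2f}")
# Sp never exceeds 1/alpha = 10, no matter how many processors are used.
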
Fixed-Time Speedup

- Gustafson's Law (1988)
- Scale the problem size along with the increase in machine size, within the same execution time.
- Scaling for higher accuracy.
- Parameters:
  - W: workload done by a single node
  - α: sequential fraction of the workload
  - W' = αW + (1 - α)WP: work done by P nodes in the same time
  - Sp = (αW + (1 - α)WP) / W = α + (1 - α)P
- Speedup is a linear function of P.

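Another small sketch (not part of the slides), contrasting the fixed-time (Gustafson) scaled speedup with the fixed-workload bound above; α = 0.1 is again an assumed value.

# Sketch: fixed-time (Gustafson) scaled speedup Sp = alpha + (1 - alpha) * P,
# which grows linearly with P instead of saturating at 1/alpha.
def fixed_time_speedup(p: int, alpha: float) -> float:
    return alpha + (1.0 - alpha) * p

alpha = 0.1  # assumed sequential fraction, for illustration
for p in (1, 4, 16, 64, 256, 1024):
    print(f"P = {p:5d}  scaled Sp = {fixed_time_speedup(p, alpha):8.1f}")
# e.g. P = 1024 gives about 921.7, versus roughly 9.9 under the fixed-workload bound.
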
Fixed-Memory Speedup

- Sun and Ni's Law: memory-bounded speedup (1993)
- Solve the largest possible problem, limited only by the available memory space.
- As P increases, use up all of the increased memory by scaling the problem size as well.
- See Kai's book (Section 3.6.3).

More on Speedup

Can you measure the sequential time?
- What if the memory is too small?
- Is the sequential machine using the same processor as the one in the parallel machine?
- Can the sequential time be shorter than the parallel time?
- Is speedup really important, or is only the execution time important?

More on Speedup

Speedup larger than P (superlinear)?
- What if disk-swapping effects occur in the sequential execution?
- More caches/memory are available in the parallel machine (P x C, P x M).
- Do both the sequential and parallel machines use the same OS/compiler?
- Is the data distribution/collection time included in the parallel time?
  - Some parallel machines provide parallel I/O!

More on Speedup

What if we use different algorithms for the sequential and parallel solutions?
- Take a look at the parallel sorting assignment. What is the maximum speedup for parallel sorting?

Definitions of Speedup

Diverse definitions of serial and parallel execution times [Sahni:1996]:
- Relative
- Real
- Absolute
- Asymptotic
- Asymptotic relative

Parameters Used in Speedup Definitions

- I = problem instance
- P = number of processors
- Q = parallel program
- n = size of I

Relative Speedup

- Serial time: execution time of the parallel program on a single node of the parallel computer (mpirun -np 1).
- Relative speedup (I, P) = time to solve I using program Q and 1 processor ÷ time to solve I using program Q and P processors.

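A minimal sketch (not part of the slides) of how relative speedup and efficiency could be computed from measured wall-clock times; the timing values below are hypothetical, for illustration only.

# Sketch: relative speedup and efficiency from measured run times
# (e.g. the 1-processor time from "mpirun -np 1" and the P-processor times).
measured = {1: 120.0, 2: 63.0, 4: 34.0, 8: 19.5}  # processors -> seconds (hypothetical)

t1 = measured[1]  # parallel program on a single node
for p, tp in sorted(measured.items()):
    speedup = t1 / tp
    efficiency = speedup / p
    print(f"P = {p}  speedup = {speedup:5.2f}  efficiency = {efficiency:4.2f}")
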
Relative Speedup

- Depends on the characteristics of the instance I being solved, as well as on the size P of the parallel computer.
- Same OS, same node architecture, same compiler.
- Extra overheads in the serial time (distribution/collection).

Real Speedup

- Parallel time vs. the fastest serial algorithm or program running on a single node of the parallel computer.
- Real speedup (I, P) = time to solve I using the best serial program and 1 processor ÷ time to solve I using program Q and P processors.

Problems with Real Speedup

- The fastest algorithm might not be known.
- For some applications, no single algorithm might be fastest on all instances.
- In practice, we use the runtime of the most frequently used sequential algorithm.

Absolute Speedup

- Parallel time vs. the fastest sequential algorithm run on the fastest serial computer.
- Absolute speedup (I, P) = time to solve I using the best serial program and 1 fastest processor ÷ time to solve I using program Q and P processors.

Absolute Speedup

- Can also use the sequential algorithm most often used in practice.
- Time-variant: researchers keep designing new algorithms.
- Speedup could be less than 1.

Asymptotic Real Speedup

- Compares the execution time of the best serial algorithm for a particular problem with the asymptotic complexity of the parallel algorithm.
- Assumes the parallel computer has all the processors it can use.

Asymptotic Real Speedup

Asymptotic real speedup (n) = asymptotic complexity of the best serial algorithm ÷ asymptotic complexity of Q using as many processors as needed.

Asymptotic Real Speedup

- For problems such as sorting, where the asymptotic complexity is not uniquely characterized by the instance size n, the worst-case complexity is used: O(n^2), O(n log n).
- Unbounded number of processors.

Asymptotic Relative Speedup

- Uses the asymptotic time complexity of the parallel algorithm when run on a single processor.
- Asymptotic relative speedup = asymptotic complexity of Q using 1 processor ÷ asymptotic complexity of Q using as many processors as needed.

Asymptotic Relative Speedup

Matrix multiplication (n x n):
- Serial time: O(n^3)
- Parallel time (on an n^3/log n-processor hypercube): O(log n)
- Asymptotic relative speedup: O(n^3 / log n)

The other definitions (Relative, Real, Absolute) are measured speedups.

Part II: Scalability

Scalability

- Algorithmic scalability: the available parallelism increases at least linearly with the problem size.
- Architectural (size) scalability: the architecture continues to yield the same performance per processor as the number of processors is increased and as the problem size is increased.

Architectural Scalability

Performance must grow in all aspects:
- Processor speed (number of issues, depth of pipelining)
- Memory subsystem (cache/memory speed)
- OS support (locks, process synchronization overheads)
- Interconnection network (latency/bandwidth)
- I/O performance

Discussion of Architectural Scalability: Bus-Based SMPs

- Bus (a single set of wires): fixed bandwidth shared by all P processors
  - The bandwidth accessible by each processor decreases as P increases.
- Physical constraints: fixed number of slots.
- Electrical constraints: bus loading and wire length determine the frequency; power.
- Cost scaling: the bus is the core component.
- OS scaling: scheduling, locking, I/O.

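A tiny sketch (not part of the slides) of the first point: with a fixed total bus bandwidth, the share available to each processor shrinks as P grows. The 1.2 GB/s total is borrowed from the SGI example on the next slide.

# Sketch: per-processor share of a fixed shared-bus bandwidth.
BUS_BANDWIDTH_GBPS = 1.2  # total, taken from the SGI example on the next slide

for p in (2, 4, 8, 16, 32):
    per_processor = BUS_BANDWIDTH_GBPS / p  # best case: every processor gets an equal share
    print(f"P = {p:2d}  bandwidth per processor = {per_processor * 1000:6.1f} MB/s")
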
Scalable Interconnection Network

- Bandwidth as a function of P?
- Latency as a function of P?
- Cost as a function of P?
- More bandwidth and smaller latency mean greater cost.
- Network choices: bus? multistage, mesh, torus?

[Diagram: PEs connected by an interconnection network]
Examples: SGI XL: 1.5 M HK$ (1.2 GB/s); ALR SMP: 0.1 M HK$ (533 MB/s)

Bottom Line

- There is NO "truly scalable" machine.
- The goal is to design for a given "range of scale":
  - one (instruction level)
  - two to tens (SMP)
    - Top end: Sun Enterprise 1000: 64 processors; COMPAQ: 64 Alpha processors
  - tens to a few thousands (distributed-memory multicomputers: T3E, SP2, Paragon, ...)
  - tens of thousands (ASCI machines: clusters of SMPs)
- Techniques at one scale may not be cost-effective at another.

Bottom Line

Even though we can meet the engineering requirements to physically scale over the range of interest, we must ensure that the communication and synchronization operations required to support the target programming models also scale and are cost-effective.