Analytical Modeling Of Parallel Programs
Dagoberto A. R. Justo
PPGMAp
UFRGS
7/18/2015

Introduction
• Problem: How can we model the behavior of a parallel program in order to predict its execution time, using the size of the problem, the number of nodes/processors, and the communication network parameters ts and tw?
• Clearly, we must consider both the algorithm and the architecture
Issues
• A serial program's measure is total execution time, which typically depends mainly on the size of its input
• A parallel program raises several new issues:
  • The execution time of the program
  • The speedup relative to the algorithm running serially
  • However, is the speedup measured against the best serial algorithm, or against a serialization of the parallel algorithm used?
Outline Of This Topic
• Sources of overhead in a parallel program
• Performance metrics for parallel systems
• The effect of granularity on performance
• Scalability of parallel systems
• Minimum execution time and minimum cost-optimal execution time
• Asymptotic analysis of parallel programs
• Other scalability metrics
Typical Parallel Execution Profile
[Figure: execution profile of eight processes P0–P7, with time divided into essential/excess computation, interprocessor communication, and idling]
Sources Of Overhead
• A profile illustrates the kinds of activities in a program execution
• Essential computation
  • The same computations performed by a serial program
• Some activities are overheads (time spent not computing directly what is needed), such as:
  • Interprocess interaction
  • Idling
  • Excess computation
  • These are all activities the serial program does not perform
• An efficient parallel program attempts to reduce these overheads to zero, but of course this is not always possible
Interprocess Interaction and Idling
• Interprocess interaction
  • Usually the most significant overhead
  • Sometimes reduced by performing redundant computation
• Idling
  • Caused by:
    • load imbalance
    • synchronization (waiting for collaborating processes to reach the synchronization point)
    • serial computation that cannot be avoided
Excess Computation
• The fastest known serial algorithm may not be easy to parallelize, especially for a large number of processes
  • A different serial algorithm may be necessary, one that is not optimal when implemented serially
  • Such programs may perform more work
  • It may be faster for all processes to compute common intermediate results than to compute them once and broadcast them to all processes that need them
• What we are thus faced with is how to measure the performance of a parallel program, to tell whether the parallel program is worth using
  • How does performance compare with a serial implementation?
  • How does performance scale with adding more processes?
  • How does performance scale with increasing the problem size?
Performance Metrics
• Serial and parallel runtime
  • TS: wall-clock time from start to completion of the serial program (run on the same kind of processor as the parallel program)
  • TP: wall-clock time from the start of the parallel processing to the time the last process completes
  • Parallel cost: pTP
• Total parallel overhead
  • TO = pTP − TS
• Speedup S (how well is the parallel program performing?)
  • The ratio of the execution time of the best serial program to the parallel execution time:
    S = TS / TP
  • S is expected to be near p
  • S = p is called linear speedup
  • S < p most often; generally, 80% of p is very good
  • S > p is called superlinear speedup
Speedup and Efficiency
• Speedup S (how well is the parallel program performing?)
  • The ratio of the execution time of the best serial program to the parallel execution time:
    S = TS / TP
  • S is expected to be near p
  • S = p is called linear speedup
  • S < p most often; generally, 80% of p is very good
  • S > p is called superlinear speedup
• Efficiency E
  • The ratio of the speedup to the number of processors used
  • E = S/p
  • E = 1 is ideal
  • E > 0.8 is generally good
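As a quick illustration of these definitions, here is a minimal Python sketch (my own example, not from the slides; the times and processor count are made up) that computes cost, overhead, speedup, and efficiency from measured wall-clock times:

    def parallel_metrics(t_serial, t_parallel, p):
        """Cost, overhead, speedup, and efficiency from measured times and p."""
        cost = p * t_parallel               # parallel cost: p * TP
        overhead = cost - t_serial          # TO = p*TP - TS
        speedup = t_serial / t_parallel     # S = TS / TP
        efficiency = speedup / p            # E = S / p
        return {"cost": cost, "overhead": overhead,
                "speedup": speedup, "efficiency": efficiency}

    # Example: TS = 150 s, TP = 45 s on p = 4 processors
    print(parallel_metrics(150.0, 45.0, 4))
    # speedup is about 3.33 and efficiency about 0.83 -- above the 80% rule of thumb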
Example: Summing n Numbers in
Parallel With n Processors
• Using input decomposition, place each number on a different processor
• The sum is performed in log n phases
  • Assume n is a power of 2 and that the processors are arranged in a linear array numbered from 0
  • Phase 1:
    • Odd-numbered processors send their value xi to the processor on their left
    • Each even-numbered processor adds the two numbers: Si = xi + xi+1
  • Phase 2:
    • Every second partial-sum-holding processor sends its Si to the processor two positions to its left
    • That processor adds the received partial sum to its own
  • Continuing for log n phases, processor 0 ends up with the sum
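The phases can be checked with a small Python simulation (my own sketch, not part of the slides):

    import math

    def simulate_parallel_sum(values):
        """Simulate the log2(n)-phase summation: in phase k, every processor whose
        index is a multiple of 2**k receives the partial sum held by the processor
        2**(k-1) positions to its right and adds it to its own."""
        n = len(values)
        assert n & (n - 1) == 0, "n must be a power of 2"
        partial = list(values)                   # partial[i] lives on processor i
        for k in range(1, int(math.log2(n)) + 1):
            stride = 2 ** k
            for i in range(0, n, stride):
                partial[i] += partial[i + stride // 2]   # one message + one addition
        return partial[0]

    print(simulate_parallel_sum(list(range(16))))   # 120, accumulated on processor 0
    # For n = 16 this takes log2(16) = 4 phases, as in the figure below.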
A Picture Of The Parallel Sum Process
[Figure: processors 0–15 combining partial sums pairwise over four communication steps.
(a) Initial data distribution and the first communication phase
(b) Second communication step
(c) Third communication step
(d) Fourth communication step
(e) Accumulation of the sum at processor 0 after the final communication]
Modelling Execution Time
• tc: the time to add two numbers
• Serial time: TS = (n − 1) tc = Θ(n)
• Parallel time TP: in each of the log n phases,
  • computation time: tc
  • communication time: ts + tw
  • TP = tc log n + (ts + tw) log n = Θ(log n)
• The speedup is:
  S = TS / TP ≈ (tc n) / ((tc + ts + tw) log n) = (tc / (tc + ts + tw)) (n / log n) = Θ(n / log n)
• The overhead (with p = n) is:
  TO = pTP − TS = n((ts + tw) log n + tc(log n − 1)) + tc = Θ(n log n)
• The overhead is large -- why?
  • Because this parallel algorithm does considerably more total work than the serial algorithm
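A sketch of this timing model in Python (the values of tc, ts, and tw below are arbitrary placeholders, chosen only to show the trend):

    import math

    def sum_n_on_n_model(n, tc=1.0, ts=10.0, tw=2.0):
        """Model of summing n numbers on p = n processors:
        TS = (n-1) tc, TP = (tc + ts + tw) log2 n, TO = n*TP - TS."""
        log_n = math.log2(n)
        t_serial = (n - 1) * tc
        t_parallel = (tc + ts + tw) * log_n
        speedup = t_serial / t_parallel
        overhead = n * t_parallel - t_serial
        return t_serial, t_parallel, speedup, overhead

    for n in (64, 1024, 2**16):
        ts_, tp_, s, to = sum_n_on_n_model(n)
        print(f"n={n:6d}  TS={ts_:8.0f}  TP={tp_:6.1f}  S={s:7.1f}  TO={to:12.0f}")
    # TO grows like n log n, much faster than TS = Theta(n): the overhead dominates.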
Misleading Speedups
• Faster than one deserves:
  • Consider a parallel implementation of bubble sort, called odd-even sort, sorting 10^5 elements on 4 processors; it completes in 40 seconds
  • The serial version of bubble sort takes 150 seconds
  • But serial quicksort on the same set of elements takes 30 seconds
  • The misleading speedup is: TS / TP = 150/40 = 3.75
  • The correct speedup is: TS / TP = 30/40 = 0.75
  • The required comparison is between quicksort, the fastest serial algorithm, and the particular parallel sort algorithm
• But a parallel code can also run faster than expected, for example:
  • Because of cache effects
  • Because of exploratory decomposition
Misleading Speedups Continued
• Cache effects
  • Consider a problem of size W words (large, but small enough to fit in the memory of a single processor)
  • Suppose that on a single processor an 80% cache hit ratio is observed for this problem
  • Suppose the cache latency to the CPU is 2 ns
  • Suppose the DRAM latency is 100 ns
  • The average access time per data item is then 0.8 × 2 + 0.2 × 100 = 21.6 ns
  • Assume the program performs 1 floating-point operation per memory access
  • Then the performance is 1000/21.6 MFLOPS, or 46.3 MFLOPS
Cache Effects Continued
• Consider solving the same problem on two processors by decomposing the data so that two sub-problems of size W/2 are solved
  • The amount of data per processor is smaller, so we might expect the cache hit ratio to be higher, say 90%
  • Of the remaining 10% of accesses, assume 8% go to the processor's own memory and 2% go to the other processor's memory
  • Suppose the latencies are 2 ns, 100 ns, and 400 ns respectively (400 ns for access to the DRAM of the other processor's memory)
  • The average access time per data item is then 0.9 × 2 + 0.08 × 100 + 0.02 × 400 = 17.8 ns per processor
  • Assume, as before, the program performs 1 floating-point operation per memory access
  • Then the performance is 1000/17.8 MFLOPS, or 56.2 MFLOPS per processor, for a total rate of 112.4 MFLOPS (2 processors used)
  • The speedup is then 112.4/46.3 = 2.43, faster than we deserve
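The arithmetic of this cache-effect example can be reproduced with a few lines of Python (hit ratios and latencies are the ones assumed on the slides):

    def avg_access_and_mflops(fractions, latencies_ns):
        """Average access time (ns) over the access classes, and the MFLOPS rate
        assuming 1 floating-point operation per memory access."""
        assert abs(sum(fractions) - 1.0) < 1e-9
        avg_ns = sum(f * t for f, t in zip(fractions, latencies_ns))
        return avg_ns, 1000.0 / avg_ns

    # One processor: 80% cache hits at 2 ns, 20% DRAM at 100 ns
    t1, rate1 = avg_access_and_mflops([0.8, 0.2], [2, 100])              # 21.6 ns, ~46.3 MFLOPS
    # Two processors: 90% cache, 8% local DRAM, 2% remote DRAM at 400 ns
    t2, rate2 = avg_access_and_mflops([0.9, 0.08, 0.02], [2, 100, 400])  # 17.8 ns, ~56.2 MFLOPS
    print(t1, rate1, t2, rate2)
    print("apparent speedup:", 2 * rate2 / rate1)    # ~2.43 on 2 processors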
Faster Solutions Via Exploratory Decomposition
• Suppose the search tree looks like the following graph, with the solution at the rightmost node
  [Figure: a search tree; processing element 0 begins searching at the left subtree, processing element 1 at the right subtree, and the solution is at the rightmost leaf]
• Serial search is depth-first, left-first
• Parallel search uses 2 processors
  • Both run the depth-first, left-first algorithm
• The serial algorithm finds the solution in 14 steps
• The parallel algorithm takes 5 steps
• The speedup is 14/5 = 2.8 > 2 -- a superlinear speedup results
Efficiency
• Efficiency E is defined as: E = S / p
  • S: speedup; p: number of processors
• In essence, it measures the average utilization of all processors
  • E = 1: ideal speedup
  • E > 1: superlinear speedup
  • E > 0.8: very good speedup in practice
• Example: the parallel summation of n numbers with n processors
    E = S / p = Θ(n / log n) / n = Θ(1 / log n)
  • Notice the efficiency decreases with the size of the problem
  • That is, there is no point in using a large number of processors for computations with efficiencies like this
A More Satisfactory Example
• Edge detection of an n × n pixel image
  • Applies a 3 × 3 multiplicative template with summation (a convolution) to each pixel
  • The serial computation time is TS = 9 tc n^2, where tc is the average time for a multiply-add operation
  • The parallel algorithm using p processors divides the image into p column slices with n/p columns per slice
  • The computation for the pixels on each processor is local, except for the left and right edges, which require the pixel values from the edge column of neighboring processes
  • Thus, the parallel time is: TP = 9 tc n^2 / p + 2(ts + n tw)
  • The speedup and efficiency are:
    S = TS / TP = 9 tc n^2 / (9 tc n^2 / p + 2(ts + n tw))
    E = S / p = 1 / (1 + 2 p (ts + n tw) / (9 tc n^2))
  • E increases with increasing n
  • E decreases with increasing p
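A sketch of the edge-detection model in Python (tc, ts, tw below are illustrative placeholders, not measured machine parameters):

    def edge_detect_model(n, p, tc=1.0, ts=50.0, tw=4.0):
        """Speedup/efficiency for the column-sliced 3x3 convolution:
        TS = 9 tc n^2, TP = 9 tc n^2 / p + 2 (ts + n tw)."""
        t_serial = 9 * tc * n**2
        t_parallel = 9 * tc * n**2 / p + 2 * (ts + n * tw)
        speedup = t_serial / t_parallel
        return speedup, speedup / p

    for n in (256, 1024, 4096):
        for p in (4, 16, 64):
            s, e = edge_detect_model(n, p)
            print(f"n={n:5d} p={p:3d}  S={s:6.1f}  E={e:5.2f}")
    # E rises toward 1 as n grows and falls as p grows, as stated above.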
Convolution Of A Pixel Image
[Figure: a pixel image and two different 3 × 3 convolution templates for detecting edges, (-1 0 1; -2 0 2; -1 0 1) and (-1 -2 -1; 0 0 0; 1 2 1), together with the column-wise partitioning of the image amongst processes P0–P3 and the data sharing needed by process 1]
Parallel Cost And Cost-Optimality
• Parallel cost is defined as the number of processors p times the parallel computing time TP
• The term cost-optimal refers to a parallel system with the following property:
  • The parallel system has a cost with the same asymptotic growth as the fastest serial algorithm for the same problem; that is, it has an efficiency of Θ(1)
  • Example: the cost of adding n numbers in parallel (first algorithm) is Θ(n log n)
    • Adding the numbers serially is Θ(n)
    • So this parallel addition algorithm is NOT cost-optimal
• Questions:
  • Can this algorithm, or any algorithm, be made cost-optimal by increasing the granularity? That is, by decreasing p and increasing the work of each processor?
    • Sometimes, but in general no. See the examples on the next few slides
  • Is there a cost-optimal parallel algorithm for summing n numbers?
    • Yes, but the cost-optimal algorithm is different from the first one above (see the third algorithm below)
The Importance Of The Cost-Optimal Metric
• Consider a sorting algorithm that sorts n elements using n processors and is not cost-optimal. What does this mean in practice?
• Scalability performance is very poor. Let's see why
  • Suppose this algorithm takes (log n)^2 time to sort the list
  • The best serial time is known to be n log n
  • The parallel algorithm has a speedup of n/log n and an efficiency of 1/log n
  • The parallel algorithm is an improvement but not cost-optimal, because its efficiency is not Θ(1)
• Now, consider increasing the granularity by decreasing the number of processors from n to p < n
  • The new parallel algorithm will take no more than n(log n)^2/p time
  • Its speedup is (n log n)/(n(log n)^2/p) = p/log n and its efficiency is still 1/log n
  • For example, with 32 processors and n = 1024 or n = 10^6, the speedups are 3.2 and 1.6 respectively -- very poor scalability
Two Different Parallel Summation
Algorithms
• The second sum algorithm uses the idea of increasing the granularity of the previous parallel algorithm, also called scaling down (see the next few slides for n = 16 and p = 4)
  • Instead of using n processors, use p < n and place more of the addends, namely n/p, on each processor
  • Add the corresponding numbers on each processor as with the first algorithm (with communication) until all partial sums are on one processor
  • Then add the partial sums in pairs, using the approach of the first parallel algorithm, but on one processor
• The third sum algorithm is a modification of the above
  • Add up all the numbers on each processor first, using the usual serial algorithm, and then apply the first parallel algorithm to the p partial sums on p processors
Second Algorithm (n=16, p=4) -- First Step
[Figure: first step of the second algorithm for n = 16, p = 4, showing the initial distribution of four addends per processor P0–P3, the four communication substeps, and the data distribution after the last substep]
Second Algorithm (n=16, p=4) -- Second Step
1213
8 9
4 5
0 1
P0
1415
1011
6 7
2 3
P1
P2
1213
8 9
4 5
0 3
P3
P0
Initial distribution before first
substep and communication
1213
8 9
4 7
0 3
P0
1415
1011
P1
P2
Substep 3 distribution before third
substep and communication
P0
P1
P2
P3
Substep 2 distribution before second
substep and communication
1213
811
4 7
0 3
P3
1415
1011
6 7
1415
P1
P2
1215
811
4 7
0 3
P3
Substep 4 distribution before fourth
substep and communication
P0
P1
P2
P3
Final distribution after last substep
Second Algorithm (n=16, p=4) -- 3rd & 4th Step
1215
811
4 7
0 3
P0
P1
P2
P3
1215
811
815
0 7
0 7
P0
Data distribution before first substep
and grouping of operations
P2
P3
Data distribution before second
substep and grouping of operations
815
P0
P1
P0
P1
P2
P1
P2
P3
Final result
Data distribution before first substep
and grouping of operations
P2
P3
Final distribution of data after last
substep
015
0 7
P0
P1
25
P3
Third Algorithm (n=16, p=4) -- A Cost Optimal
Algorithm
[Figure: the third algorithm. Each processor first adds its own four numbers serially; the p = 4 partial sums (0–3, 4–7, 8–11, 12–15) are then combined in log p = 2 communication steps, leaving the final result 0–15 on P0]
The Effect Of Granularity
• Increasing the granularity of a cost-optimal algorithm maintains cost optimality
  • Suppose we have a cost-optimal algorithm using p processors and we increase its granularity by reducing the number of processors to q < p, increasing the work per processor
  • The work per processor increases by a factor p/q
  • The communication per processor should also grow by no more than a factor p/q, provided the mapping is done carefully
  • Thus, the parallel time increases by a factor p/q
  • The parallel cost of the new algorithm is then q·Tp_new = q(p/q)Tp_old = p·Tp_old
  • Thus, increasing the granularity has not changed the cost of a cost-optimal algorithm
• Thus, to produce cost-optimal parallel code from non-cost-optimal parallel code, you may have to do more than increase granularity, but it may help
  • The two new sum algorithms illustrate this point
  • The above argument does NOT show that increasing granularity makes a non-cost-optimal algorithm cost-optimal
Analysis Of The Sum Algorithms
• The serial sum algorithm costs Θ(n)
• The first parallel algorithm costs Θ(n log n)
• The second parallel algorithm:
  • n/p steps, each with log p sub-steps, taking Θ((n/p) log p)
  • Then we add n/p numbers, taking Θ(n/p)
  • Total time: Θ((n/p) log p)
  • Cost: p · Θ((n/p) log p) = Θ(n log p)
  • This is asymptotically higher than the serial algorithm, so it is still not cost-optimal
• The third parallel algorithm:
  • The first step is n/p additions
  • The second step consists of log p sub-steps, each an addition and a communication
  • Thus, the time is Θ(n/p + log p) and the cost is Θ(n + p log p)
  • As long as p is not too large, i.e. n = Ω(p log p), the cost is Θ(n), which means this algorithm is cost-optimal
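The three cost models can be compared numerically; a short Python sketch (unit add/communication times, purely illustrative):

    import math

    def costs(n, p):
        """Cost (p * TP) models: alg1 uses n processors, alg2 is the scaled-down
        simulation, alg3 does local serial sums followed by a log p tree."""
        cost1 = n * math.log2(n)                 # p = n,  TP ~ log n
        cost2 = p * ((n / p) * math.log2(p))     # TP ~ (n/p) log p
        cost3 = p * (n / p + math.log2(p))       # TP ~ n/p + log p
        return cost1, cost2, cost3

    n = 2**20
    for p in (16, 256, 4096):
        c1, c2, c3 = costs(n, p)
        print(f"p={p:5d}  cost1={c1:12.0f}  cost2={c2:12.0f}  cost3={c3:12.0f}  serial={n}")
    # Only the third algorithm's cost stays within a constant factor of the serial
    # work n (while p log p remains small relative to n), matching the analysis above.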
Scalability Of Parallel Systems
• We typically develop parallel programs from small test cases
  • It is very difficult to predict scalability (performance for large problems) from small test cases, unless you have done the analysis first
  • We now study some tools to help in the prediction process
  • See the FFT case study, based on observation of performance in the small and its relation to performance for large problem sizes
• Topics:
  • Scaling characteristics of parallel programs
  • Isoefficiency metric of scalability
  • Problem size
  • Isoefficiency function
  • Cost optimality and the isoefficiency function
  • A lower bound on the isoefficiency function
  • Degree of concurrency and the isoefficiency function
FFT Case Study
• Three algorithms for performing FFTs
  • Algorithms described in detail in Chapter 13
  • Binary exchange
  • 2-D transpose
  • 3-D transpose
• Speedup data is given for 64 processors, for the FFT size n varying from 1 to 18K elements
  • For small n (up to 7000 or so), 3-D transpose and binary exchange are best -- it takes a lot of testing to see this
  • For large n, 2-D transpose outperforms the others and continues to be faster for n > 14000 -- can you believe this remains true for even larger n?
  • Not unless you have done the analysis to support the conjectured asymptotic behavior
Scaling Characteristics Of Parallel Programs
• The efficiency is: E = S/p = TS/(pTP)
• Using the expression involving the overhead TO (defined earlier):
    E = 1 / (1 + TO/TS)
• Unfortunately, the overhead is at least linear in p unless the algorithm is completely parallel
  • Say the parallel algorithm has a serial portion with time Tserial
  • Then all but one of the processors are idle during the time one processor performs this serial computation
  • Thus, the overhead is at least (p − 1)Tserial
  • Therefore, the efficiency is bounded above by:
    E ≤ 1 / (1 + (p − 1)Tserial / TS)
Scaling Characteristics Continued
• From this expression for E (previous slide):
  • The efficiency E decreases with the number of processors for a given problem size
  • The efficiency E increases with larger problems (TS increases)
• Consider the cost-optimal summation algorithm
  • For this algorithm (assuming unit time for an addition and for a communication):
    TP = n/p + 2 log p
    S = n / (n/p + 2 log p)
    E = 1 / (1 + 2 p log p / n)
  • n/p is the time for adding n/p items
  • 2 log p is the time for the additions and communications of phase 2
  • See the disappointing results for large p on the next slide
  • See how an efficiency level of, say, 80% can be maintained by increasing n for each p
Speedup Curves
Speedup Of Cost-Optimal Addition Algorithm
[Figure: plots of S versus p (1 to 32, log scale) for n = 64, 192, 320, and 512, together with the linear-speedup line]
• The plots show S = n/(n/p + 2 log p) for the cost-optimal addition algorithm as p and n vary
Efficiency Tables -- A Table Of Values Of E For Different p And n

  n     p=1    p=4    p=8    p=16   p=32
  64    1.0    0.80   0.57   0.33   0.17
  192   1.0    0.92   0.80   0.60   0.38
  320   1.0    0.95   0.87   0.71   0.50
  512   1.0    0.97   0.91   0.80   0.62

• The function of p giving the work (n) that keeps the efficiency fixed as p increases is the isoefficiency function -- the 0.80 entries lie along one such curve (80% efficiency)
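The table can be regenerated from the efficiency model E = 1/(1 + 2 p log p / n); a short Python sketch:

    import math

    def efficiency(n, p):
        """E for the cost-optimal addition model with unit add/communication time."""
        return 1.0 if p == 1 else 1.0 / (1.0 + 2.0 * p * math.log2(p) / n)

    ps = [1, 4, 8, 16, 32]
    print("   n  " + "".join(f"p={p:<6}" for p in ps))
    for n in (64, 192, 320, 512):
        print(f"{n:4d}  " + "".join(f"{efficiency(n, p):<8.2f}" for p in ps))
    # Reproduces the table above; E = 0.80 falls at (n, p) = (64, 4), (192, 8), (512, 16).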
Scalability
• Overhead varies with the serial time (the amount of work) and with the number of processors
  • Clearly, overhead (communication) typically increases with the number of processors
  • It often also increases with the amount of work to be done, usually indicated by the sequential time TS
  • However, as the problem size increases, overhead usually increases sublinearly as a percentage of the work
• This means that the efficiency increases with the problem size, even when the number of processors is fixed
  • For example, look at the columns of the last table
• Also, an efficiency level can be maintained by increasing both the number of processors p and the amount of work
• A parallel system that is able to maintain a specific efficiency in this manner is called a scalable parallel system
  • Scalability is a measure of a system's capacity to increase speedup in proportion to the number of processing elements
Scalability And Cost Optimality
• Recall: cost-optimal algorithms have an efficiency of Θ(1)
  • Scalable parallel systems can always be made cost-optimal
  • Cost-optimal algorithms are scalable
• Example: the cost-optimal algorithm for adding n numbers
  • Its efficiency is: E = 1/(1 + 2p(log p)/n)
  • Setting E equal to a constant, say K, means that n and p must vary together as n = 2(K/(1−K)) p log p
  • Thus, for any p, the size of n can be selected to maintain efficiency K
  • For example, for K = 80% and 32 processors, a problem size of n = 1280 must be used
  • (Recall: this efficiency formula assumed that adding 2 numbers and communicating 1 number each take unit time -- not a very realistic assumption on current hardware)
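A small check of this relation (Python, same unit-time model):

    import math

    def n_for_efficiency(K, p):
        """Problem size holding E = K for the cost-optimal addition:
        n = 2 (K / (1 - K)) p log2 p."""
        return 2 * (K / (1 - K)) * p * math.log2(p)

    def efficiency(n, p):
        return 1.0 / (1.0 + 2.0 * p * math.log2(p) / n)

    for p in (4, 8, 16, 32, 64):
        n = n_for_efficiency(0.8, p)
        print(f"p={p:3d}  n={n:7.0f}  E check={efficiency(n, p):.2f}")
    # For p = 32 this gives n = 1280, and plugging it back in returns E = 0.80.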
Isoefficiency Metric Of Scalability
• Two observations:
  • Efficiency always decreases with an increase in the number of processors, approaching 0 for a large number of processors
  • Efficiency often increases with the amount of work or the size of the problem, approaching 1
    • It can sometimes then decrease for very large work sizes (running out of memory, for example)
• A scalable system is one for which the efficiency can be held constant by increasing both the work and the number of processors
• The rate at which the work or problem size must increase with respect to the number of processors to maintain a fixed efficiency is called the degree of scalability of the parallel system
  • This definition depends upon a clear definition of problem size
• Once problem size is defined, the function determining the problem size required for a varying number of processors at fixed efficiency is called the isoefficiency function
Problem Size
• Define problem size as the number of basic operations: arithmetic operations, data stores and loads, etc.
  • It needs to be a measure such that if the problem size is doubled, the computation time is doubled
  • Measures such as the size (order) of a matrix are misleading:
    • Doubling the order of a matrix causes a computation dominated by matrix-matrix multiplication to increase by a factor of 8
    • Doubling the order of a matrix causes a computation dominated by matrix-vector multiplication to increase by a factor of 4
    • Doubling the length of vectors causes a computation dominated by vector dot products to increase by a factor of 2
  • In the formulas that follow, the basic arithmetic operation is assumed to take 1 unit of time
    • Thus, the problem size W is the same as the serial time TS of the fastest known serial algorithm
The Isoefficiency Function -- The General Case
• The parallel execution time TP is a function of:
  • the problem size W,
  • the overhead function TO, and
  • the number p of processors
• Let's fix the efficiency at E and solve for W in terms of p and the fixed efficiency
  • Let K = E/(1 − E); starting from
    TP = (W + TO(W, p)) / p
    S = W / TP = W p / (W + TO(W, p))
    E = S / p = W / (W + TO(W, p)) = 1 / (1 + TO(W, p)/W)
Development Continued
• Now solving for W gives:
    E (1 + TO(W, p)/W) = 1
    1 + TO(W, p)/W = 1/E
    TO(W, p)/W = 1/E − 1 = (1 − E)/E
    W = (E/(1 − E)) TO(W, p) = K TO(W, p)
• In the above equation, K is a constant
  • For each choice of W, we can solve for p
    • This can usually be done algebraically
  • Or, for each choice of p, we solve for W, possibly a non-linear equation that has to be solved numerically or approximated somehow
• The resulting function giving W in terms of p is called the isoefficiency function
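In practice the equation W = K·TO(W, p) can be solved numerically. A sketch using simple fixed-point iteration (Python; the overhead function here is the hypothetical TO = p^(3/2) + p^(3/4) W^(3/4) used in the second example on the following slides):

    def isoefficiency_W(overhead, p, K, w0=1.0, iters=200):
        """Fixed-point iteration for W = K * TO(W, p); K = E/(1-E)."""
        W = w0
        for _ in range(iters):
            W = K * overhead(W, p)
        return W

    def to_example(W, p):
        # Hypothetical overhead function: TO = p^(3/2) + p^(3/4) * W^(3/4)
        return p**1.5 + p**0.75 * W**0.75

    K = 0.8 / (1 - 0.8)                 # target efficiency E = 0.8
    for p in (4, 16, 64, 256):
        W = isoefficiency_W(to_example, p, K)
        print(f"p={p:4d}  W={W:14.1f}  W/p^3={W / p**3:8.2f}")
    # W/p^3 levels off as p grows, consistent with the Theta(p^3) isoefficiency
    # derived for this overhead function on the following slides.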
Analysis Of The Isoefficiency Function
• Suppose this function has the property that small changes in p give small changes in W
  • Then the system is highly scalable
• Suppose instead that small changes in p result in large changes in W
  • Then the system is not very scalable
• The isoefficiency function may be difficult to find
  • You may be able to solve the above equation only for specific finite values, and not in general
• The isoefficiency function may not exist
  • Such systems are not scalable
Two Examples
• Consider the cost-optimal addition algorithm
  • The overhead function is TO = 2p log p
  • Thus, the isoefficiency function (from the general derivation above) is W = 2K p log p
  • That is, if the number of processors is doubled, the size of the problem must be increased by a factor of 2(1 + log p)/log p
  • In this example, the overhead is a function only of the number of processors and not of the work W
    • This is unusual
• Suppose instead we had a parallel system with the following overhead function:
    TO = p^(3/2) + p^(3/4) W^(3/4)
  • The equation to solve for W in terms of p is: W = K p^(3/2) + K p^(3/4) W^(3/4)
  • For fixed K, this reduces to a quartic equation (degree 4 in W^(1/4)) with multiple roots
  • Although this case can be solved analytically, in general such equations cannot be solved analytically
  • See the next slide for an approximate solution that provides the relevant asymptotic growth
Second Example Continued -- Finding An Approximate Solution
• The function W is the sum of two positive terms that increase with increasing p
  • Let's take each term separately, find the growth rate consistent with each, and take the maximum growth rate as the asymptotic growth rate
  • For just the first term, W = K p^(3/2), so W = Θ(p^(3/2))
  • For just the second term, W = K p^(3/4) W^(3/4)
    • Solving for W gives W = Θ(p^3)
  • The faster growth rate is Θ(p^3)
• Thus, for this parallel system, the work must grow like p^3 in order to maintain a constant efficiency
  • That is, the isoefficiency function is Θ(p^3)
Your Project
• Perform the corresponding analysis for your implementation and determine the isoefficiency function, if it exists
• It is this analysis that allows you to test the performance of your parallel code on a few test cases and then predict its performance in the large
  • The test cases have to be selected so that your results do not reflect start-up conditions or small-case effects
Cost-Optimality And The Isoefficiency Function
• Consider a cost-optimal parallel algorithm
  • An algorithm is cost-optimal iff the efficiency is Θ(1)
  • That is, E = S/p = W/(pTP) = Θ(1), so pTP = Θ(W)
  • But pTP = W + TO(W, p), so that TO(W, p) = O(W)
  • Thus, W = Ω(TO(W, p))
• Thus, an algorithm is cost-optimal iff its overhead does not exceed its problem size asymptotically
• In addition, if there exists an isoefficiency function f(p), then the relation W = Ω(f(p)) must be satisfied in order to ensure the cost-optimality of the parallel system
A Lower Bound On The Isoefficiency Function
• We desire the smallest possible isoefficiency function (recall the degree of scalability discussed above)
• How small can the isoefficiency function be?
  • The smallest possible function is Θ(p)
  • Argument:
    • For W units of work, at most W processors can be used; processors in excess of W will be idle -- there is no work for them to do
    • For a problem size growing more slowly than Θ(p), if the number of processors grows like order p, then eventually there are more processors than work
    • Thus, such a system will not be scalable
  • Thus, the ideal isoefficiency function is Θ(p)
    • This is hard to achieve -- the cost-optimal add algorithm has an isoefficiency function of Θ(p log p)
The Degree Of Concurrency And The Isoefficiency Function
• The degree of concurrency C(W) is the maximum number of tasks that can be executed concurrently by a computation with work W
• This degree of concurrency limits the isoefficiency function. That is:
  • For a problem size W with degree of concurrency C(W), at most C(W) processors can be used effectively
  • Thus, the isoefficiency function can be no better (no smaller) than the growth of W required to keep C(W) = Θ(p)
An Example
• Consider solving Ax = b (a linear system of order n) via Gaussian elimination
  • The total computation takes Θ(n^3) time, so W = Θ(n^3)
  • We eliminate one variable at a time, in a serial fashion, each elimination taking Θ(n^2) time
  • Thus, at most n^2 processing elements can be used
  • Thus, the degree of concurrency is Θ(W^(2/3))
  • Thus, from p = Θ(W^(2/3)) we get W = Θ(p^(3/2)), which is the isoefficiency function (due to concurrency)
• Thus, this algorithm cannot reach the ideal or optimal isoefficiency function Θ(p)
Minimum Execution Time And Minimum Cost-Optimal Execution Time
• The parallel processing time TP often decreases as the number p of processors increases, until either TP approaches a minimum asymptotically or begins to increase
  • The question now is: what is that minimum, and is it useful to know?
  • We can find this minimum by taking the derivative of the parallel time with respect to the number of processors, setting the derivative to 0, and solving for the p that satisfies the resulting equation
  • Let p0 be the value of p for which the minimum is attained
  • Let TPmin be the minimum parallel time
  • Let's do this for the parallel summation system we have been working with
An Example
• Consider the cost-optimal algorithm for adding n numbers
  • Its parallel time (derived earlier) is: TP = n/p + 2 log p
  • Setting the first derivative to 0 gives: −n/p^2 + 2/p = 0
  • The solution is: p0 = n/2
  • The minimum parallel time is: TPmin = 2 log n
  • The processor-time product (cost) at this point is p0·TPmin = Θ(n log n)
  • This is larger than the serial time, which is Θ(n)
• Thus, at this minimum time, the problem is not being solved cost-optimally (the cost is larger than the serial time)
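A numerical check of this minimum (Python, brute force over integer p; the exact location of the discrete minimum depends slightly on the base of the logarithm used when differentiating, but the minimum value is essentially 2 log n either way):

    import math

    def tp(n, p):
        """Parallel time of the cost-optimal addition: TP = n/p + 2 log2 p."""
        return n / p + 2 * math.log2(p)

    n = 1024
    best_p = min(range(1, n + 1), key=lambda p: tp(n, p))
    print(best_p, tp(n, best_p))              # brute-force minimum of the model
    print(n // 2, tp(n, n // 2))              # p = n/2 gives TP = 2 log2 n = 20, within ~1%
    print((n // 2) * tp(n, n // 2), n - 1)    # cost ~ n log n there, far above TS = n - 1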
The Cost-Optimal Minimum Time, TPcost_opt
• Let's characterize and find the minimum time when the computation is performed cost-optimally
  • Cost-optimality can be related to the isoefficiency function and vice versa (see the cost-optimality result above)
  • If the isoefficiency function of a parallel system is Θ(f(p)), then a problem of size W can be solved cost-optimally iff W = Ω(f(p))
  • That is, a cost-optimal solution requires p = O(f^(-1)(W))
  • The parallel run-time is TP = Θ(W/p) (because pTP = Θ(W))
  • Thus, a lower bound on the parallel runtime for solving a problem of size W cost-optimally is:
    TPcost_opt = Ω(W / f^(-1)(W))
The Example Continued
• Estimating TPcost_opt for the cost-optimal addition algorithm:
  • After some algebra, we get:
    TPcost_opt = 2 log n − log log n
• Notice that TPmin and TPcost_opt are asymptotically the same, that is, both are Θ(log n)
  • This is typical for most systems
  • It is not true in general, however, and we can have the situation that TPcost_opt > Θ(TPmin)
An Example Of TPcost_opt > Θ(TPmin)
• Consider the hypothetical system introduced earlier, with:
    TO = p^(3/2) + p^(3/4) W^(3/4)
  • The parallel runtime is: TP = (W + TO)/p = W/p + p^(1/2) + W^(3/4)/p^(1/4)
  • Taking the derivative to find TPmin gives: p0 = Θ(W)
  • Substituting back in gives: TPmin = Θ(W^(1/2))
• From the earlier approximate analysis, the isoefficiency function is W = Θ(p^3) = f(p)
  • Thus, p = f^(-1)(W) = Θ(W^(1/3))
  • Substituting into the expression for TPcost_opt above gives TPcost_opt = Θ(W^(2/3))
• Thus, TPcost_opt > Θ(TPmin)
  • This does happen, though not often
Limitation By Degree Of Concurrency C(W)
• Beware:
  • The study of asymptotic behavior is valuable and interesting, but increasing p without bound is unrealistic
  • For example, a p0 larger than C(W) is meaningless
  • For such cases, TPmin is:
    TPmin = (W + TO(W, C(W))) / C(W)
• Needless to say, for problems where W grows without bound, C(W) may also grow without bound, so that considering large p is reasonable
Asymptotic Analysis Of Parallel Programs
A Table For 4 Parallel Sort Programs For n Numbers

  Algorithm   A1           A2         A3               A4
  p           n^2          log n      n                √n
  TP          1            n          √n               √n log n
  S           n log n      log n      √n log n         √n
  E           (log n)/n    1          (log n)/√n       1
  pTP         n^2          n log n    n^1.5            n log n

• Recall: the best serial time is n log n
• Question: which is the best?
Comments On The Table
• Comparison by speed TP:
  • A1 is best, followed by A3, A4, and A2
  • But A1 is not practical for large n: it requires n^2 processors
• Comparison by efficiency E:
  • A2 and A4 are best, followed by A3 and A1
• Looking at the costs pTP:
  • A2 and A4 are cost-optimal, whereas A3 and A1 are not
• Overall, A2 is the best if using the fewest processors is important
• Overall, A4 is the best if the least parallel time is important
Other Scalability Metrics
• Other metrics have been developed to handle less general cases
• For example:
  • Metrics that deal with problems that must be solved in a specified time -- real-time problems
  • Metrics that deal with the fact that memory may be the limiting factor, and scaling of the number of processors may be necessary not for increased performance but because of increased memory
    • That is, memory scales linearly with the number of processors p
• Scaled speedup
• Serial fraction
Scaled Speedup
• Analyze the speedup while increasing the problem size linearly with the number of processors
  • This analysis can be done by constraining either time or memory
  • To see this, consider the following two examples:
    • a parallel algorithm for matrix-vector products
    • a parallel algorithm for matrix-matrix products
Scaled Speedup For Matrix-Vector Products
• The serial time TS for a matrix-vector product with a matrix of size n × n is: TS = tc n^2
  • where tc is the time for a multiply-add operation
• Suppose the parallel time TP for a simple parallel algorithm (Section 8.2.1) is:
    TP = tc n^2/p + ts log p + tw n
• Then the speedup is:
    S = tc n^2 / (tc n^2/p + ts log p + tw n)
Scaled Speedup For Matrix-Vector Products Continued
• Consider a memory scaling constraint
  • Require the memory to scale as Θ(p)
  • But the memory requirement for the matrix is Θ(n^2)
  • Therefore n^2 = Θ(p), or n^2 = c·p
  • Substituting into the speedup formula, we get:
    S' = tc c p / (tc c + ts log p + tw √(cp)) = c1 p / (c2 + c3 log p + c4 √p)
• Thus, the scaled speedup with a memory constraint is Θ(√p), i.e., sublinear
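A sketch of the memory-constrained scaled speedup for the matrix-vector product (Python; tc, ts, tw and the memory constant c are illustrative placeholders):

    import math

    def scaled_speedup_mv(p, c=1e6, tc=1.0, ts=100.0, tw=10.0):
        """Memory-constrained scaling of the 1-D matrix-vector algorithm:
        memory grows as Theta(p), so n^2 = c * p."""
        n = math.sqrt(c * p)
        t_serial = tc * n**2
        t_parallel = tc * n**2 / p + ts * math.log2(p) + tw * n
        return t_serial / t_parallel

    for exp in (10, 14, 18, 22):
        p = 2**exp
        s = scaled_speedup_mv(p)
        print(f"p=2^{exp:2d}  S'={s:10.1f}  S'/sqrt(p)={s / math.sqrt(p):7.1f}")
    # S'/sqrt(p) levels off (toward tc*sqrt(c)/tw = 100 with these numbers),
    # so the scaled speedup is Theta(sqrt(p)) -- sublinear, as argued above.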
Scaled Speedup For Matrix-Vector Products Continued
• Consider a time scaling constraint
  • Require the time to remain constant as the number of processors increases
  • But the parallel time is dominated by Θ(n^2/p)
  • Therefore n^2/p = c, or n^2 = c·p
  • This is the same requirement as in the memory-constrained case, so the same scaled speedup results
• Thus, the scaled speedup with a time constraint is also Θ(√p)
Scaled Speedup For Matrix-Matrix Products
• The serial time TS for a matrix-matrix product with matrices of size n × n is: TS = tc n^3
  • where tc is the time for a multiply-add operation
• Suppose the parallel time TP for a simple parallel algorithm (Section 8.2.1) is:
    TP = tc n^3/p + ts log p + 2 tw n^2/√p
• Then the speedup is:
    S = tc n^3 / (tc n^3/p + ts log p + 2 tw n^2/√p)
Scaled Speedup For Matrix-Matrix Products Continued
• Consider a memory scaling constraint
  • Require the memory to scale as Θ(p)
  • But the memory requirement for the matrices is Θ(n^2)
  • Therefore n^2 = Θ(p), or n^2 = c·p
  • Substituting into the speedup formula, we get:
    S' = tc (cp)^(3/2) / (tc (cp)^(3/2)/p + ts log p + 2 tw (cp)/√p)
       = c1 p^(3/2) / (c2 √p + c3 log p) = O(p)
• Thus, the scaled speedup with a memory constraint is Θ(p), that is, linear
Scaled Speedup For Matrix-Matrix Products Continued
• Consider a time scaling constraint
  • Require the time to remain constant as the number of processors increases
  • But the parallel time is dominated by Θ(n^3/p)
  • Therefore n^3/p = c, or n^3 = c·p
  • Substituting into the speedup formula, we get:
    S'' = tc c p / (tc c + ts log p + 2 tw (cp)^(2/3)/√p)
        = c1 p / (c2 + c3 log p + c4 p^(1/6)) = O(p^(5/6))
• Thus, the scaled speedup with a time constraint is Θ(p^(5/6)), that is, sublinear
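A companion sketch for the matrix-matrix product under both scaling constraints (Python; parameter values are again placeholders):

    import math

    def speedup_mm(n, p, tc=1.0, ts=100.0, tw=10.0):
        """Speedup model of the simple parallel matrix-matrix product."""
        t_serial = tc * n**3
        t_parallel = tc * n**3 / p + ts * math.log2(p) + 2 * tw * n**2 / math.sqrt(p)
        return t_serial / t_parallel

    c = 1000.0
    for exp in (6, 10, 14, 18):
        p = 2**exp
        n_mem = math.sqrt(c * p)          # memory constrained: n^2 = c * p
        n_time = (c * p) ** (1.0 / 3.0)   # time constrained:   n^3 = c * p
        s_mem, s_time = speedup_mm(n_mem, p), speedup_mm(n_time, p)
        print(f"p={p:7d}  S'_mem/p={s_mem / p:6.3f}  S''_time/p^(5/6)={s_time / p**(5/6):6.3f}")
    # S'_mem grows linearly in p (its ratio to p is essentially flat), while
    # S''_time grows only like p^(5/6), matching the two results above.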
Serial Fraction
• Used, as with the other measures, to indicate the nature of the scalability of a parallel algorithm
• What is it?
  • Assume the work W can be broken into two parts:
    • The part that is totally serial, denoted Tser
      • We assume this includes all the interaction time
    • The part that is totally parallel, denoted Tpar
  • Then the work is W = Tser + Tpar
  • Define the serial fraction as f = Tser / W
• Now we seek an expression for f in terms of p and S, in order to study how f changes with p
Serial Fraction Continued
• From the definition of TP:
    TP = Tser + Tpar/p = Tser + (W − Tser)/p = fW + (W − fW)/p
• Using the relation S = W/TP and solving for f gives a formula for f in terms of S and p:
    TP/W = 1/S = f + (1 − f)/p
    f = (1/S − 1/p) / (1 − 1/p) = (p/S − 1) / (p − 1)
• It is not clear from this how f varies with p
• If f increases with increasing p, the system is considered scalable
• Let's look at what this formula tells us for the matrix-vector product
Serial Fraction Example
• For the matrix-vector product:
    f = (p/S − 1) / (p − 1) = ((p ts log p + tw n p) / (tc n^2)) / (p − 1) ≈ (ts log p + tw n) / (tc n^2)
• This indicates that the serial fraction f grows with increasing p, and so the parallel algorithm is considered scalable
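A small sketch computing the serial fraction from the matrix-vector model (Python; tc, ts, tw are illustrative):

    import math

    def serial_fraction(p, speedup):
        """Experimentally determined serial fraction: f = (p/S - 1) / (p - 1)."""
        return (p / speedup - 1.0) / (p - 1.0)

    def speedup_mv(n, p, tc=1.0, ts=100.0, tw=10.0):
        """Speedup model of the 1-D matrix-vector product used on these slides."""
        return (tc * n**2) / (tc * n**2 / p + ts * math.log2(p) + tw * n)

    n = 4096
    for p in (2, 8, 32, 128):
        f = serial_fraction(p, speedup_mv(n, p))
        approx = (100.0 * math.log2(p) + 10.0 * n) / (1.0 * n**2)   # (ts log p + tw n)/(tc n^2)
        print(f"p={p:4d}  f={f:.6f}  approx={approx:.6f}")
    # f agrees with the closed form (ts log p + tw n)/(tc n^2) up to a factor
    # p/(p-1) that tends to 1, and it grows only slowly with p for fixed n.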
67