CSE 8383 - Advanced Computer Architecture
Week-5
Week of Feb 9, 2004
engr.smu.edu/~rewini/8383
Contents
Project/Schedule
Introduction to Multiprocessors
Parallelism
Performance
PRAM Model
…
Warm Up
Parallel Numerical Integration (a sketch of this case follows below)
Parallel Matrix Multiplication
In class: Discuss with your neighbor!
Videotape: Think about it!
What kind of architecture do we need?
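As a warm-up illustration (added here, not part of the original slides), a minimal Python sketch of parallel numerical integration, assuming the midpoint rule for f(x) = x^2 on [0, 1] and the standard multiprocessing module; each worker integrates one sub-interval and the partial results are summed.

from multiprocessing import Pool

def f(x):
    # Integrand chosen only for illustration.
    return x * x

def integrate_chunk(args):
    a, b, steps = args
    h = (b - a) / steps
    # Midpoint rule on this sub-interval.
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

if __name__ == "__main__":
    a, b, workers, steps_per_chunk = 0.0, 1.0, 4, 250_000
    width = (b - a) / workers
    chunks = [(a + k * width, a + (k + 1) * width, steps_per_chunk)
              for k in range(workers)]
    with Pool(workers) as pool:
        total = sum(pool.map(integrate_chunk, chunks))
    print(total)   # ~0.3333333, the integral of x^2 over [0, 1]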
Explicit vs. Implicit Parallelism
[Figure: a sequential program is converted by a parallelizer (implicit parallelism), while a parallel program is written directly in a programming environment (explicit parallelism); both target a parallel architecture.]
Motivation
One-processor systems are not capable of delivering solutions to some problems in reasonable time
Multiple processors cooperate to jointly execute a single computational task in order to speed up its execution
Speed-up versus Quality-up
Multiprocessing
[Figure: physical limitations push from one-processor systems to multiprocessor systems.]
Speed-up: N processors cooperate to solve a single computational task
Quality-up: sharing
Flynn’s Classification - revisited
SISD (single instruction stream over a single data stream)
SIMD (single instruction stream over multiple data streams)
MIMD (multiple instruction streams over multiple data streams)
MISD (multiple instruction streams and a single data stream)
SISD (single instruction stream over a single data stream)
[Figure: SISD uniprocessor architecture. I/O and the CU exchange the instruction stream (IS); the CU drives the PU, which exchanges a data stream (DS) with the MU.]
Captions:
CU = control unit
PU = processing unit
MU = memory unit
IS = instruction stream
DS = data stream
PE = processing element
LM = local memory
SIMD (single instruction stream over multiple data streams)
[Figure: SIMD architecture. The program is loaded from the host into the CU, which broadcasts one instruction stream (IS) to processing elements PE1 … PEn; each PE exchanges a data stream (DS) with its local memory LM1 … LMn, and the data sets are loaded from the host.]
MIMD (multiple instruction streams over multiple data streams)
[Figure: MIMD architecture (with shared memory). Control units CU1 … CUn each issue their own instruction stream (IS) to processing units PU1 … PUn, which exchange data streams (DS) with a shared memory; I/O attaches to the control units.]
MISD (multiple instruction streams and a single data stream)
[Figure: MISD architecture (the systolic array). A memory holding program and data supplies instruction streams (IS) to control units CU1, CU2, …, CUn; the corresponding processing units PU1, PU2, …, PUn pass a single data stream (DS) from one to the next, with I/O at the ends.]
System Components
Three major components:
Processors
Memory Modules
Interconnection Network
Memory Access
Shared Memory
[Figure: processors P share a single memory M.]
Distributed Memory
[Figure: each processor P has its own memory M; processors communicate through an interconnection network.]
Taxonomy
[Figure: taxonomy of interconnection networks. Static networks: 1-D, 2-D, HC (hypercube). Dynamic networks: bus-based (single bus, multiple buses) and switch-based (single-stage SS, multistage MS, crossbar).]
MIMD Shared Memory Systems
[Figure: processors P connected to memory modules M through interconnection networks.]
Shared Memory
Single address space
Communication via read & write
Synchronization via locks
Bus-Based & Switch-Based SM Systems
[Figure: left, a bus-based system: processors P with caches C share a global memory over a bus. Right, a switch-based system: processors P with caches C connect to memory modules M through a switch.]
Cache Coherent NUMA
[Figure: CC-NUMA. Each node has a processor P, cache C, and local memory M; the nodes are connected by an interconnection network.]
MIMD Distributed Memory Systems
[Figure: processor-memory pairs (P, M) connected by interconnection networks.]
Distributed Memory
Multiple address spaces
Communication via send & receive
Synchronization via messages
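To make the send/receive style concrete, here is a minimal Python sketch (added, not from the slides) that uses multiprocessing.Pipe as a stand-in for message passing between two address spaces; a real distributed-memory system would use a message-passing library such as MPI.

from multiprocessing import Process, Pipe

def worker(conn):
    data = conn.recv()              # blocks until a message arrives (synchronization via messages)
    conn.send(sum(data))            # send the partial result back
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send([1, 2, 3, 4])   # communication via send
    print(parent_end.recv())        # communication via receive -> 10
    p.join()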
SIMD Computers
[Figure: a von Neumann computer is a single processor-memory pair; a SIMD computer is an array of processor-memory pairs (P, M) connected by some interconnection network.]
SIMD (Data Parallel)
Parallel operations within a computation are partitioned spatially rather than temporally
Scalar instructions vs. array instructions
Processors are incapable of operating autonomously; they must be driven by the control unit
Past Trends in Parallel Architecture (inside the box)
Completely custom designed components (processors, memory, interconnects, I/O)
Longer R&D time (2-3 years)
Expensive systems
Quickly becoming outdated
Bankrupt companies!!
New Trends in Parallel Architecture (outside the box)
Advances in commodity processors and network technology
Network of PCs and workstations connected via LAN or WAN forms a parallel system
Network computing
Compete favorably (cost/performance)
Utilize unused cycles of systems sitting idle
Clusters
[Figure: a cluster. Each node has a processor P, cache C, memory M, I/O, and its own OS; the nodes are connected by an interconnection network, with middleware and a programming environment layered on top.]
Grids
Grids are geographically distributed platforms for computation.
They provide dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.
Problem
Assume that a switching component such as a transistor can switch in zero time. We propose to construct a disk-shaped computer chip with such a component. The only limitation is the time it takes to send electronic signals from one edge of the chip to the other. Make the simplifying assumption that electronic signals travel 300,000 kilometers per second. What must be the diameter of a round chip so that it can switch 10^9 times per second? What would the diameter be if the switching requirement were 10^12 times per second?
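A worked check of the arithmetic (added here, not part of the original slide): the signal must cross the diameter d within one switching period 1/f, so d <= c/f with c = 3 x 10^8 m/s:

\[
d \le \frac{c}{f} = \frac{3\times10^{8}\ \text{m/s}}{10^{9}\ \text{s}^{-1}} = 0.3\ \text{m} = 30\ \text{cm},
\qquad
\frac{3\times10^{8}\ \text{m/s}}{10^{12}\ \text{s}^{-1}} = 3\times10^{-4}\ \text{m} = 0.3\ \text{mm}.
\]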
Grosch’s Law (1960s)
“To sell a computer for twice as much, it must be four times as fast”
Vendors skip small speed improvements in favor of waiting for large ones
Buyers of expensive machines would wait for a twofold improvement in performance for the same price
Moore’s Law
Gordon Moore (cofounder of Intel)
Processor performance would double every 18 months
This prediction has held for several decades
Unlikely that single-processor performance continues to increase indefinitely
Von Neumann’s bottleneck
Great mathematician of the 1940s and 1950s
Single control unit connecting a memory to a processing unit
Instructions and data are fetched one at a time from memory and fed to the processing unit
Speed is limited by the rate at which instructions and data are transferred from memory to the processing unit
Parallelism
Multiple CPUs
Within the CPU
  One pipeline
  Multiple pipelines
Speedup
S = Speed(new) / Speed(old)
  = [Work / time(new)] / [Work / time(old)]
  = time(old) / time(new)
  = time(before improvement) / time(after improvement)
Speedup
Time on one CPU: T(1)
Time on n CPUs: T(n)
Speedup: S = T(1) / T(n)
Amdahl’s Law
The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
Example
A 200-mile trip from A to B by some vehicle, plus a stretch that must be walked and always takes 20 hours:

Mode    Speed             Total time                  Speedup
Walk    4 miles/hour      50 + 20 = 70 hours          S = 1
Bike    10 miles/hour     20 + 20 = 40 hours          S = 1.8
Car-1   50 miles/hour     4 + 20 = 24 hours           S = 2.9
Car-2   120 miles/hour    1.67 + 20 = 21.67 hours     S = 3.2
Car-3   600 miles/hour    0.33 + 20 = 20.33 hours     S = 3.4

No matter how fast the vehicle, the mandatory 20 hours of walking limits the achievable speedup.
Amdahl’s Law (1967)
α: the fraction of the program that is naturally serial
(1 - α): the fraction of the program that is naturally parallel

S = T(1) / T(N), where T(N) = α T(1) + (1 - α) T(1) / N

S = 1 / (α + (1 - α)/N) = N / (α N + (1 - α))
Amdahl’s Law
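As a quick numerical illustration (a sketch added here, not from the slides), the following Python snippet evaluates Amdahl's speedup S = N / (αN + (1 - α)) for a few processor counts, assuming a serial fraction α = 0.1.

# Amdahl's law: S = N / (alpha * N + (1 - alpha)), alpha = serial fraction.
# alpha = 0.1 is an assumed value for illustration.
def amdahl_speedup(alpha, n):
    return n / (alpha * n + (1 - alpha))

if __name__ == "__main__":
    alpha = 0.1
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(alpha, n), 2))
    # The speedup approaches 1/alpha = 10 no matter how many processors are added.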
Gustafson-Barsis Law
N and α are not independent of each other
α: the fraction of the program that is naturally serial
T(N) = 1 (normalize the parallel run time)
T(1) = α + (1 - α) N
S = T(1) / T(N) = α + (1 - α) N = N - (N - 1) α
Gustafson-Barsis Law
Comparison of Amdahl’s Law vs. Gustafson-Barsis’ Law
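A small comparison sketch (added, not from the slides), again assuming a serial fraction α = 0.1: Amdahl's law fixes the problem size, while Gustafson-Barsis scales the problem with N, so the two formulas give very different speedups for the same α.

# Compare fixed-size (Amdahl) vs. scaled-size (Gustafson-Barsis) speedup.
# alpha = 0.1 is an assumed serial fraction for illustration.
alpha = 0.1
for n in (10, 100, 1000):
    amdahl = n / (alpha * n + (1 - alpha))     # S = N / (alpha*N + (1 - alpha))
    gustafson = n - (n - 1) * alpha            # S = N - (N - 1)*alpha
    print(f"N={n:5d}  Amdahl={amdahl:6.2f}  Gustafson-Barsis={gustafson:7.1f}")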
Example
for I = 1 to 10 do
begin
    S[I] = 0.0;
    for J = 1 to 10 do
        S[I] = S[I] + M[I, J];
    S[I] = S[I] / 10;
end
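Each iteration of the outer loop computes the average of one row of M independently of the others, so the ten iterations can run in parallel. A minimal Python sketch of that idea (added here, not from the slides; the matrix values are made up):

# Row averages of a matrix: the outer loop of the example above is
# embarrassingly parallel because the rows are independent.
from multiprocessing import Pool

def row_average(row):
    return sum(row) / len(row)

if __name__ == "__main__":
    # Illustrative 10x10 matrix M[i][j] = i + j (any values would do).
    M = [[i + j for j in range(10)] for i in range(10)]
    with Pool(4) as pool:            # 4 worker processes, chosen arbitrarily
        S = pool.map(row_average, M) # one task per row, like one processor per I
    print(S)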
Distributed Computing Performance
Single Program Performance
Multiple Program Performance
PRAM Model
What is a Model?
According to Webster’s Dictionary, a model is “a description or analogy used to help visualize something that cannot be directly observed.”
According to The Oxford English Dictionary, a model is “a simplified or idealized description or conception of a particular system, situation or process.”
Why Models?
In general, the purpose of modeling is to capture the salient characteristics of phenomena with clarity and the right degree of accuracy to facilitate analysis and prediction.
Megg, Matheson and Tarjan (1995)
Models in Problem Solving
Computer scientists use models to help design problem-solving tools such as:
Fast algorithms
Effective programming environments
Powerful execution engines
An Interface
A model is an interface separating high-level properties from low-level ones.
[Figure: the MODEL sits between Applications and Architectures; it provides operations to applications and requires an implementation from architectures.]
PRAM Model
Synchronized read-compute-write cycle
EREW, ERCW, CREW, CRCW
Complexity: T(n), P(n), C(n)
[Figure: processors P1, P2, …, Pp, each with its own private memory, connected to a global memory under a common control unit.]
The PRAM model and its variations (cont.)
There are different modes for read and write operations in a PRAM:
Exclusive read (ER)
Exclusive write (EW)
Concurrent read (CR)
Concurrent write (CW): common, arbitrary, minimum, priority
Based on the different modes described above, the PRAM can be further divided into the following four subclasses:
EREW-PRAM model
CREW-PRAM model
ERCW-PRAM model
CRCW-PRAM model
Analysis of Algorithms
Sequential algorithms
Time complexity
Space complexity
An algorithm whose time complexity is bounded by a polynomial is called a polynomial-time algorithm. An algorithm is considered to be efficient if it runs in polynomial time.
Analysis of Sequential Algorithms
[Figure: the relationships among P, NP, NP-complete, and NP-hard.]
Analysis of Parallel Algorithms
Performance of a parallel algorithm is expressed in terms of how fast it is and how much resources it uses when it runs:
Run time, defined as the time elapsed during the execution of the algorithm
Number of processors the algorithm uses to solve the problem
The cost of the parallel algorithm, which is the product of the run time and the number of processors
Analysis of Parallel Algorithms
The NC-class and P-completeness
[Figure: the relationships among P, NP, NP-complete, NP-hard, NC, and P-complete (assuming P ≠ NP and NC ≠ P).]
Simulating multiple accesses on an EREW PRAM
Broadcasting mechanism:
P1 reads x and makes it known to P2.
P1 and P2 make x known to P3 and P4, respectively, in parallel.
P1, P2, P3 and P4 make x known to P5, P6, P7 and P8, respectively, in parallel.
These eight processors will make x known to another eight processors, and so on.
Simulating multiple accesses on an EREW PRAM (cont.)
[Figure: simulating concurrent read on an EREW PRAM with eight processors using Algorithm Broadcast_EREW. In steps (a) through (d), the value x spreads through the shared array L: from P1 to P2, then to P3 and P4, and finally to P5, P6, P7 and P8.]
Simulating multiple accesses on an EREW PRAM (cont.)
Algorithm Broadcast_EREW
Processor P1
    y (in P1’s private memory) ← x
    L[1] ← y
for i = 0 to log p - 1 do
    forall Pj, where 2^i + 1 ≤ j ≤ 2^(i+1), do in parallel
        y (in Pj’s private memory) ← L[j - 2^i]
        L[j] ← y
    endfor
endfor
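A small Python simulation of the broadcast (added as a sketch, not from the slides): the shared array L plays the role of the global memory, and in round i processors 2^i + 1 through 2^(i+1) each read a distinct cell and write a distinct cell, so no location is accessed by two processors in the same round (the EREW constraint).

# Simulate Algorithm Broadcast_EREW for p processors (p a power of two).
# L models the shared array; each round doubles the number of copies of x.
from math import log2

def broadcast_erew(x, p=8):
    L = [None] * (p + 1)          # 1-indexed shared array
    L[1] = x                      # P1 writes x into L[1]
    for i in range(int(log2(p))):
        # Processors Pj with 2^i + 1 <= j <= 2^(i+1) act in parallel;
        # each reads a distinct cell L[j - 2^i] and writes a distinct cell L[j].
        for j in range(2**i + 1, 2**(i + 1) + 1):
            L[j] = L[j - 2**i]
    return L[1:]                  # every processor's cell now holds x

print(broadcast_erew("x", p=8))   # ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x']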
Bus-based Shared Memory
[Figure: processors P connected to a global memory over a shared bus.]
Collection of wires and connectors
Only one transaction at a time
Bottleneck!! How can we solve the problem?
Single Processor Caching
[Figure: a processor P with a cache; both the cache and memory hold a copy of x.]
Hit: data is in the cache; hit rate: h
Miss: data is not in the cache; miss rate: m = (1 - h)
Writing in the Cache
[Figure: before the write, both the cache and memory hold x. With write-through, the cache and memory both hold the new value x'. With write-back, the cache holds x' while memory still holds the old value x.]
Using Caches
[Figure: processors P1 … Pn, each with a cache C1 … Cn, share a global memory over a bus.]
How many processors?
Cache coherence problem
Group Activity
Variables:
Number of processors (n)
Hit rate (h)
Bus bandwidth (B)
Processor speed (v)
Condition: n (1 - h) v <= B
Maximum number of processors: n = B / ((1 - h) v)
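A hedged numeric illustration (the values below are assumed, not from the slides): with hit rate h = 0.95, a processor issuing v = 100 million memory references per second, and a bus carrying B = 1000 million references per second, the bound gives about 200 processors.

# Assumed example values (not from the slides): h = hit rate,
# v = memory references/s per processor, B = bus capacity in references/s.
h, v, B = 0.95, 100e6, 1000e6
n_max = B / ((1 - h) * v)
print(n_max)   # 200.0 -> the bus saturates at about 200 processors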
Cache Coherence
[Figure: processors P1, P2, P3, …, Pn each hold a cached copy of x.]
Multiple copies of x
What if P1 updates x?
Cache Coherence Policies
Writing to cache in the 1-processor case:
Write through
Write back
Writing to cache in the n-processor case:
Write update - write through
Write update - write back
Write invalidate - write through
Write invalidate - write back
Write-Invalidate
[Figure: before the write, P1, P2 and P3 all cache x. After P1 writes x', the other cached copies are invalidated (I); with write-through the memory is updated to x', while with write-back the memory still holds x.]
Write-Update
[Figure: before the write, P1, P2 and P3 all cache x. After P1 writes x', all cached copies are updated to x'; with write-through the memory is updated to x', while with write-back the memory still holds x.]
Synchronization
[Figure: locks. P1, P2 and P3 each execute Lock … unlock around a critical section; while one processor holds the lock, the others wait.]
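A minimal sketch of the lock/unlock pattern in Python (added, not from the slides), using threading.Lock: each thread acquires the lock, updates a shared counter in its critical section, and releases it, so the other threads wait just as in the figure.

# Lock ... critical section ... unlock, with three threads contending for one lock.
import threading

counter = 0
lock = threading.Lock()

def worker():
    global counter
    for _ in range(100_000):
        with lock:            # Lock: other threads wait here
            counter += 1      # critical section (shared-memory update)
        # unlock happens automatically when the with-block exits

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                # always 300000 with the lock in place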
Superscalar Parallelism
Scheduling