
Chapter 2

Parallel Architectures

Outline

• Some chapter references
• Brief review of complexity
– Terminology for comparisons
• Interconnection networks
• Processor arrays
• Multiprocessors
• Multicomputers
• Flynn’s Taxonomy – moved to Chapter 1

Some Chapter References

• Selim Akl, The Design and Analysis of Parallel Algorithms, Prentice Hall, 1989 (earlier textbook).

• G. C. Fox, What Have We Learnt from Using Real Parallel Machines to Solve Real Problems? Technical Report C3P-522, Cal Tech, December 1989. (Included in part in more recent books co-authored by Fox.) • A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, Second Edition, 2003 (first edition 1994), Addison Wesley.

• Harry Jordan, Gita Alaghband, Fundamentals of Parallel Processing: Algorithms, Architectures, Languages, Prentice Hall, 2003, Ch 1, 3-5.

3

References - continued

• Gregory Pfister, In Search of Clusters: The Ongoing Battle in Lowly Parallelism, 2nd Edition, Ch. 2. (Discusses details of some serious problems that MIMDs incur.)

• Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004 (current textbook), Chapter 2.

• Michael Quinn, Parallel Computing: Theory and Practice, McGraw Hill, 1994, Ch. 1-2.

• Sayed H. Roosta, Parallel Processing & Parallel Algorithms: Theory and Computation, Springer Verlag, 2000, Ch. 1.

• Wilkinson & Allen, Parallel Programming: Techniques and Applications, Prentice Hall, 2nd Edition, 2005, Ch. 1-2.

4

Brief Review

Complexity Concepts Needed for Comparisons

• Whenever we define a counting function, we usually characterize the growth rate of that function in terms of complexity classes.

• Definition: We say a function f(n) is in O(g(n)) if (and only if) there are positive constants c and n₀ such that 0 ≤ f(n) ≤ c·g(n) for n ≥ n₀.

• O(n) is read as “big-oh of n.”

• This notation can be used to separate counts into complexity classes that characterize the size of the count.

• We can use it for any kind of counting functions such as timings, bisection widths, etc.

5
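As a quick illustration (an added example, not from the slides): f(n) = 3n² + 5n is in O(n²), since 3n² + 5n ≤ 4n² whenever n ≥ 5, so the definition is satisfied with c = 4 and n₀ = 5.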

Big-Oh and Asymptotic Growth Rate

• The big-Oh notation gives an upper bound on the (asymptotic) growth rate of a function.
• The statement “f(n) is O(g(n))” means that the growth rate of f(n) is no more than the growth rate of g(n).
• We can use the big-Oh notation to rank functions according to their growth rate.

Assume:             f(n) is O(g(n))    g(n) is O(f(n))
g(n) grows faster   Yes                No
f(n) grows faster   No                 Yes
Same growth         Yes                Yes

Relatives of Big-Oh

• big-Omega
– f(n) is Ω(g(n)) if there is a constant c > 0 and an integer constant n₀ ≥ 1 such that f(n) ≥ c·g(n) ≥ 0 for n ≥ n₀.
– Intuitively, this says that, up to a constant factor, f(n) is asymptotically greater than or equal to g(n).

• big-Theta
– f(n) is Θ(g(n)) if there are constants c’ > 0 and c’’ > 0 and an integer constant n₀ ≥ 1 such that 0 ≤ c’·g(n) ≤ f(n) ≤ c’’·g(n) for n ≥ n₀.
– Intuitively, this says that, up to constant factors, f(n) and g(n) are asymptotically the same.

Relatives of Big-Oh

• little-oh
– f(n) is o(g(n)) if, for any constant c > 0, there is an integer constant n₀ ≥ 0 such that 0 ≤ f(n) < c·g(n) for n ≥ n₀.
– Intuitively, this says f(n) is, up to a constant, asymptotically strictly less than g(n), so f(n) ≠ Θ(g(n)).

• little-omega
– f(n) is ω(g(n)) if, for any constant c > 0, there is an integer constant n₀ ≥ 0 such that f(n) > c·g(n) ≥ 0 for n ≥ n₀.
– Intuitively, this says f(n) is, up to a constant, asymptotically strictly greater than g(n), so f(n) ≠ Θ(g(n)).

These are not used as much as the earlier definitions, but they round out the picture.

8

Summary for Intuition for Asymptotic Notation

• big-Oh – f(n) is O(g(n)) if f(n) is asymptotically less than or equal to g(n)
• big-Omega – f(n) is Ω(g(n)) if f(n) is asymptotically greater than or equal to g(n)
• big-Theta – f(n) is Θ(g(n)) if f(n) is asymptotically equal to g(n)
• little-oh – f(n) is o(g(n)) if f(n) is asymptotically strictly less than g(n)
• little-omega – f(n) is ω(g(n)) if f(n) is asymptotically strictly greater than g(n)

A CALCULUS DEFINITION OF O, Θ (often easier to use)

Definition: Let f and g be functions defined on the positive integers with nonnegative values.

We say g is in O(f) if and only if
    lim (n → ∞) g(n)/f(n) = c
for some nonnegative real number c, i.e., the limit exists and is not infinite.

Definition: We say f is in Θ(g) if and only if f is in O(g) and g is in O(f).

Note: L'Hopital's Rule is often used to calculate the limits you need.

10
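As a quick illustration (an added example, not from the slides): n² is in O(n³) because lim (n → ∞) n²/n³ = lim 1/n = 0, which is finite; but n³ is not in O(n²) because lim (n → ∞) n³/n² = lim n = ∞, so the two functions are not Θ of each other.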

Why Asymptotic Behavior is Important • 1) Allows us to compare counts on large sets.

• 2) Helps us understand the maximum size of input that can be handled in a given time, provided we know the environment in which we are running.

• 3) Stresses the fact that even dramatic speedups in hardware do not overcome the handicap of an asymptotically slow algorithm.

11

Recall: ORDER WINS OUT

(Example from Baase’s Algorithms Text)

The TRS-80
– Main language support: BASIC, typically a slow-running interpreted language
– For more details on the TRS-80 see: http://mate.kjsl.com/trs80/

The CRAY-YMP
– Language used in the example: FORTRAN, a fast-running language
– For more details on the CRAY-YMP see: http://ds.dial.pipex.com/town/park/abm64/CrayWWWStuff/Cfaqp1.html#TOC3

microsecond (abbr. µsec): one-millionth of a second. millisecond (abbr. msec): one-thousandth of a second.

n          CRAY YMP with FORTRAN        TRS-80 with BASIC
           (complexity is 3n^3)         (complexity is 19,500,000n)
10         3 microsec                   200 millisec
100        3 millisec                   2 sec
1000       3 sec                        20 sec
2500       50 sec                       50 sec
10000      49 min                       3.2 min
1000000    95 years                     5.4 hours
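A quick sanity check of the table above (an added sketch, not from the slides; it assumes each of the counted operations takes roughly one nanosecond, which reproduces the values shown):

# Recompute the "ORDER WINS OUT" table, assuming ~1 ns per counted operation.
for n in [10, 100, 1000, 2500, 10000, 1000000]:
    cray_sec  = 3 * n**3 / 1e9        # CRAY YMP: 3n^3 operations
    trs80_sec = 19_500_000 * n / 1e9  # TRS-80: 19,500,000n operations
    print(f"n={n:>8}: CRAY {cray_sec:12.6g} s   TRS-80 {trs80_sec:12.6g} s")

For n = 1,000,000 this gives about 3 × 10^9 seconds (roughly 95 years) for the CRAY and about 19,500 seconds (roughly 5.4 hours) for the TRS-80, matching the table.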

Interconnection Networks

• Uses of interconnection networks – Connect processors to shared memory – Connect processors to each other • Interconnection media types – Shared medium – Switched medium • Different interconnection networks define different parallel machines.

• The interconnection network’s properties influence the type of algorithm used for various machines as it affects how data is routed.

14

Shared versus Switched Media

15

Shared Medium

• Allows only one message at a time • Messages are broadcast • Each processor “listens” to every message • Before sending a message, a processor “listens” until the medium is unused • Collisions require resending of messages • Ethernet is an example

Switched Medium

• Supports point-to-point messages between pairs of processors • Each processor is connected to one switch • Advantages over shared media – Allows multiple messages to be sent simultaneously – Allows scaling of the network to accommodate the increase in processors 17

Switch Network Topologies

• View switched network as a graph – Vertices = processors or switches – Edges = communication paths • Two kinds of topologies – Direct – Indirect 18

Direct Topology
• Ratio of switch nodes to processor nodes is 1:1
• Every switch node is connected to
– 1 processor node
– At least 1 other switch node

Indirect Topology
• Ratio of switch nodes to processor nodes is greater than 1:1
• Some switches simply connect to other switches

Terminology for Evaluating Switch Topologies • We need to evaluate 4 characteristics of a network in order to help us understand their effectiveness in implementing efficient parallel algorithms on a machine with a given network.

• These are – The diameter – The bisection width – The edges per node – The constant edge length • We’ll define these and see how they affect algorithm choice.

• Then we will investigate several different topologies and see how these characteristics are evaluated.

20

Terminology for Evaluating Switch Topologies • Diameter – Largest distance between two switch nodes.

– Low diameter is good – It puts a lower bound on the complexity of parallel algorithms which require communication between arbitrary pairs of nodes.

21

Terminology for Evaluating Switch Topologies
• Bisection width
– The minimum number of edges between switch nodes that must be removed in order to divide the network into two halves (within 1 node, if the number of processors is odd).
• High bisection width is good.
• In algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of an algorithm.
• Actually proving what the bisection width of a network is can be quite difficult.

22

Terminology for Evaluating Switch Topologies • Number of edges / node – It is best if the number of edges/node is a constant independent of network size as that allows more scalability of the system to a larger number of nodes.

Degree

is the maximum number of edges per node.

• Constant edge length? (yes/no) – Again, for scalability, it is best if the nodes and edges can be laid out in 3D space so that the maximum edge length is a constant independent of network size.

23

Evaluating Switch Topologies • Many have been proposed and analyzed. We will consider several well known ones: – 2-D mesh – linear network – binary tree – hypertree – butterfly – hypercube – shuffle-exchange • Those in yellow have been used in commercial parallel computers.

24

2-D Meshes

Note: Circles represent switches and squares represent processors in all these slides.

25

2-D Mesh Network

• Direct topology • Switches arranged into a 2-D lattice or grid • Communication allowed only between neighboring switches • Torus: Variant that includes wraparound connections between switches on edge of mesh 26

Evaluating 2-D Meshes

(Assumes the mesh is square; n = number of processors)
• Diameter: Θ(n^(1/2))
– Places a lower bound on algorithms that require processing with arbitrary nodes sharing data.
• Bisection width: Θ(n^(1/2))
– Places a lower bound on algorithms that require distribution of data to all nodes.
• Max number of edges per switch: 4 (note: this is the degree)
• Constant edge length? Yes
• Does this scale well? Yes

27
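As a concrete check (an added example, not from the slides): a 16-processor mesh arranged as a 4 × 4 grid has diameter 2(4 − 1) = 6 and bisection width 4, both consistent with the Θ(n^(1/2)) figures above.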

Linear Network

• Switches arranged into a 1-D mesh • Corresponds to a row or column of a 2-D mesh • Ring : A variant that allows a wraparound connection between switches on the end.

• The linear and ring networks have many applications • Essentially supports a pipeline in both directions • Although these networks are very simple, they support many optimal algorithms.

28

Evaluating Linear and Ring Networks
• Diameter:
– Linear: n − 1, or Θ(n)
– Ring: ⌊n/2⌋, or Θ(n)
• Bisection width:
– Linear: 1, or Θ(1)
– Ring: 2, or Θ(1)
• Degree for switches: 2
• Constant edge length? Yes
• Does this scale well? Yes

Binary Tree Network

• Indirect topology
• n = 2^d processor nodes and 2n − 1 switches, where d = 0, 1, ... is the number of levels
– e.g., for d = 3 there are 2^3 = 8 processors on the bottom and 2n − 1 = 2(8) − 1 = 15 switches

Evaluating Binary Tree Network

• Diameter: – 2 log n – Note- this is small • Bisection width: – 1, the lowest possible number • Degree: – 3 • Constant edge length? – No • Does this scale well?

– No 31

Hypertree Network (of degree 4 and depth 2)
(a) Front view: 4-ary tree of height 2
(b) Side view: upside-down binary tree of height d
(c) Complete network

Hypertree Network

• Indirect topology • Note- the degree k and the depth d must be specified. • This gives from the front a k-ary tree of height d.

• From the side, the same network looks like an upside down binary tree of height d.

• Joining the front and side views yields the complete network.

33

Evaluating a 4-ary Hypertree with n = 16 processors
• Diameter: log n
– Shares the low diameter of the binary tree
• Bisection width: n / 2
– A large value, much better than the binary tree
• Edges / node: 6
• Constant edge length? No

Butterfly Network
• Indirect topology
• n = 2^d processor nodes connected by n(log n + 1) switching nodes
• Example: a 2^3 = 8 processor butterfly network has 8 × 4 = 32 switching nodes, arranged in rows of 8 switches labeled (rank, column). The rows are called ranks (Rank 0 through Rank 3 in this example).
• Note: The bottom row of switches is normally identical with the top row.
• As complicated as this switching network appears to be, it is really quite simple, as it admits a very nice routing algorithm!

Building the 2^3 Butterfly Network
• There are 8 processors.
• Have 4 ranks (i.e., rows) with 8 switches per rank.
• Connections:
– Node(i,j), for i > 0, is connected to two nodes on rank i−1, namely node(i−1,j) and node(i−1,m), where m is the integer found by inverting the i-th most significant bit in the binary d-bit representation of j.
– For example, suppose i = 2 and j = 3. Then node(2,3) is connected to node(1,3).
– To get the other connection, 3 = 011₂. Flipping the 2nd most significant bit gives 001₂ = 1, so node(2,3) is also connected to node(1,1). (NOTE: There is an error on pg 32 on this example.) A sketch of this connection rule follows below.

36
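A minimal sketch (added here, not from the slides) of the connection rule just described, for a butterfly with n = 2^d processors; the function name and arguments are illustrative:

# Return the two rank-(i-1) switches that switch (i, j) connects to, for i > 0.
def butterfly_up_links(i, j, d):
    assert i > 0
    straight = (i - 1, j)                 # node(i-1, j): the "straight" edge
    m = j ^ (1 << (d - i))                # invert the i-th most significant bit of j
    cross = (i - 1, m)                    # node(i-1, m): the "cross" edge
    return straight, cross

# Example from the slide: d = 3, node(2, 3) connects to node(1, 3) and node(1, 1).
print(butterfly_up_links(2, 3, 3))        # ((1, 3), (1, 1))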

Why It Is Called a Butterfly Network

• Walk cycles such as node(i,j), node(i-1,j), node(i,m), node(i-1,m), node(i,j) where m is determined by the bit flipping as shown and you “see” a butterfly: 37

Butterfly Network Routing

Send message from processor 2 to processor 5.

Algorithm: 0 means ship left; 1 means ship right.

1) 5 = 101. Pluck off leftmost bit 1 and send “01msg” to right.

2) Pluck off leftmost bit 0 and send “1msg” to left.

3) Pluck off leftmost bit 1 and send “msg” to right.

38
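A minimal sketch (added here, not from the slides) of the destination-tag routing idea just described: at each rank the next bit of the destination address decides whether to ship left (0) or right (1).

def butterfly_route(dest, d):
    """Return the list of routing decisions for a d-rank butterfly."""
    bits = format(dest, f"0{d}b")          # e.g. dest = 5, d = 3 -> "101"
    return ["right" if b == "1" else "left" for b in bits]

# Example from the slide: routing to processor 5 = 101 gives right, left, right.
print(butterfly_route(5, 3))               # ['right', 'left', 'right']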

Evaluating the Butterfly Network
• Diameter: log n
• Bisection width: n / 2
• Edges per node: 4 (even for d ≥ 3)
• Constant edge length? No – as the rank decreases, the edge length grows exponentially

Hypercube (or binary n-cube)

• n = 2^d processors and n switch nodes
• Can be viewed as a butterfly with the columns of switch nodes collapsed into a single node.

40

Hypercube (or binary n-cube): n = 2^d processors and n switch nodes
• Direct topology
• A 2 × 2 × … × 2 mesh
• Number of nodes is a power of 2
• Node addresses are 0, 1, …, 2^d − 1
• Node i is connected to the d nodes whose addresses differ from i in exactly one bit position.
• Example: node 0111 is connected to 1111, 0011, 0101, and 0110.

Growing a Hypercube Note: For d = 4, it is a 4 dimensional cube.

42

Evaluating Hypercube Network

• Diameter: log n
• Bisection width: n / 2
• Edges per node: log n
• Constant edge length? No. The length of the longest edge increases as n increases.

Routing on the Hypercube Network • Example: Send a message from node 2 = 0010 to node 5 = 0101 • The nodes differ in 3 bits so the shortest path will be of length 3.

• One path is 0010  0110  0100  0101 obtained by flipping one of the differing bits at each step.

• As with the butterfly network, bit flipping helps you route on this network.
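A minimal sketch (added here, not from the slides) of the bit-flipping route just described: repeatedly flip one of the bits in which the current node and the destination differ.

def hypercube_route(src, dst, d):
    """Return one shortest path of node addresses from src to dst in a d-cube."""
    path, cur = [src], src
    for bit in range(d - 1, -1, -1):       # fix differing bits from high to low
        mask = 1 << bit
        if (cur ^ dst) & mask:
            cur ^= mask
            path.append(cur)
    return path

# Example from the slide: 0010 -> 0101 (nodes 2 and 5 differ in 3 bits).
print([format(x, "04b") for x in hypercube_route(0b0010, 0b0101, 4)])
# ['0010', '0110', '0100', '0101']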

A Perfect Shuffle

• A permutation that is produced as follows is called a perfect shuffle:
– Given a power-of-2 number of cards, numbered 0, 1, 2, ..., 2^d − 1, write each card number with d bits. By left-rotating the bits with a wrap, we calculate the position of the card after the perfect shuffle.
• Example: For d = 3, card 5 = 101. Left rotating and wrapping gives us 011. So, card 5 goes to position 3. Note that card 0 = 000 and card 7 = 111 stay in position.

45
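A minimal sketch (added here, not from the slides) of the perfect shuffle just described: the new position of a card is its d-bit index rotated left by one bit, with wrap.

def perfect_shuffle(card, d):
    """Return the position of `card` after a perfect shuffle of 2**d cards."""
    high = (card >> (d - 1)) & 1                    # the bit that wraps around
    return ((card << 1) & ((1 << d) - 1)) | high

# Example from the slide: for d = 3, card 5 = 101 -> 011 = 3; cards 0 and 7 stay put.
print([perfect_shuffle(c, 3) for c in range(8)])    # [0, 2, 4, 6, 1, 3, 5, 7]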

Shuffle-exchange Network Illustrated

• Direct topology
• Number of nodes is a power of 2
• Nodes have addresses 0, 1, …, 2^d − 1
• Two outgoing links from node i:
– Shuffle link to node LeftCycle(i)
– Exchange link between node i and node i+1 when i is even

Shuffle-exchange Addressing – 16 processors
• No arrows on a line segment means it is bidirectional. Otherwise, you must follow the arrows.

Devising a routing algorithm for this network is interesting and will be a homework problem.

47

Evaluating the Shuffle-exchange
• Diameter: 2 log n − 1
• Bisection width: n / log n
• Edges per node: 3
• Constant edge length? No

Two Problems with Shuffle-Exchange

• Shuffle-Exchange does not expand well – A large shuffle-exchange network does not compose well into smaller separate shuffle exchange networks.

– In a large shuffle-exchange network, a small percentage of nodes will be hot spots • They will encounter much heavier traffic • Above results are in dissertation of one of Batcher’s students. 49

Comparing Networks

• All have logarithmic diameter except the 2-D mesh
• Hypertree, butterfly, and hypercube have bisection width n / 2
• All have a constant number of edges per node except the hypercube
• Only the 2-D mesh, linear, and ring topologies keep edge lengths constant as network size increases
• Shuffle-exchange is a good compromise: a fixed number of edges per node, low diameter, good bisection width.
– However, the negative results on the preceding slide also need to be considered.

50

Alternate Names for SIMDs

• Recall that all active processors of a SIMD computer must simultaneously access the same memory location.
• The value in the i-th processor can be viewed as the i-th component of a vector.
• SIMD machines are sometimes called vector computers [Jordan, et al.] or processor arrays [Quinn 94, 04] based on their ability to execute vector and matrix operations efficiently.

SIMD Computers

• SIMD computers that focus on vector operations – Support some vector and possibly matrix operations in hardware – Usually limit or provide less support for non vector type operations involving data in the “vector components”.

• General purpose SIMD computers – Support more traditional type operations (e.g., other than for vector/matrix data types). – Usually also provide some vector and possibly matrix operations in hardware.

52

Pipelined Architectures

• Pipelined architectures are sometimes considered to be SIMD architectures – See pg 37 of Textbook & pg 8-9 Jordan et. al.

– Vector components are entered successively into first processor in pipeline.

– The i-th processor of the pipeline receives the output from the (i-1)th processor.

• Normal “operations” in each processor are much larger (coarser) in pipelined computers than in true SIMDs. • “Pipelined” architectures are only somewhat SIMD in nature, in that synchronization is not required.

53

Why Processor Arrays?

• Historically, high cost of control units • Scientific applications have data parallelism 54

Data/instruction Storage

• Front-end computer
– Also called the control unit
– Holds and runs the program
– Data manipulated sequentially
• Processor array
– Data manipulated in parallel

Processor Array Performance

• Performance: work done per time unit • Performance of processor array – Speed of processing elements – Utilization of processing elements 56

Performance Example 1

• 1024 processors
• Each adds a pair of integers in 1 µsec (1 microsecond, or one millionth of a second, or 10^-6 second)
• What is the performance when adding two 1024-element vectors (one element pair per processor)?

Performance = 1024 operations / 1 µsec = 1.024 × 10^9 ops/sec

Performance Example 2
• 512 processors
• Each adds two integers in 1 µsec
• What is the performance when adding two vectors of length 600?
• Since 600 > 512, 88 processors must each add two pairs of integers.

• The other 424 processors add only a single pair of integers.

58
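The slides leave the final arithmetic implicit; following the same reasoning as Example 1 (an added note, not from the slides): the 88 doubly-loaded processors need 2 µsec, so the whole vector addition takes 2 µsec and the performance is 600 operations / 2 µsec = 3 × 10^8 ops/sec, reflecting the lower utilization of the processing elements.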

Example of a 2-D Processor Interconnection Network in a Processor Array Each VLSI chip has 16 processing elements.

Each PE can simultaneously send a value to a neighbor.

PE = processor element 59

SIMD Execution Style

• The traditional (SIMD, vector, processor array) execution style ([Quinn 94, pg 62], [Quinn 2004, pgs 37-43]):
– The sequential processor that broadcasts the commands to the rest of the processors is called the front end or control unit.
– The front end is a general-purpose CPU that stores the program and the data that is not manipulated in parallel.
– The front end normally executes the sequential portions of the program.
– Each processing element has a local memory that cannot be directly accessed by the host or other processing elements.

60

SIMD Execution Style

– Collectively, the individual memories of the processing elements (PEs) store the (vector) data that is processed in parallel. – When the front end encounters an instruction whose operand is a vector, it issues a command to the PEs to perform the instruction in parallel. – Although the PEs execute in parallel, some units can be allowed to skip any particular instruction.

61

Masking on Processor Arrays

• All the processors work in lockstep except those that are masked out (by setting the mask register).
• The conditional if-then-else is different for processor arrays than the sequential version:
– Every active processor tests to see if its data meets the negation of the boolean condition.
– If it does, it sets its mask bit so that the processor will not participate in the operation initially.
– Next, the unmasked processors execute the THEN part.
– Afterwards, the mask bits (for the original set of active processors) are flipped and the newly unmasked processors perform the ELSE part.
(A sketch of this masked execution follows below.)

62
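A minimal sketch (added here, not from the slides) simulating the masked if-then-else just described on a SIMD-style array of "processors": every element sees both the THEN and ELSE steps, and the mask decides which elements actually update.

def simd_if_then_else(data, cond, then_op, else_op):
    mask = [not cond(x) for x in data]                              # mask out PEs whose data fails the condition
    data = [x if m else then_op(x) for x, m in zip(data, mask)]     # unmasked PEs execute the THEN part
    mask = [not m for m in mask]                                    # flip the mask bits
    data = [x if m else else_op(x) for x, m in zip(data, mask)]     # newly unmasked PEs execute the ELSE part
    return data

# Example: if (x is even) then x = x // 2 else x = 3*x + 1
print(simd_if_then_else([1, 2, 3, 4], lambda x: x % 2 == 0,
                        lambda x: x // 2, lambda x: 3 * x + 1))     # [4, 1, 10, 2]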

if (COND) then A else B

63

if (COND) then A else B

64

if (COND) then A else B

65

SIMD Machines

• An early SIMD computer designed for vector and matrix processing was the Illiac IV computer – built at the University of Illinois – See Jordan et. al., pg 7 • The MPP, DAP, the Connection Machines CM-1 and CM-2, MasPar MP-1 and MP-2 are examples of SIMD computers – See Akl pg 8-12 and [Quinn, 94] • The CRAY-1 and the Cyber-205 use pipelined arithmetic units to support vector operations and are sometimes called a pipelined SIMD – See [Jordan, et al, p7], [Quinn 94, pg 61-2], and [Quinn 2004, pg37).

66

SIMD Machines

• Quinn [1994, pg 63-67] discusses the CM-2 Connection Machine and a smaller & updated CM-200.

• Professor Batcher was the chief architect for the STARAN and the MPP (Massively Parallel Processor) and an advisor for the ASPRO – ASPRO is a small second generation STARAN used by the Navy in the spy planes. • Professor Batcher is best known architecturally for the MPP, which is at the Smithsonian Institute & currently displayed at a D.C. airport.

67

Today’s SIMDs

• Many SIMDs are being embedded in SISD machines.

• Others are being built as part of hybrid architectures. • Others are being built as special-purpose machines, although some of them could be classified as general purpose.

• Much of the recent work with SIMD architectures is proprietary.

68

A Company Building Inexpensive SIMD

WorldScape

is producing a COTS (commodity off the shelf) SIMD • Not a traditional SIMD as the hardware doesn’t synchronize every step.

– Hardware design supports efficient synchronization • Their machine is programmed like a SIMD.

• The U.S. Navy has observed that their machines process radar an order of magnitude faster than others.

• There is quite a bit of information about their work at http://www.wscape.com

69

An Example of a Hybrid SIMD • Embedded Massively Parallel Accelerators – Systola 1024: PC add on board with 1024 processors – Fuzion 150: 1536 processors on a single chip – Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan (This and next three slides are due to Prabhakar R. Gudla (U of Maryland) at a CMSC 838T Presentation, 4/23/2003.) 70

Hybrid Architecture

• Multiple Systola 1024 boards connected by a high-speed Myrinet switch
• Combines the SIMD and MIMD paradigms within a parallel architecture → a hybrid computer

71

Architecture of Systola 1024

• Instruction Systolic Array:
– 32 × 32 mesh of processing elements
– wavefront instruction execution
• (Board diagram: controller with program memory, RAM NORTH and RAM WEST banks, and an ISA interface to the host computer bus.)

72

SIMDs Embedded in SISDs

• Intel's Pentium 4 includes what they call MMX technology to gain a significant performance boost • IBM and Motorola incorporated the technology into their G4 PowerPC chip in what they call their Velocity Engine. • Both MMX technology and the Velocity Engine are the chip manufacturer's name for their proprietary SIMD processors and parallel extensions to their operating code. • This same approach is used by NVidia and Evans & Sutherland to dramatically accelerate graphics rendering.

73

Special Purpose SIMDs in the Bioinformatics Arena

• Parcel – Acquired by Celera Genomics in 2000 – Products include the sequence supercomputer GeneMatcher, which has a high throughput sequence analysis capability • Supports over a million processors – GeneMatcher was used by Celera in their race with U.S. government to complete the description of the human genome sequencing • TimeLogic, Inc – Has DeCypher, a reconfigurable SIMD 74

Advantages of SIMDs

• Reference: [Roosta, pg 10] • Less hardware than MIMDs as they have only one control unit. – Control units are complex.

• Less memory needed than MIMD – Only one copy of the instructions need to be stored – Allows more data to be stored in memory.

• Less startup time in communicating between PEs.

75

Advantages of SIMDs

• Single instruction stream and synchronization of PEs make SIMD applications easier to program, understand, & debug.

– Similar to sequential programming • Control flow operations and scalar operations can be executed on the control unit while PEs are executing other instructions.

• MIMD architectures require explicit synchronization primitives, which create a substantial amount of additional overhead.

76

Advantages of SIMDs

• During a communication operation between PEs, – PEs send data to a neighboring PE in parallel and in lock step – No need to create a header with routing information as “routing” is determined by program steps.

– the entire communication operation is executed synchronously – A tight (worst case) upper bound for the time for this operation can be computed.

• Less complex hardware in SIMD since no message decoder is needed in PEs – MIMDs need a message decoder in each PE.

77

SIMD Shortcomings

(with some rebuttals) • Claims are from our textbook [i.e., Quinn 2004].

– Similar statements are found in [Grama, et. al]. • Claim 1: Not all problems are data-parallel – While true, most problems seem to have data parallel solutions. – In [Fox, et.al.], the observation was made in their study of large parallel applications that most were data parallel by nature, but often had points where significant branching occurred. 78

SIMD Shortcomings (with some rebuttals)
• Claim 2: Speed drops for conditionally executed branches
– Processors in both MIMD & SIMD machines normally have to do a significant amount of ‘condition’ testing.
– MIMD processors can execute multiple branches concurrently; with SIMDs, only one of these branches can be executed at a time.
– For an if-then-else statement with execution times for the “then” and “else” parts being roughly equal, about ½ of the SIMD processors are idle during its execution.
• With additional branching, the average number of inactive processors can become even higher.
• This reason justifies the study of multiple SIMDs (or MSIMDs).

79

SIMD Shortcomings (with some rebuttals) • Claim 2 (cont): Speed drops for conditionally executed code – In [Fox, et.al.], the observation was made that for the real applications surveyed, the MAXIMUM number of active branches at any point in time was about 8.

– The cost of the extremely simple processors used in a SIMD is extremely low. • Programmers used to worry about ‘full utilization of memory’ but stopped this after memory cost became insignificant overall.

80

SIMD Shortcomings

(with some rebuttals) • Claim 3: Don’t adapt to multiple users well.

– This is true to some degree for all parallel computers. – If usage of a parallel processor is dedicated to an important problem, it is probably best not to risk compromising its performance by ‘sharing’. – This reason also justifies the study of multiple SIMDs (or MSIMDs). – SIMD architecture has not received the attention that MIMD has received and can greatly benefit from further research.

81

SIMD Shortcomings

(with some rebuttals) • Claim 4: Do not scale down well to “starter” systems that are affordable.

– This point is arguable and its ‘truth’ is likely to vary rapidly over time – WorldScape/ClearSpeed currently sells a very economical SIMD board that plugs into a PC. 82

SIMD Shortcomings

(with some rebuttals)
Claim 5: Requires customized VLSI for processors.
• The expense of control units has dropped.
• Reliance on COTS (commodity off-the-shelf) parts has dropped the price of MIMDs.
• The expense of PCs (with control units) has dropped significantly.
• However, reliance on COTS has fueled the success of ‘low level parallelism’ provided by clusters and restricted new innovative parallel architecture research for well over a decade.

83

SIMD Shortcomings

(with some rebuttals) Claim 5 (cont.) • There is strong evidence that the period of continual dramatic increases in speed of PCs and clusters is ending.

• Continued rapid increases in parallel performance in the future will be necessary in order to solve important problems that are beyond our current capabilities • Additionally, with the appearance of the very economical COTS SIMDs, this claim no longer appears to be relevant.

84

Multiprocessors

• Multiprocessor: multiple-CPU computer with a shared memory • Same address on two different CPUs refers to the same memory location • Avoids three cited problems for SIMDs – Can be built from commodity CPUs – Naturally support multiple users – Maintain efficiency in conditional code 85

Centralized Multiprocessor

86

Centralized Multiprocessor

• Straightforward extension of the uniprocessor
• Add CPUs to the bus
• All processors share the same primary memory
• Memory access time is the same for all CPUs
– Uniform memory access (UMA) multiprocessor
– Also called a symmetrical multiprocessor (SMP)

Private and Shared Data

• Private data: items used only by a single processor • Shared data: values used by multiple processors • In a centralized multiprocessor (i.e. SMP), processors communicate via shared data values 88

Problems Associated with Shared Data The cache coherence problem • Replicating data across multiple caches reduces contention among processors for shared data values.

• But - how can we ensure different processors have the same value for same address?

• The cache coherence problem is when an obsolete value is still stored in a processor’s cache.

89

Write Invalidate Protocol
(The most common solution to cache coherency)
1. Each CPU’s cache controller monitors (snoops) the bus & identifies which cache blocks are requested by other CPUs.
2. A PE gains exclusive control of a data item before performing a “write”.
3. Before the “write” occurs, all other copies of the data item cached by other PEs are invalidated.
4. When any other CPU tries to read a memory location from an invalidated cache block,
– a cache miss occurs
– it has to retrieve the updated data from memory
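A minimal sketch (added here, not from the slides) simulating the write-invalidate idea above; it uses a simple write-through policy so that the invalidated reader refreshes from memory, and all names are illustrative:

class Cache:
    def __init__(self):
        self.data = {}                              # address -> value for valid blocks

def read(caches, memory, cpu, addr):
    if addr not in caches[cpu].data:                # cache miss: fetch from memory
        caches[cpu].data[addr] = memory[addr]
    return caches[cpu].data[addr]

def write(caches, memory, cpu, addr, value):
    for i, c in enumerate(caches):                  # invalidate all other cached copies
        if i != cpu:
            c.data.pop(addr, None)
    caches[cpu].data[addr] = value
    memory[addr] = value                            # write-through for simplicity

memory = {"X": 7}
caches = [Cache(), Cache()]
print(read(caches, memory, 0, "X"), read(caches, memory, 1, "X"))   # 7 7
write(caches, memory, 1, "X", 2)        # CPU 1 writes 2; CPU 0's copy is invalidated
print(read(caches, memory, 0, "X"))     # 2 (cache miss, refreshed from main memory)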

Cache-coherence Problem (illustrated)
• Memory holds X = 7; CPUs A and B each have a cache.
• Read from memory is not a problem: both CPU A and CPU B can cache X = 7.
• Write to main memory is a problem: CPU B writes X = 2 while CPU A's cache still holds the obsolete value 7.

Write Invalidate Protocol (illustrated)
• A cache controller snoops the bus to see which cache block is being requested by other processors.
• Before a write can occur, all copies of data at that address are declared invalid ("intent to write X").
• CPU B then writes X = 2; when CPU A next tries to read X from its cache, it receives a cache miss and has to refresh the value from main memory.

Synchronization Required for Shared Data

• Mutual exclusion – Definition: At most one process can be engaged in an activity at any time.

– Example: Only one processor can write to the same address in main memory at the same time.

– We say that process must mutually exclude all others while it performs this write.

• Barrier synchronization – Definition: Guarantees that no process will proceed beyond a designated point (called the

barrier

) until every process reaches that point.

99

Distributed Multiprocessor
• Distributes primary memory among processors
• Increases aggregate memory bandwidth and lowers average memory access time
• Allows a greater number of processors
• Also called a non-uniform memory access (NUMA) multiprocessor
– Local memory access time is fast
– Non-local memory access time can vary
– The distributed memories form one logical address space

Distributed Multiprocessors

101

Cache Coherence

• Some NUMA multiprocessors do not support it in hardware – Only instructions and private data are stored in cache – Policy creates a large memory access time variance • Implementation more difficult – No shared memory bus to “snoop” – Directory-based protocol needed 102

Directory-based Protocol

• Distributed directory contains information about cacheable memory blocks • One directory entry for each cache block • Each entry has – Sharing status – Which processors have copies 103

Sharing Status

• Uncached (denoted by “U”)
– Block not in any processor’s cache
• Shared (denoted by “S”)
– Cached by one or more processors
– Read only
• Exclusive (denoted by “E”)
– Cached by exactly one processor
– Processor has written the block
– Copy in memory is obsolete

Directory-based Protocol: Worked Example (steps 1–26)

• Steps 1–2: Three CPUs, each with its own cache, local memory, and directory, are connected by an interconnection network. X has value 7 in memory, and its directory entry is uncached with an empty bit vector: X U 0 0 0.
• Steps 3–5: CPU 0 reads X. A read miss goes to the directory, the entry becomes shared (X S 1 0 0), and CPU 0's cache receives X = 7.
• Steps 6–8: CPU 2 reads X. Another read miss; the entry becomes X S 1 0 1 and CPU 2's cache also receives X = 7.
• Steps 9–11: CPU 0 writes 6 to X. A write miss goes to the directory, CPU 2's copy is invalidated, the entry becomes exclusive (X E 1 0 0), and CPU 0's cache holds X = 6 while memory still holds the obsolete value 7.
• Steps 12–15: CPU 1 reads X. A read miss; the directory tells CPU 0 to switch the block to shared, memory is updated to X = 6, the entry becomes X S 1 1 0, and CPU 1's cache receives X = 6.
• Steps 16–18: CPU 2 writes 5 to X. A write miss; the copies held by CPU 0 and CPU 1 are invalidated, the entry becomes X E 0 0 1, and CPU 2's cache holds X = 5.
• Steps 19–24: CPU 0 writes 4 to X. A write miss; the directory takes the block away from CPU 2 (its value 5 goes back to memory), the entry becomes X E 1 0 0, CPU 0 creates cache block storage for X, loads X = 5, and then writes X = 4 in its cache.
• Steps 25–26: CPU 0 later writes the block back and flushes it from its cache; memory holds X = 4 and the directory entry returns to uncached (X U 0 0 0).

Characteristics of Multiprocessors

• Interprocessor communication is done in the memory interface by read and write instructions.
• Memory may be physically distributed, and reads and writes from different processors may take different times; congestion of the interconnection network may occur.
• Memory latency (i.e., time to complete a read or write) may be long and variable.
• Most messages through the bus or interconnection network are the size of single memory words.
• Randomization of requests may be used to reduce the probability of collisions.

Multicomputers

• Distributed memory multiple-CPU computer • Same address on different processors refers to different physical memory locations • Processors interact through message passing 132

Typically, Two Flavors of Multicomputers

• Commercial multicomputers
– Custom switch network
– Low latency (the time it takes to get a response from something)
– High bandwidth (data path width) across processors
• Commodity clusters
– Mass-produced computers, switches, and other equipment
– Use low-cost components
– Message latency is higher
– Communications bandwidth is lower

Multicomputer Communication

• Processors are connected by an interconnection network
• Each processor has a local memory and can only access its own local memory
• Data is passed between processors using messages, as dictated by the program
• Data movement across the network is also asynchronous
– A common approach is to use MPI to handle message passing

Multicomputer Communications (cont)
• Multicomputers can be scaled to larger sizes much more easily than multiprocessors.
• The amount of data transmitted between processors has a huge impact on performance
– The distribution of the data among the processors is a very important factor in performance efficiency.

135

Message-Passing Advantages

• No problem with simultaneous access to data. • Allows different PCs to operate on the same data independently.

• Allows PCs on a network to be easily upgraded when faster processors become available.

136

Disadvantages of Message-Passing

• Programmers must make explicit message-passing calls in the code • This is low-level programming and is error prone.

• Data is not shared but copied, which increases the total data size.

• Data Integrity – Difficulty in maintaining correctness of multiple copies of data item. 137

Some Interconnection Network Terminology (1/2)

References: Wilkinson, et al. & Grama, et al. Also, earlier slides on architecture & networks.

• A link is the connection between two nodes.

• A switch that enables packets to be routed through the node to other nodes without disturbing the processor is assumed.

• The link between two nodes can be either bidirectional or use two directional links .

• Either one wire to carry one bit or parallel wires (one wire for each bit in word) can be used.

• The above choices do not have a major impact on the concepts presented in this course.

138

Network Terminology (2/2)

• The bandwidth is the number of bits that can be transmitted in unit time (i.e., bits per second).
• The network latency is the time required to transfer a message through the network.
– The communication latency is the total time required to send a message, including software overhead and interface delay.
– The message latency or startup time is the time required to send a zero-length message.
• It includes software & hardware overhead, such as
– choosing a route
– packing and unpacking the message

Circuit Switching Message Passing

• Technique establishes a path and allows the entire message to transfer uninterrupted.

• Similar to telephone connection that is held until the end of the call.

• Links used are not available to other messages until the transfer is complete.

• Latency (message transfer time): If the length of the control packet sent to establish the path is small wrt (with respect to) the message length, the latency is essentially the constant L/B, where L is the message length and B is the bandwidth.
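As an illustrative calculation (added here, not from the slides): sending an L = 8 × 10^6 bit (1 MB) message over a link with bandwidth B = 10^9 bits/sec gives a transfer time of roughly L/B = 8 milliseconds, regardless of how many links the established circuit traverses.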

Store-and-forward Packet Switching

• Message is divided into “packets” of information • Each packet includes source and destination addresses.

• Packets can not exceed a fixed, maximum size (e.g., 1000 byte).

• A packet is stored in a node in a buffer until it can move to the next node.

141

Packet Switching (cont)

• At each node, the destination information is looked at and used to select which node to forward the packet to.

• Routing algorithms (often probabilistic) are used to avoid hot spots and to minimize traffic jams.

• Significant latency is created by storing each packet in each node it reaches.

• Latency increases linearly with the length of the route.

142

Virtual Cut-Through Packet Switching

• Used to reduce the latency.

• Allows packet to pass through a node without being stored, if the outgoing link is available.

• If the complete path is available, a message can immediately move from source to destination.

143

Wormhole Routing

• Alternate to store-and-forward packet routing • A message is divided into small units called

flits

(flow control units).

• Flits are 1-2 bytes in size.

• Can be transferred in parallel on links with multiple wires.

• Only the head flit is initially transferred when the next link becomes available.

144

Wormhole Routing (cont)

• As each flit moves forward, the next flit can move forward.

• The entire path must be reserved for a message as these packets pull each other along (like cars of a train).

• Request/acknowledge bit messages are required to coordinate these pull-along moves. – See Wilkinson, et. al.

• The complete path must be reserved, as these flits are linked together.

• Latency: If the header is very small compared to the length of the message, then the latency is essentially the constant L/B, with L the message length and B the link bandwidth.

145

Deadlock

• Routing algorithms are needed to find a path between the nodes.
• Adaptive routing algorithms choose different paths, depending on traffic conditions.
• Deadlock: No packet can be forwarded because each is blocked by other stored packets waiting to be forwarded.
• Livelock is a deadlock-type situation where a packet continues to go around the network without ever reaching its destination.

146

Asymmetric Multicomputers

• Has a front-end that interacts with users and I/O devices.

– Processors in back end are used for computation. – Similar to SIMDs (or processor arrays) – Common with early multicomputers • Examples of asymmetrical multicomputers given in textbook.

147

Asymmetrical MC Advantages

• Back-end processors dedicated to parallel computations → easier to understand, model, and tune performance
• Only a simple back-end operating system needed → easy for a vendor to create

Asymmetrical MC Disadvantages • Front-end computer is a single point of failure • Single front-end computer limits scalability of system • Primitive operating system in back-end processors makes debugging difficult • Every application requires development of both front-end and back-end program 149

Symmetric Multicomputers

• Every computer executes the same operating system and has identical functionality. • Users may log into any computer to edit or compile their programs.

• Any or all computers may be involved in the execution of their program.

• During execution of programs, every PE executes the same program. • When only one PE should execute an operation, an “if” statement is used to select the PE.

150

Symmetric Multicomputers

151

Symmetrical MC Advantages

• Alleviate performance bottleneck caused by single front-end computer • Better support for debugging • Every processor executes same program 152

Symmetrical MC Disadvantages • More difficult to maintain illusion of single “parallel computer” • No simple way to balance program development workload among processors • More difficult to achieve high performance when multiple processes on each processor – Details on next slide 153

Symmetric MC Disadvantages (cont) • (cont.) More difficult to achieve high performance when multiple processes on each processor – Processes on same processor compete for same resources • CPU Cycles • Cache space • Memory bandwidth – Increased cache misses • Cache is “PE oriented” instead of “process oriented” 154

ParPar Cluster, A Mixed Model

• Mixed model • Incorporates both asymmetrical and symmetrical designs.

155

A Commodity Cluster vs Network of Workstations

• A commodity cluster contains components of local area networks – Commodity computers – Switches • A network of workstations is a dispersed collection of computers – Distributed heterogeneous computers – Located on primary users’ desks – Unused cycles available for parallel use

Best Model for Commodity Cluster

• Full-Fledged operating system (e.g., Linux) desirable – Feature of symmetric multicomputer • Desirable to increase cache hits – Favors having only a single user process on each PE – Favors most nodes being off-limits for program development • Need fast network – Keep program development users off networks.

– Access front-end by another path.

• Overall, a mixed model may be best for commodity clusters 157

Ideal Commodity Cluster Features

• Co-located computers • Dedicated to running parallel jobs • No keyboards or displays • Identical operating system • Identical local disk images • Administered as an entity 158

Network of Workstations

• Dispersed computers – Typically located on user’s desks • First priority: person at keyboard • Parallel jobs run in background • Different operating systems • Different local images • Checkpointing and restarting important • Typically connected by ethernet – Too slow for commodity network usage 159

Summary

• Commercial parallel computers appeared in 1980s • Multiple-CPU computers now dominate • Small-scale: Centralized multiprocessors • Large-scale: Distributed memory architectures – Multiprocessors – Multicomputers 160