Transcript Topic 1 (9/6/2006)

Topic 1
Parallel Architecture Lingo & Commonly Used Terms

"Complexity is just the lack of knowledge"

Reading List

• Slides: Topic1x
• Culler & Singh: Chapter 1, Chapter 2
• Other assigned readings from homework and classes

Parallel Computer Lingo

• Massive Parallel Processor Computer
• Amdahl's Law
• Flynn's Taxonomy
• (Cache coherent) Non Uniform Memory Space
• Dance Hall Memory Systems
• Parallel Programming Models
• Parallelism and Concurrency
• Distributed Memory Machines and Beowulfs
• Super Scalar Computers
• Symmetric Multi Processor Computer
• Vector Computers and Processing
• Class A and B applications
• (Memory) Consistency
• Very Long Instruction Word Computers
• (Memory) Coherency

The Mystifying Walls

• Power Wall
  – Power dissipation, consumption, and leakage of current processors
• Memory Wall
  – A thousand-cycle delay to DRAM!
  – Memory hierarchies and non-blocking memory transfers
• Frequency Wall
  – Increasing pipeline depths
  – Diminishing returns, even negative when the power wall is taken into account

Parallelism and Concurrency

Concurrency
"Concurrency occurs when two or more execution flows are able to run simultaneously." – Edsger Dijkstra
A property of execution flows.

Parallelism
"The maximum number of independent subtasks in a given task at a given point in its execution." – The Free On-line Dictionary of Computing
A state of execution flows.

Parallel Computing: division of tasks in parallel computers and its side effects (i.e. memory consistency).
Concurrent Computing: interactions between tasks in a concurrent system (i.e. signal handling).

Types of Parallelism

From the application level down to the hardware level:

• Job Level – inter- and intra-job parallelism
• Task Level – loop-based and function-based parallelism
• ISP / TLP – between streams or threads of execution
• Instruction Level Parallelism – inter- and intra-instruction parallelism
• Arithmetic and Bit Level – within ALU units

Job Level Parallelism

Orchestrator: OS, Programmer
A job is a unit of work.
Overlap of jobs with other CPU activities. Example: I/O retrieval.

Inter-Job Parallelism: switch between jobs when "processing" a lengthy I/O request.
  Time 1: CPU runs Job 1, I/O serves Job 2
  Time 2: CPU runs Job 2, I/O serves Job 1

Intra-Job Parallelism: compute an independent calculation while waiting for a DMA transfer.
Requirements: duplicated resources.

Examples

A common memory shared by a computational processor (Stage 1) and an I/O processor (Stage 2):
  Time 1: Stage 1 = Job 1, Stage 2 = Job 2
  Time 2: Stage 1 = Job 2, Stage 2 = Job 1
  Time 3: Stage 1 = Job 1, Stage 2 = Job 3
  Time 4: Stage 1 = Job 3, Stage 2 = Job 1

Intra-job example (overlapping a DMA transfer):
  Job Phase 1: Get args for segment 3 ...
  Job Phase 2: Calculate x = a * y + z ...
  Job Phase 3: Print x ...
  Job Phase 4: args = args + x ...
  Timeline: JP 1, JP 2 and JP 3 execute while the DMA transfer is in flight (DMA starts ... DMA ends); JP 4 runs after the DMA ends.

Task Level Parallelism

Orchestrator: Compiler, Programmer, OS
Task Level == Program Level
• Among different code sections
• Function calls or other abstractions of code (like code blocks)
• Different iterations of the same loop
• Data dependency and program partitioning

Thread Level Parallelism

Thread: a sequence of instructions which has a single PC (program counter). Designed to run different parts of a program on different processors to take full advantage of parallelism.

Orchestrator: Programmer or Compiler

Instruction Level Parallelism

• Between instructions
  – Independent instructions running on hardware resources
  – Assumption: there is more than one hardware resource (ALU, adder, multiplier, loader, MAC unit, etc.)
  – Data dependency
• Between phases of instructions
  – Example: instruction pipelines

Principles of Pipelining

[Figure: an N-stage pipeline; successive tasks enter one cycle apart, so after the pipeline fills, all stages operate on different tasks in the same cycle.]

Typical RISC pipeline: Instruction Fetch, Instruction Decode, Execute, Memory Op, Register Update

Example

Stage 1: Instruction Fetch, Stage 2: Instruction Decode, Stage 3: Execute, Stage 4: Memory Op, Stage 5: Register Update

  Stage |  T0 |  T1 |  T2 |  T3 |  T4 |  T5
    1   |  I1 |  I2 |  I3 |  I4 |  I5 |  I6
    2   |     |  I1 |  I2 |  I3 |  I4 |  I5
    3   |     |     |  I1 |  I2 |  I3 |  I4
    4   |     |     |     |  I1 |  I2 |  I3
    5   |     |     |     |     |  I1 |  I2

where Tn is time n and In is instruction number n.
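To make the fill/drain behavior above concrete, here is a small illustrative C sketch (not from the slides) using the standard idealized timing model: n instructions on a k-stage pipeline finish in k + n - 1 cycles, versus k * n cycles without pipelining, assuming one instruction enters per cycle and no hazards.

#include <stdio.h>

/* Sketch: idealized pipeline timing (one instruction issued per cycle,
 * no hazards or stalls). Not taken from the slides. */
int main(void) {
    int k = 5;                        /* pipeline stages (IF, ID, EX, MEM, WB) */
    for (int n = 1; n <= 100; n *= 10) {
        int pipelined   = k + n - 1;  /* cycles with pipelining */
        int unpipelined = k * n;      /* cycles without pipelining */
        printf("n=%3d  pipelined=%4d  unpipelined=%4d  speedup=%.2f\n",
               n, pipelined, unpipelined, (double)unpipelined / pipelined);
    }
    return 0;
}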

Pipeline Hazards

• Any condition that delays, disrupts or prevents the smooth flow of "tasks" in the pipeline system.
• Detection and resolution are major aspects of pipeline design.
• Types of hazards: Structural, Control, and Data.

Types of [Parallel] Computers

• Flynn's Taxonomy (1972)
  – Classification based on how instruction streams operate on data streams
  – A stream is a collection of items (instructions or data) being executed or operated on by the processor
• Types of computers: SISD, SIMD, MISD and MIMD

SISD

• Single Instruction, Single Data
• One processor is used
• Pipelining is applicable
• Serial scalar computer
• Example: Intel 4004

[Figure: Intel 4004 block diagram, courtesy of Wikipedia. I/O feeds a CU that issues an IS to a PU, which exchanges a DS with the MU.]

Captions: CU = Control Unit, PU = Processing Unit, MU = Memory Unit, IS = Instruction Stream, DS = Data Stream, PE = Processing Element, LM = Local Memory

SIMD

• Single Instruction, Multiple Data
• Single (main) processor
• Vector operations: one vector instruction = many machine instructions on the data stream
• Implemented as pipelined vector units or an array of processors
• Examples: CRAY-1, ILLIAC-IV and ICL DAP

[Figure: Cray 1, courtesy of Wikipedia. Block diagram: the program is loaded from the host into the CU, which broadcasts the IS to PE 1 ... PE n; each PE exchanges a DS with its local memory LM 1 ... LM n; data sets are loaded from the host.]

MISD

• Multiple Instruction, Single Data
• Multiple processors
• A single data stream is operated on by multiple instruction streams
• Pipeline architectures and systolic arrays
• Example: Colossus Mark II (1944)

[Figure: a memory (program and data) feeds a single DS through a chain of CU/PU pairs (CU 1/PU 1, CU 2/PU 2, ...), each CU issuing its own IS.]

MIMD

• Multiple Instruction, Multiple Data
• Multiple processors
• Multiple instruction streams operate on multiple data streams
• Multiprocessors and multicomputers fall into this category

[Figure: Red Storm, courtesy of Cray Inc. Block diagram: CU 1 ... CU n issue instruction streams to PU 1 ... PU n, whose data streams go to a shared memory; I/O attaches to the control units.]

Why Flynn's Taxonomy is not enough

• Too broad (especially for MIMD)
• Relation with ILP?
• Shared memory models and distributed memory models?
• Relation with parallel programming models?

Another Classification: ILP Architectures

• Multiple instruction issue plus deep pipelining
• Superscalar
  – Dynamic instruction dispatching
  – More than one instruction per cycle
  – Intel Pentium 4
• Very Long Instruction Word
  – Static instruction arrangement
  – More than one instruction per cycle
  – Texas Instruments C68 DSP processors
• Superpipelined
  – MIPS 4000/8000
  – DEC Alpha
• Hybrid

More Classifications Structural Composition

• Vector Computer / Processor Array – Serial Processor connected to a series of synchronized processing elements (PE) • Multiprocessors – Multiple CPU computer with shared memory – Centralized Multiprocessor • A group of processors sharing a bus and the same physical memory • Uniform Memory Access (UMA) • Symmetric Multi Processors (SMP) 9/6/2006 eleg652-010-06F 22

More Classifications Structural Composition

– Distributed Multiprocessors • Memory is distributed across several processors • Memory forms a single logical memory space • Non-uniform memory access multiprocessor (NUMA) • Multicomputers – Disjointed local address spaces for each processor – Asymmetrical Multi computers • Consists of a front end (user interaction and I/O devices) and a back end (parallel tasks) 9/6/2006 eleg652-010-06F 23

More Classifications: Structural Composition

– Symmetrical Multicomputers
  • All components (computers) have identical functionality
  • Clusters and networks of workstations
• Massive Parallel Processing Machines (MPP)
  – Any combination of the above classes in which the number of (computational) components is on the order of hundreds or even thousands
  – Tightly coupled with fast interconnect networks
  – All [current] supercomputers fall in this range

The Software Dilemma

• OS and application software depend on the programming model
• Programming model: a contract between software and hardware
• Many programming models due to many types of architectures
• Dilemma: many directions; how to unify them?

Topic 1a: In Depth Examples
SIMD Computers of Yesteryears

Vector Processing Core

[Figure: a multi-port memory system feeds stream A and stream B into a pipelined adder (or a processor array), producing stream C = A + B.]

A vector addition is possible when two vectors (data streams) are processed by vector hardware, be it a vector adder or a processor array.

ILLIAC IV

Data Sheet
  Best known for: one of the biggest failures in supercomputing history
  Manufacturer: Burroughs / University of Illinois
  Number of computers produced: 1
  First introduced: 1974
  Type: SIMD / array of processors

Specifications
  Memory size: [back end] 2048 64-bit words per processor
  Clock speed: [back end] 13 MHz per element
  Cost: 31 million dollars
  Projected: 10^9 ops/sec & 256 processing elements with 4 control units
  Achieved: 200 x 10^6 ops/sec & 64 processing elements with 1 control unit

ILLIAC IV

Functions of the control unit on the front end:
  - store the user program
  - decode all instructions and determine where they are to be executed
  - execute scalar instructions
  - broadcast vector instructions

Functions of the processing elements in the back end:
  - perform the same function in lock-step
  - masking scheme
  - data routing

Function of the interconnection network:
  - communication between PEs (data exchanges)

ILLIAC IV Configuration 1

[Figure: the front end holds the CU, its memory and an I/O processor; the back end is an array of processor/memory (P/M) pairs tied together by an interconnection network. A scalar memory bus, an instruction broadcast bus and a global result bus connect the front end to the back end.]

BSP Configuration 2

[Figure: the same front end (CU, memory, I/O processor) drives x processors P1 ... Px, which reach y memory banks M1 ... My through an alignment network, again via the scalar memory bus, instruction broadcast bus and global result bus.]

ILLIAC Mesh

[Figure: the 64 processing elements (numbered 0-63) arranged in an 8 x 8 mesh, with PE i connected to PEs i±1 and i±8 (mod 64).]

BSP Dataflow

[Figure: 17 memory banks are connected to 16 processors through an input alignment network and an output alignment network.]

The AMT DAP 500

[Figure: a host connection unit links the user interface and host code memory to the master control unit; each processor element has O (accumulator), A (activity control), C (carry) and D (data) registers, a fast data channel, and 32K bits of array memory.]

AMT DAP 500 and DAP 510: 64 by 64 one-bit PEs in a 2D mesh

SIMD across the Years

(a) Multivector track: CDC 7600 (CDC, 1970), CDC Cyber 205 (Levine, 1982), ETA 10 (ETA, Inc., 1989), Cray 1 (Russell, 1978), Cray Y-MP (Cray Research, 1989), Cray/MPP (Cray Research, 1993), Fujitsu, NEC and Hitachi models

(b) SIMD track: Illiac IV (Barnes et al., 1968), DAP 610 (AMT, Inc., 1987), Goodyear MPP (Batcher, 1980), BSP (Kuck and Stokes, 1982), CM2 (TMC, 1990), MasPar MP1 (Nickolls, 1990), IBM GF/11 (Beetem et al., 1985)

Topic 1b: In Depth Examples
MIMD Computers of Today

Generic MIMD Architecture

• A generic modern multiprocessor: nodes connected by a scalable network
• Node: processor(s) (P), cache ($), memory (Mem), plus a communication assist (CA)
  – Network interface and communication controller

Classification

• Shared Memory (Multiprocessors)
  – Single logical memory space
  – Advantages: simple communication and programmability
  – Problems: scalability, consistency and coherency, synchronization
• Distributed Memory (Multicomputers)
  – Distributed memory space
  – Advantages: scalability
  – Disadvantages: communication

Distributed Memory MIMD Machines
(Multicomputers, MPPs, clusters, etc.)

• Message passing programming models
• Interconnect networks
• Generations / history:
  – 1983-87: COSMIC CUBE, iPSC/I, II, software routing
  – 1988-92: mesh-connected (hardware routing), Intel Paragon
  – 1993-99: CM-5, IBM-SP
  – 1996-  : clusters

Concept of Message-Passing Model

[Figure: process P issues "Send X, Q, t" from its local address space; process Q issues a matching "Receive Y, P, t" into its own local address space; the pair moves the data from address X to address Y.]

• Send specifies the buffer to be transmitted and the receiving process
• Recv specifies the sending process and the application storage to receive into
• Memory-to-memory copy, but need to name processes
• In the simplest form, the send/recv match achieves a pairwise synchronization event
• Programming: MPI, etc.
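As a concrete illustration of the send/recv pairing described above, here is a minimal MPI sketch in C (MPI is the API named on the slide); the buffer size, tag value and rank numbers are arbitrary choices for the example.

#include <mpi.h>
#include <stdio.h>

/* Minimal matched send/recv pair: rank 0 -> rank 1 (run with >= 2 ranks). */
int main(int argc, char **argv) {
    int rank, x[4] = {1, 2, 3, 4}, y[4];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* "Send X, Q, t": buffer x, destination rank 1, tag 99 */
        MPI_Send(x, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* "Receive Y, P, t": storage y, source rank 0, matching tag 99 */
        MPI_Recv(y, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n", y[0], y[1], y[2], y[3]);
    }
    MPI_Finalize();
    return 0;
}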

Evolution of Message-Passing Machines

• Early machines: FIFO on each link
  – Hardware close to the programming model
  – Enabling non-blocking ops
  – Buffered by the system at the destination until recv
• Diminishing role of topology
  – Store-and-forward routing: topology important
  – Introduction of pipelined routing made it less so
  – Cost is in the node-network interface
  – Simplifies programming

[Figure: an 8-node hypercube, nodes labeled 000-111.]

Example: IBM SP-2

[Figure: an SP2 node contains a Power2 CPU and L2 cache on a memory bus, a memory controller with 4-way interleaved DRAM, and a Micro Channel bus carrying I/O, an i860-based DMA engine and the network interface (NI). 64-node configuration shown. All pictures courtesy of IBM Research.]

RS6000 / Power2 SuperChip: superscalar RISC chip

Example: Intel Paragon

[Figure: each node has two i860 processors with L1 caches on a 64-bit, 50 MHz memory bus, a memory controller with 4-way interleaved DRAM, and a driver/DMA engine feeding the NI. Nodes are joined by a 2D mesh interconnect, 8 bits wide at 175 MHz, bidirectional. Picture: Sandia's Intel Paragon XP/S based supercomputer.]

Specs:
  Memory per node: 128 MiB
  Max memory: 128 GiB
  Number of nodes: 64 to 6,768
  Cycle: 20 ns
  Maximal performance: 300 GFLOPS

The MANNA Multiprocessor Testbed

[Figure: clusters are joined through crossbar hierarchies. Each node holds two i860XP compute processors (CP), I/O, 32 Mbyte of memory, and a network interface with 8-bit links; a cluster is a set of nodes connected by a crossbar.]

CM-5: Scalable Massively Parallel Supercomputer for the 1990's

• 10^12 floating-point operations per second (TeraFLOPS)
• 64,000 powerful RISC microprocessors working together
• Scalable: performance grows transparently
• Universal: supports a vast variety of application domains
• Highly reliable: sustained performance for large jobs requiring weeks/months to run

[Picture: NAS Thinking Machines CM-5, 1993.]

Distributed Memory MIMD

• Advantages
  – Less contention
  – Highly scalable
  – Simplified synchronization
  – Message passing = synchronization + communication
• Disadvantages
  – Load balancing
  – Deadlock / livelock prone
  – Waste of bandwidth
  – Overhead of small messages

Shared Memory MIMD

• Logical or physical shared memory
• Low cost interconnect
  – Shared fabric (bus)
  – Switched packet fabric (crossbar)
• Logical single address space
• PSM [Physical Shared Memory] and DSM [Distributed Shared Memory]

[Figure: processors attached through an interconnect (IC) to a single memory, vs. processors each with a local memory joined by the IC.]

Shared-Memory Execution & Architecture Models

• Uniform-memory-access model (UMA)
• Non-uniform-memory-access model (NUMA) without caches (BBN, Cedar, Sequent)
• COMA (Kendall Square KSR-1, DDM)
• CC-NUMA (DASH)
• Symmetric vs. asymmetric MPs: symmetric MPs (SMPs) vs. asymmetric MPs (some master, some slave)
• Programming: OpenMP, etc.

[Figure: an SMP in which processors share memory modules over an IC, vs. a machine in which each processor/memory pair attaches to the IC.]

Shared Address Space Model
(e.g. pthreads)

• Process: virtual address space plus one or more threads of control
• Portions of the address spaces of processes are shared
• Writes to a shared address are visible to the other threads
• Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization

[Figure: the virtual address spaces of processes P0, P1, P2, Pn each contain a private portion and a shared portion; loads and stores to the shared portion map to common physical addresses in the machine's physical address space.]

Shared Address Space Model
(Another View)

• System V IPC: a main structure contains all pages in the shared memory; memory segments are given contiguously.
• POSIX IPC: memory is defined as a file system, and each page can be accessed as a file or as a group of files.

[Figure: a logical view in which several processes, each with its own thread stacks and data segments, map the same shared-memory segments (S) into their address spaces.]

Shared Address Space Architectures

• Any processor can directly reference any memory location (communication is implicit)
• Convenient:
  – Location transparency
  – Similar programming model to time-sharing on uniprocessors
• Popularly known as shared memory machines or model
• Ambiguous: memory may be physically distributed among processors

Shared-Memory Parallel Computers

(late 90’s –early 2000’s)

• SMPs (Intel-Quad, SUN SMPs) • Supercomputers • Cray T3E • Convex 2000 • SGI Origin/Onyx • Tera Computers 9/6/2006 eleg652-010-06F 52

Scaling Up

[Figure: (a) a "dance hall" organization in which all processors reach a bank of memories through the network; (b) a distributed-memory organization in which each processor has a local memory and the network connects the nodes.]

– Interconnect: cost (crossbar) or bandwidth (bus)
– Dance hall: bandwidth still scalable, but lower cost
– Distributed memory or non-uniform memory access (NUMA)
– Caching shared (particularly nonlocal) data?

Shared Memory Architecture Examples
(2000 - now)

• Sun's Wildfire Architecture (Henn. & Patt., section 6.11, page 622)
• Intel Xeon Multithreaded Architecture
• SGI Onyx-3000
• IBM p690
• Others

Example: SUN Enterprise

[Figure: CPU/memory cards (two processors with their caches, an L2 per processor, and a memory controller) and I/O cards attach through bus interfaces/switches to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]

• 16 cards of either type: processors + memory, or I/O
• All memory is accessed over the bus, so the machine is symmetric
• Higher bandwidth, higher latency bus

SUN FIRE E25K

[Figure: shared-memory expander boards, each carrying processors and I/O boards.]

• 4 CPUs per board: 1.35 GHz UltraSPARC IV with 32 KB I-cache and 64 KB D-cache, 72 CPUs total
• 32 GB memory per board
• Crossbar switch: 115.2 (peak) / 43.2 (sustained) GB/s bandwidth

SUN FIRE 15K

• 4 CPUs per board: 900 MHz UltraSPARC with 32 KB I-cache and 64 KB D-cache
• 32 GB memory per board
• Crossbar switch: 43 GB/s bandwidth

Intel Xeon MP based server (new)

[Figure: four Xeon processors connect through the Northbridge (NB) to four XMB memory controllers and to an I/O hub with 8x PCI-X bridges.]

• 4x 64-bit, 3.6 GHz Xeon with 1 MB L2 cache and 8 MB L3 cache
• Each pair of processors shares a common bus of 5.3 GB/s bandwidth
• Memory is connected via 4 controllers with 5.3 GB/s bandwidth each

Intel Xeon MP based server (old)

[Figure: four Xeon processors share a front-side bus to the memory control hub, which connects to memory and to an I/O hub with PCI-X bridges.]

• 1.8 GHz Xeon with 512 KB L2 cache
• 4 processors share a common bus of 6.4 GB/s bandwidth
• Memory shares a common bus of 4.3 GB/s bandwidth
• Memory is accessed through a memory control hub

IBM p690

[Figure: each POWER4 chip holds two 1 GHz cores with L1 I/D caches, a shared L2 cache, an L3 controller and a distributed switch; the L3 cache and memory sit on the processor local bus next to the I/O bus.]

• Each POWER4 chip has two 1 GHz processor cores, a shared 1.5 MB L2, direct access to 32 MB/chip of L3, and chip-to-chip communication logic
• Each SMP building block has 4 POWER4 chips
• The base p690 has up to 4 SMP building blocks

SGI Onyx 3800

[Figure: C-Bricks, each with processors, caches and shared memory, connected through an R-Brick.]

• Each node is called a C-Brick, with 2-4 processors of 600 MHz
• An R-Brick is an 8 by 8 crossbar switch of 3.2 GB/s bandwidth: 4 ports for C-Bricks, 4 for other R-Bricks
• Each C-Brick has up to 8 GB of local memory that can be accessed by all processors by way of the NUMAlink interconnect

Example: Cray T3E

[Figure: each node has a processor and cache, a memory controller and network interface, local memory, and an X/Y/Z switch; external I/O attaches to the network.]

• Scales up to 1024 processors, 480 MB/s links
• The memory controller generates communication requests for nonlocal references
• No hardware mechanism for coherence (SGI Origin etc. provide this)

Multithreaded Execution and Architecture Models

• Dataflow models
  – Dependent on the data for execution
• Control-flow based models
  – Dependent on the logic of the application for the use of resources
• Hybrid models
  – Concepts of tokens associated with both data and logic

Multithreaded Execution and Architecture Models

• "Time sharing" of one instruction processing unit, in a pipelined fashion, by all instruction streams
• Used by the Denelcor Heterogeneous Element Processor supercomputer
• Used in Hyper-Threading technology (Intel's HT) and Simultaneous Multi-Threading (IBM's SMT and in general)
• Key: logically or physically replicated and independent hardware resources (i.e. control unit, register files, etc.)

The Denelcor HEP (1982)

[Figure: each PE pipelines instructions from many process status words (stages S1-S8, PSW++), up to 50 processes per PE; 16 PEs (PE 1 ... PE 16) and 128 data memory modules (DMM 1 ... DMM 128) are joined by a packet switch network.]

• Many instruction streams, single P-unit
• 16 PEM + 128 DMM: 64 bits/DMM
• Packet-switching network
• I-stream creation is under program control
• 50 I-streams
• Programmability: SISAL, Fortran

Tera MTA (1990)

• A shared-memory LIW multiprocessor
• 128 fine-grained threads, with 32 registers each, to tolerate FU, synchronization and memory latency
• Explicit-dependence lookahead increases single-thread concurrency
• Synchronization uses full/empty bits

Shared Memory MIMD

• Advantages
  – No partitioning
  – No (explicit) data movement
  – Minor modifications (or none at all) of toolchains and compilers
• Disadvantages
  – Synchronization
  – Scalability
    • High-throughput, low-latency network
    • Memory hierarchies
    • DSM

Future Trend of MIMD Computers

• Program execution models : beyond the SPMD model • Hybrid architecture: provide both shared memory and message-passing • Efficient mechanism for latency AND bw management –called the “memory-wall” problem 9/6/2006 eleg652-010-06F 69

Side Note 1: The SPMD Model

• Single Program, Multiple Data streams
• A single piece of code is run by multiple processors, each operating on a different data stream
• Used as a programming model for shared memory machines
• Used by OpenMP, etc. (see the sketch below)
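A minimal SPMD sketch using OpenMP (named on the slide): every thread runs the same code, and each selects its own slice of the data from its thread id. The array size and the block partitioning are illustrative choices.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    #pragma omp parallel             /* the same code runs on every thread */
    {
        int id  = omp_get_thread_num();
        int nth = omp_get_num_threads();
        /* Each thread picks its own data stream: one block of the arrays. */
        int lo = (int)((long)N * id / nth);
        int hi = (int)((long)N * (id + 1) / nth);
        for (int i = lo; i < hi; i++)
            a[i] = 2.0 * b[i] + 1.0;
    }
    printf("done: a[0] = %f\n", a[0]);
    return 0;
}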

Recent High-End MIMD Parallel Architecture Projects

• ASCI Projects (USA) • ASCI Blue • ASCI Red • ASCI Blue Mountains • HTMT Project (USA) • The Earth Simulator (Japan) • IBM BG/L Architecture • HPCS architectures (USA) • IBM Cyclops-64 Architecture • Japan and others 9/6/2006 eleg652-010-06F 71

Topic 1c: Amdahl's Law
The Reason to Study Parallelism

Amdahl's Law

"When the fraction of serial work in a given problem is small, say s, the maximum speedup obtainable (from even an infinite number of processors) is only 1/s."

Amdahl, G.: "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS Conference Proceedings 30, 1967, pp. 483-485.

Imagine a Fixed Problem Size...

Serial execution: Time = 1. The application consists of a serial part S and a parallel part P, with S + P = 1.
Parallel execution on N processors: Time = S + P/N.

Speedup = (S + P) / (S + P/N) = 1 / (S + P/N)

With N = 1024 processors, the curve has an "unforgiving steepness" (Gustafson):

  Serial fraction S:   1%    2%    3%    4%
  Speedup:             91x   48x   31x   24x
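The fixed-size speedup formula can be evaluated directly; this small C sketch computes 1 / (S + P/N) for N = 1024 and the serial fractions listed above, giving values close to the table (differences of a unit come from rounding).

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / (S + P/N), with S + P = 1. */
static double amdahl(double s, double n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    double n = 1024.0;
    for (double s = 0.01; s <= 0.04 + 1e-9; s += 0.01)
        printf("serial fraction %.0f%% -> speedup %.1fx\n", 100.0 * s, amdahl(s, n));
    return 0;
}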

Revised Amdahl's Law

Problem size grows with the number of processors.

"As a first approximation, we have found that it is the parallel or vector part of a program that scales (grows) with the problem size. Time for vector start-up, program loading, serial bottlenecks, and I/O that make up the S component of the run does not grow as rapidly with problem size."

Imagine a Growing Problem Size...

Parallel execution on N processors: Time = 1, made up of a serial part S and a parallel part P, with S + P = 1.
Serial execution of the same (scaled) problem: Time = S + N * P.

Scaled speedup = (S + P * N) / (S + P) = S + P * N
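A companion sketch for the scaled speedup: with S + P = 1 measured on the parallel run, the scaled speedup S + P*N grows nearly linearly in N instead of saturating at 1/S. The serial fraction used here is an arbitrary example.

#include <stdio.h>

/* Scaled (Gustafson-style) speedup: (S + P*N) / (S + P) = S + P*N, with S + P = 1. */
int main(void) {
    double s = 0.01;                     /* serial fraction of the scaled run */
    for (int n = 1; n <= 1024; n *= 4)
        printf("N = %4d  scaled speedup = %8.2f\n", n, s + (1.0 - s) * n);
    return 0;
}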

Topic 1d: Interconnection Networks
How do you talk to others?

A Digital Computer System

A computer system combines logic and memory; moving data between them (communication) is the limiting factor for performance.

Interconnect (I.C.) aspects: hardware technology, topology, flow control, routing, bandwidth, latency, QoS, number of terminals.

A Generic Computer: The MIMD Example

Small-size interconnect networks: buses and [one-level] crossbars connecting processors, caches and memory through a communication assist.
Full-featured interconnect networks: packet switching fabrics. Key: a scalable network.

Objective: make efficient use of scarce communication resources – providing high-bandwidth, low-latency communication between nodes with minimum cost and energy.

I.C. Terminology

• Shared medium: only one message at any point in time
• Switched medium: point-to-point communication
• Message: a piece of datum that needs to be moved from point A to point B
• Packet: "A sequence of data, with associated control information, that is switched and transmitted as a whole" [Black Box Network Services' Pocket Glossary of Computer Terms]
• Flits: flow control digits; divisions of packets. Store-and-forward and cut-through switching
• Phits: physical digits

I.C. Terminology

• Topology: the connection pattern between communication nodes (i.e. switches, NICs, routers, etc.)
  – Analogy: a roadmap
• Routing: the discovery of a communication path between two computing nodes
  – Analogy: a car's route
  – Factors: load balancing and path length
• Flow control: message access patterns to a particular network resource
  – Analogy: traffic lights, left-turn-only lanes, etc.
• Livelocks and deadlocks
  – [Car] analogy: getting lost and gridlock

I.C. Terminology

• Direct topology: the ratio between switch nodes and processor nodes is 1:1
• Indirect topology: the ratio between switch nodes and processor elements is more than 1:1
• Self-throttling network: a network in which congestion is reduced because the actors "detect" the congestion and begin to slow down the number of requests

I.C. Terminology

[Figures] Direct topologies: 2D mesh, 2D torus. Indirect topologies: binary tree.

I.C. Terminology

• Diameter: the largest distance between two switch nodes. Low preferred.
• Bisection width: the minimum number of edges that must be removed to divide the network into two equal halves. High preferred.
• Edges per switch node: constant, independent of network size. Also called node degree.

Network Architecture Topics

• Topology [Mesh Interconnect] • Routing [Dimensional Order Routing] • Flow Control [Flits’ Store and Forward] • Router Design [Locking Buffer] • Performance Analysis [Dally and Towle, 2003] 9/6/2006 eleg652-010-06F 85

Mesh Type Interconnects

2D Mesh: 16 nodes, 16 switches. Diameter: 6. Bisection: 4. Edges per node: 4.
2D Torus: 16 nodes, 16 switches. Diameter: 4. Bisection: 8. Edges per node: 4.
Illiac Mesh: 16 nodes, 16 switches. Diameter: 3. Bisection: 8. Edges per node: 4.
4-Cube (Hypercube): 16 nodes, 16 switches. Diameter: 4. Bisection: 8. Edges per node: 4.

Tree Type Interconnects

Binary Tree: 8 nodes, 15 switches. Diameter: 6. Bisection: 1. Edges per node: 3.
Binary Fat Tree: the same topology with links that widen toward the root.
4-ary Hypertree with depth 2: 16 nodes, 28 switches. Diameter: 4. Bisection: 8. Edges per node: 4.

Other Types of Interconnect: Butterfly Networks

[Figure: an 8-node butterfly with switch ranks R0-R3.]

• Building a butterfly
  – n = number of nodes, a power of 2
  – n[log(n) + 1] switches, arranged in log(n) + 1 rows (ranks)
  – Connect switch(i, j) to switch(i-1, j) and to switch(i-1, m), where i is the rank, j is the node (column), and m is j with its i-th most significant bit inverted (see the sketch below)
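The wiring rule in the last bullet can be spelled out in a few lines of C; this sketch simply prints the two upward links of every switch for an 8-node butterfly (n = 8 is an arbitrary example).

#include <stdio.h>

/* Sketch of the butterfly wiring rule from the slide: n terminal nodes
 * (a power of 2), log2(n)+1 ranks of n switches; switch (i, j) connects
 * up to switch (i-1, j) and to switch (i-1, m), where m is j with its
 * i-th most significant bit inverted. */
int main(void) {
    int n = 8, logn = 3;
    for (int i = 1; i <= logn; i++) {           /* rank 0 has no upward links */
        for (int j = 0; j < n; j++) {
            int m = j ^ (1 << (logn - i));      /* flip the i-th MSB of j */
            printf("switch(%d,%d) -> switch(%d,%d) and switch(%d,%d)\n",
                   i, j, i - 1, j, i - 1, m);
        }
    }
    return 0;
}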

Other Types of Interconnect: Perfect Shuffle Network

[Figure: an 8-node perfect shuffle network; bidirectional exchange links pair nodes that differ in the least significant bit, and unidirectional shuffle links perform a left cyclic rotation of the node address.]

Other Types of Interconnects

Star type network, systolic array network, ring, linear array.

Different types of rings: chordal rings of degree 3 and 4, barrel shifter, and completely connected.

Common Attributes

  Network Type      Switch Nodes     Diameter        Bisection Width   Edges per node
  2D Mesh           n                2(n^0.5 - 1)    n^0.5             4
  Binary Tree       2n - 1           2 log(n)        1                 3
  4-ary hypertree   2n - n^0.5       log(n)          n/2               6
  Butterfly         n[log(n) + 1]    log(n)          n/2               4
  Hypercube         n                log(n)          n/2               log(n)
  Shuffle-Exchange  n                2 log(n) - 1    ~ n/log(n)        2

Source: Quinn, Michael. "Chapter 2: Parallel Architectures." Parallel Programming in C with MPI and OpenMP.

Routing: A 3-Cube Example

Nodes are labeled with a three-bit code C2 C1 C0. Routing by Gray code proceeds one bit (one dimension) at a time: first along the links that differ in C0, then those that differ in C1, then those that differ in C2 (see the sketch below).

[Figure: the eight nodes 000-111 of the cube, with the edge sets used when routing by C0, by C1, and by C2 highlighted in turn.]
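The bit-at-a-time routing order (C0, then C1, then C2) is easy to express in code; this C sketch performs the corresponding bit-fixing route on a 3-cube for one example source/destination pair.

#include <stdio.h>

static void print_bin(int x, int bits) {
    for (int b = bits - 1; b >= 0; b--) putchar(((x >> b) & 1) ? '1' : '0');
}

/* Bit-fixing routing on a 3-cube, as on the slide: correct the address one
 * bit at a time, C0 first, then C1, then C2; each corrected bit is one hop
 * along that dimension. Source and destination are arbitrary examples. */
int main(void) {
    int dims = 3, cur = 0 /* 000 */, dst = 6 /* 110 */;
    print_bin(cur, dims);
    for (int d = 0; d < dims; d++) {          /* d = 0 routes by C0, and so on */
        if (((cur ^ dst) >> d) & 1) {
            cur ^= (1 << d);                  /* one hop along dimension d */
            printf(" -> ");
            print_bin(cur, dims);
        }
    }
    printf("\n");
    return 0;
}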

Routing: Dimension Order Routing

[Figure: a 2D mesh with 16 nodes labeled (x, y); source (0,3), destination (3,0).]

• Used for meshes and tori
• Hop along one dimension until no more hops are needed in that dimension, then change to the next dimension (see the sketch below)
• For each dimension i the router computes the offset m_i = (d_i - s_i) mod k, where k is the number of nodes per dimension; on a torus it takes whichever direction needs fewer hops (m_i or k - m_i), and on a mesh simply the sign of d_i - s_i
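A small C sketch of X-Y dimension-order routing on the mesh above, walking from the slide's source (0,3) to its destination (3,0); it routes fully in X and then in Y, as the bullets describe (no torus wrap-around is used here).

#include <stdio.h>

/* X-Y dimension-order routing on a 2D mesh: exhaust the X offset, then Y. */
int main(void) {
    int sx = 0, sy = 3, dx = 3, dy = 0;   /* source (0,3), destination (3,0) */
    int x = sx, y = sy;
    printf("(%d,%d)", x, y);
    while (x != dx) {                     /* hop along X until the offset is 0 */
        x += (dx > x) ? 1 : -1;
        printf(" -> (%d,%d)", x, y);
    }
    while (y != dy) {                     /* then hop along Y */
        y += (dy > y) ? 1 : -1;
        printf(" -> (%d,%d)", x, y);
    }
    printf("\n");
    return 0;
}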

Routing: Butterfly Routing

Routing from node 2 to node 5 (101):
• R0: send from node 2 to switch 2 in rank 0. Check the most significant bit of the destination (101): 1, so send to the right and erase it (01).
• R1: check the most significant bit of what remains: 0, so send to the left and erase it (1).
• R2 -> R3: check the most significant bit: 1, so send to the right and erase it (arrived).

When "right" and "left" are not obvious, it means "send straight down"; try sending from 0 to 2 (010). You can also take the XOR of source and destination and use each result bit, 0 meaning go straight down and 1 meaning take the alternative (cross) route, which produces the same path. This is how butterfly routing algorithms are actually computed (see the sketch below).
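The XOR formulation in the last paragraph can be written directly as code; this C sketch routes the slide's example (node 2 to node 5) through the ranks, printing whether each hop uses the straight edge or the cross edge.

#include <stdio.h>

/* XOR form of butterfly routing: at the hop from rank r to rank r+1,
 * examine bit r of (src XOR dst), MSB first; 0 means go straight down,
 * 1 means take the cross edge, which fixes that bit to dst's value. */
int main(void) {
    int logn = 3, src = 2, dst = 5;          /* 010 -> 101, as on the slide */
    int col = src;
    printf("enter switch(0,%d)\n", col);
    for (int r = 0; r < logn; r++) {
        int bit = logn - 1 - r;              /* most significant bit first */
        int cross = ((src ^ dst) >> bit) & 1;
        if (cross)
            col ^= (1 << bit);               /* cross edge: flip this bit */
        printf("rank %d -> %d: %s to switch(%d,%d)\n",
               r, r + 1, cross ? "cross" : "straight", r + 1, col);
    }
    return 0;
}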

Interconnects & Super Computers

• Hypercube → Cosmic Cube
• Mesh → Illiac
• Torus → Cray T3D
• Fat tree → CM-5
• Butterfly → BBN Butterfly

Topic 1e: High Performance Parallel Systems

Class A and Class B Applications

• Class A
  – Applications that are highly parallelizable
  – Low in communication
  – Regular
  – Logic flow depends little on the input data
  – Examples: matrix multiply, dot product, image processing (i.e. JPEG compression and decompression), etc.
• Class B
  – Applications in which parallelism is hidden or non-existent
  – High communication / synchronization needs
  – Logic flow depends greatly on the input data (while loops, conditional structures)
  – Examples: sorting methods, graph and tree searching

Science Grand Challenges

• Quantum chemistry, statistical mechanics and relativistic physics
• Cosmology and astrophysics
• Computational fluid dynamics and turbulence
• Material design and superconductivity
• Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling
• Medicine and modeling of human organs and bones
• Global weather and environmental modeling

A Growth Factor of a Billion in Performance in Thirty Years

Milestones along the curve: 1976 Cray 1, 1982 Cray XMP, 1988 Cray YMP, 1991 Intel Delta, 1996 T3E, 1997 ASCI Red, 2002 Earth Simulator, 2003 Cray X1, 2004 BlueGene/L, 2007 IBM Cyclops-64 – spanning MegaOPS (10^6) through GigaOPS (10^9), TeraOPS (10^12), PetaOPS (10^15) and toward ExaOPS (10^18).

What have we learned from the past decade?

• Building a general-purpose parallel machine is a difficult task
• Proof by contradiction: many companies have gone bankrupt or left the parallel market

State of Parallel Architecture Innovations

"... researchers basked in parallel-computing glory. They developed an amazing variety of parallel algorithms for every applicable sequential operation. They proposed every possible structure to interconnect thousands of processors ..."

However, "... the market for massively parallel computers has collapsed, and many companies have gone out of business."
[IEEE Computer, Nov. 1994, pp. 74-75]

Until now ...

State of Parallel Architecture Innovations

"... The term 'proprietary architecture' has become pejorative. For computer designers, the revolution is over and only 'fine tuning' remains ..."
["End of Architecture", Burton Smith, 1990s]

Reason for this: lack of parallel-processing software.

Corporations Vanishing

[Timeline, roughly 1985-2005: Multiflow, ESCD, ETA, Myrias, Meiko Scientific, Convex Computer, Thinking Machines, Pyramid, Kendall Square Research, MasPar, Cray Research, BBN, DEC, Sequent, nCube]

Cluster Computers

"After 20 years of false starts and dead ends in high-performance computer architecture, 'the way' is now clear: Beowulf clusters are becoming the platform for many scientific, engineering, and commercial applications."
[Gordon Bell, 2002: A Brief History of Supercomputing]

Challenges: The Killer Latency Problem

[Figure: two nodes (P, $, M, NI) connected through a network; the network interfaces and the network itself are the bottlenecks.]

• Due to:
  – Communication: data movement, reduction, broadcast, collection
  – Synchronization: signals, locks, barriers, etc.
  – Thread management: quantity, cost, and idle behavior
  – Load management: starvation, saturation, etc.

Application Parallelism

• Taking advantage of different levels of parallelism • Tree structures: Wave-front parallelism • Data dependencies: Dataflow execution models • Low cost synchronization constructs • TLP, ILP and speculative execution More info: Kevin B. Theobald, Guang R. Gao, and Laurie J. Hendren,

On the Limits of Program Parallelism and its Smoothability

, Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO '92), pp. 10-19, December 1992. (A longer version in ACAPS Technical Memo 40, School of Computer Science, McGill University, June 1992.) 9/6/2006 eleg652-010-06F 106

There is an “old network saying”:

• Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed - you can not bribe God.

- David Clark, MIT 9/6/2006 eleg652-010-06F 107

Proposed Solutions

• Low round trip latency on small messages is very important to many Class B applications • Minimize Synchronization and Communication latency: PIM, hardware lock free data structures, … • Fully utilize available bandwidth to hide latency: Helper / Loader threads, double buffering.

9/6/2006 eleg652-010-06F 108

The "Killer Latency" Problem

It makes the "performance debugging" of parallel programs very hard!

[Figure: performance versus effort + expert knowledge for applications A1, A2, ..., An; a large gap separates "reasonable performance" from "best performance".]

And Finally ...

• Memory Coherency
  – Ensures that a memory op (a write) will become visible to all actors
  – Does not impose restrictions on when it becomes visible
• Memory Consistency
  – Ensures that two or more memory ops have a certain order among them, even when those operations come from different actors

Example (see the sketch below):
  P1: A = 0; B = 1; Print A
  P2: B = 0; A = 1; Print B
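A minimal pthreads sketch of the P1/P2 example above (the slide gives no code): with no ordering constraints between the four writes, either thread may observe either value, which is exactly the distinction between coherency (the writes eventually become visible) and consistency (in what order). Note that, as plain C, this is a data race and is shown only for illustration.

#include <pthread.h>
#include <stdio.h>

/* The slide's two actors; the outcome depends on how the writes to the
 * shared variables become visible and in what order (illustration only:
 * a real program would use atomics or fences). */
int A = 0, B = 0;

static void *p1(void *arg) { (void)arg; A = 0; B = 1; printf("P1 sees A = %d\n", A); return NULL; }
static void *p2(void *arg) { (void)arg; B = 0; A = 1; printf("P2 sees B = %d\n", B); return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}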

Parallel Computer Lingo

• Massive Parallel Processor Computer
• Amdahl's Law
• Flynn's Taxonomy
• (Cache coherent) Non Uniform Memory Space
• Dance Hall Memory Systems
• Parallel Programming Models
• Parallelism and Concurrency
• Distributed Memory Machines and Beowulfs
• Super Scalar Computers
• Symmetric Multi Processor Computer
• Vector Computers and Processing
• Class A and B applications
• (Memory) Consistency
• Very Long Instruction Word Computers
• (Memory) Coherency
• Expertise Gap
• Constellation Super Computers
• Constant Edge Length

Bibliography

• Quinn, Michael. Parallel Programming in C with MPI and OpenMP. McGraw-Hill. ISBN 0-07-282256-2.
• Dally, William; Towles, Brian. Principles and Practices of Interconnection Networks. Morgan Kaufmann. ISBN 0-12-200751-4.
• Black Box Network Services' Pocket Glossary of Computer Terms.
• Intel, Introduction to Multiprocessors: http://www.intel.com/cd/ids/developer/asmo-na/eng/238663.htm?prn=y