Transcript Topic 1
9/6/2006
Topic 1
Parallel Architecture Lingo & Commonly Used Terms
Complexity is just the lack of knowledge
eleg652-010-06F 1
9/6/2006
Reading List
• Slides: Topic1x
• Culler & Singh: Chapters 1 and 2
• Other assigned readings from homework and classes
Parallel Computer Lingo
Massively Parallel Processor (MPP) computers
Amdahl's Law
Flynn's Taxonomy
(Cache-coherent) Non-Uniform Memory Access
Dance hall memory systems
Parallel programming models
Parallelism and concurrency
Distributed memory machines and Beowulfs
Superscalar computers
Symmetric Multi-Processor (SMP) computers
Vector computers and processing
Class A and B applications
(Memory) consistency
Very Long Instruction Word (VLIW) computers
(Memory) coherency
The Mystifying Walls
• Power Wall
  – Power dissipation, consumption, and leakage of current processors
• Memory Wall
  – A thousand-cycle delay to DRAM!
  – Memory hierarchies and non-blocking memory transfers
• Frequency Wall
  – Increasing pipeline depths
  – Diminishing returns, even negative ones when the power wall is taken into account
Parallelism and Concurrency
Concurrency
"Concurrency occurs when two or more execution flows are able to run simultaneously." (Edsger Dijkstra)
A property of execution flows

Parallelism
"The maximum number of independent subtasks in a given task at a given point in its execution." (The Free On-line Dictionary of Computing)
A state of execution flows

Parallel Computing: division of tasks in parallel computers and its side effects (i.e. memory consistency)
Concurrent Computing: interactions between tasks in a concurrent system (i.e. signal handling)
Types of Parallelism
(From application down to hardware)
• Job Level: inter- and intra-job
• Task Level: loop-based and function-based
• ISP / TLP: between streams or threads of execution
• Instruction Level Parallelism: inter- and intra-instruction
• Arithmetic and Bit Level: within ALU units
Job Level Parallelism
Orchestrator: OS, Programmer
Overlap of jobs with other CPU activities. Example: I/O retrieval.
Inter-Job Parallelism: switch between jobs when "processing" a lengthy I/O request.
(Time 1: CPU runs Job 1 while I/O serves Job 2; Time 2: CPU runs Job 2 while I/O serves Job 1.)
Intra-Job Parallelism: compute an independent calculation while waiting for a DMA transfer.
Requirements: duplicated resources.
A job is a unit of work.
Examples
Jobs pipelined through a computational processor (stage 1) and an I/O processor (stage 2) over common memory:
Time 1: Job 1 / Job 2. Time 2: Job 2 / Job 1. Time 3: Job 1 / Job 3. Time 4: Job 3 / Job 1.
Job Phase 1: get args for segment 3 ...
Job Phase 2: calculate x = a * y + z ...
Job Phase 3: print x ...
Job Phase 4: args = args + x ...
(Timeline: JP 1, JP 2, JP 3, DMA starts ... DMA ends, JP 4.)
Task Level Parallelism
Orchestrator: Compiler, Programmer, OS
• Task Level == Program Level
• Among different code sections
• Function calls or other abstractions of code (like code blocks)
• Different iterations of the same loop
• Data dependency and program partitioning
Thread Level Parallelism
Thread: a sequence of instructions which has a single PC (program counter). Threads are designed to run different parts of a program on different processors to take full advantage of parallelism.
Orchestrator: Programmer or Compiler
Instruction Level Parallelism
• Between instructions
  – Independent instructions running on hardware resources
  – Assumption: there is more than one hardware resource (ALU, adder, multiplier, loader, MAC unit, etc.)
  – Data dependency
• Between phases of instructions
  – Example: instruction pipelines
Principles of Pipelining
[Diagram: an N-stage pipeline; over time, successive tasks occupy successive stages, so eventually all N stages are busy at once]
Typical RISC pipeline: Instruction Fetch, Instruction Decode, Execute, Memory Op, Register Update
Example
Five-stage pipeline: Stage 1 Instruction Fetch, Stage 2 Instruction Decode, Stage 3 Execute, Stage 4 Memory Op, Stage 5 Register Update.

Stage |  T0  T1  T2  T3  T4  T5
  1   |  I1  I2  I3  I4  I5  I6
  2   |      I1  I2  I3  I4  I5
  3   |          I1  I2  I3  I4
  4   |              I1  I2  I3
  5   |                  I1  I2

where Tn is time n and In is instruction number n.
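The fill pattern in the table above can be sketched in a few lines. This is a minimal illustration (the helper name is mine, not from the course material):

```python
# 5-stage pipeline: at time t, stage s (1-based) holds instruction
# t - s + 2, counting from T0 when I1 enters stage 1.
STAGES = ["Instruction Fetch", "Instruction Decode", "Execute",
          "Memory Op", "Register Update"]

def instruction_in_stage(t, stage):
    """Instruction number occupying `stage` (1..5) at time T`t`,
    or None while the pipeline is still filling."""
    i = t - (stage - 1) + 1          # I1 enters stage 1 at T0
    return i if i >= 1 else None

# Reproduce one column of the table: at T4 the pipeline is full.
column_t4 = [instruction_in_stage(4, s) for s in range(1, 6)]
print(column_t4)   # [5, 4, 3, 2, 1]: stages 1..5 hold I5..I1
```

This reproduces the T4 column of the table, where the pipeline first reaches full overlap.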
Pipeline Hazards
• Any condition that delays, disrupts, or prevents the smooth flow of "tasks" in the pipeline system.
• Detection and resolution are major aspects of pipeline design.
• Types of hazards: structural, control, and data.
Types of [Parallel] Computers
• Flynn's Taxonomy (1972)
  – Classification based on how instruction streams operate on data streams
  – A stream is a collection of items (instructions or data) being executed or operated on by the processor
• Types of computers: SISD, SIMD, MISD, and MIMD
SISD
• Single Instruction Single Data • One processor is used • Pipelining is applicable • Serial Scalar Computer • Example: Intel 4004
Intel 4004 Block Diagram Courtesy of Wikipedia: The Free Online Encyclopedia
Captions: CU = Control Unit, PU = Processing Unit, MU = Memory Unit, IS = Instruction Stream, DS = Data Stream, PE = Processing Element, LM = Local Memory
[Diagram: I/O and the CU issue an IS to the PU, which exchanges a DS with the MU]
SIMD
• Single Instruction Multiple Data • Single (main) processor
Cray 1 Courtesy of Wikipedia: The Free Online Encyclopedia
• Vector operations: one vector instruction = many machine operations on the data stream
• Implemented as pipelined vector units or arrays of processors
• Examples: CRAY-1, ILLIAC-IV, and ICL DAP
[Diagram: the program is loaded from the host; the CU broadcasts an IS to PE 1..PE n, each with local memory LM 1..LM n; data sets are loaded from the host. Captions as on the SISD slide]
MISD
• Multiple Instructions Single Data • Multiple Processors • Single Data stream is operated by multiple instruction streams.
• Pipeline architectures and systolic arrays
• Example: Colossus Mark II (1944)
[Diagram: memory (program and data) feeds a single DS through a chain of CU/PU pairs (CU 1/PU 1, CU 2/PU 2, ...), each with its own IS. Captions as on the SISD slide]
MIMD
• Multiple Instruction Multiple Data
• Multiple processors
• Multiple instruction streams operate on multiple data streams
• Multiprocessors and multicomputers fall into this category.
Red Storm, courtesy of Cray Inc.
[Diagram: CU 1..CU n issue instruction streams to PU 1..PU n, whose data streams go to shared memory; I/O on both ends. Captions as on the SISD slide]
Why Flynn’s Taxonomy is not enough
• Too Broad (especially for MIMD) • Relation with ILP?
• Shared Memory Models and Distributed Memory models?
• Relation with Parallel Programming Models?
Another Classification ILP Architectures
• Multiple instruction issuing plus deep pipelining
• Superscalar
  – Dynamic instruction dispatching
  – More than one instruction per cycle
  – Intel Pentium 4
• Very Long Instruction Word
  – Static instruction arrangement
  – More than one instruction per cycle
  – Texas Instruments C6x DSP processors
• Superpipelined
  – MIPS 4000/8000
  – DEC Alpha
• Hybrid
More Classifications Structural Composition
• Vector computer / processor array
  – A serial processor connected to a series of synchronized processing elements (PEs)
• Multiprocessors
  – Multiple-CPU computers with shared memory
  – Centralized multiprocessor
    • A group of processors sharing a bus and the same physical memory
    • Uniform Memory Access (UMA)
    • Symmetric Multi-Processors (SMP)
More Classifications Structural Composition
– Distributed multiprocessors
  • Memory is distributed across several processors
  • Memory forms a single logical memory space
  • Non-uniform memory access multiprocessor (NUMA)
• Multicomputers
  – Disjoint local address spaces for each processor
  – Asymmetrical multicomputers
    • Consist of a front end (user interaction and I/O devices) and a back end (parallel tasks)
More Classifications Structural Composition
– Symmetrical multicomputers
  • All components (computers) have identical functionality
  • Clusters and networks of workstations
• Massively Parallel Processing machines (MPP)
  – Any combination of the above classes in which the number of (computational) components is on the order of hundreds or even thousands
  – Highly coupled, with fast interconnect networks
  – All [current] supercomputers fall in this range
The Software Dilemma
• OS and application software depend on the programming model
• Programming model: a contract between software and hardware
• Many programming models, due to many types of architectures
• Dilemma: many directions; how to unify them?
Topic 1a: In Depth Examples
SIMD Computers of Yesteryears
Vector Processing Core
[Diagram: a multi-port memory system supplies stream A and stream B to a pipelined adder or a processor array, producing stream C = A + B]
A vector addition is possible when two vectors (data streams) are processed by vector hardware, be it a vector adder or a processor array.
ILLIAC IV
Data Sheet
Best known for: one of the biggest failures in supercomputing history
Manufacturer: Burroughs / University of Illinois
Number of computers produced: 1
First introduced: 1974
Type: SIMD / array of processors

Specifications:
Memory size: [back end] 2048 64-bit words per processor
Clock speed: [back end] 13 MHz per element
Cost: 31 million dollars
Projected: 10^9 ops/sec, 256 processing elements with 4 control units
Achieved: 200 x 10^6 ops/sec, 64 processing elements with 1 control unit
ILLIAC IV
Function of the control unit (front end):
  - store the user program
  - decode all instructions and determine where they are to be executed
  - execute scalar instructions
  - broadcast vector instructions
Function of the processing elements (back end):
  - perform the same function in lock-step
  - masking scheme
  - data routing
Function of the interconnection network:
  - communication between PEs (data exchanges)
ILLIAC IV Configuration 1
[Diagram: front end (CU, memory, I/O processor) connected by a scalar memory bus, an instruction broadcast bus, and a global result bus to a back end of P/M pairs linked by an interconnection network]
BSP Configuration 2
[Diagram: CU, memory, and I/O processor connected by a scalar memory bus, an instruction broadcast bus, and a global result bus to processors P1..Px and memory banks M1..My through an alignment network; x = number of processors, y = number of memory banks]
ILLIAC Mesh
[Diagram: 64 nodes (0-63) in an 8x8 mesh with wrap-around links, so node i connects to nodes i-1, i+1, i-8, and i+8 (mod 64)]
BSP Dataflow
[Diagram: 17 memory banks feed an input alignment network into 16 processors; an output alignment network returns results from the 16 processors to the 17 memory banks]
The AMT DAP 500
[Diagram: user interface and host connection unit; master control unit with host code memory; an array of processor elements, each with accumulator (Q), activity control (A), carry (C), and data (D) planes; fast data channel; 32K bits of array memory per PE]
AMT DAP 500 and DAP 510
64 by 64 one-bit PEs in a 2D mesh
SIMD across the Years
(a) Multivector track: CDC 7600 (CDC, 1970), CDC Cyber 205 (Levine, 1982), ETA 10 (ETA, Inc., 1989); Cray 1 (Russell, 1978), Cray Y-MP (Cray Research, 1989), Cray/MPP (Cray Research, 1993); Fujitsu, NEC, and Hitachi models
(b) SIMD track: Illiac IV (Barnes et al., 1968), Goodyear MPP (Batcher, 1980), DAP 610 (AMT, Inc., 1987), BSP (Kuck and Stokes, 1982), CM2 (TMC, 1990), MasPar MP1 (Nickolls, 1990), IBM GF/11 (Beetem et al., 1985)
Topic 1b: In Depth Examples
MIMD Computers of Today
Generic MIMD Architecture
• A generic modern multiprocessor: nodes connected by a scalable network
• Node: processor(s) with cache ($), memory system, plus a communication assist (CA)
  - Network interface and communication controller
Classification
• Shared memory (multiprocessors)
  - Single logical memory space
  - Advantages: simple communication and programmability
  - Problems: scalability, consistency and coherency, synchronization
• Distributed memory (multicomputers)
  - Distributed memory space
  - Advantages: scalability
  - Disadvantages: communication
Distributed Memory MIMD Machines
(Multicomputers, MPPs, clusters, etc.)
• Message passing programming models
• Interconnect networks
• Generations/history:
  - 1983-87: Cosmic Cube, iPSC/I, II (software routing)
  - 1988-92: mesh-connected machines (hardware routing), Intel Paragon
  - 1993-99: CM-5, IBM SP
  - 1996- : clusters
Concept of the Message-Passing Model
[Diagram: process P issues "Receive Y, P, t" into address Y of its local address space; process Q issues "Send X, Q, t" from address X of its local address space; the tagged send and receive match]
• Send specifies the buffer to be transmitted and the receiving process
• Recv specifies the sending process and the application storage to receive into
• A memory-to-memory copy, but processes must be named
• In the simplest form, the send/recv match achieves a pairwise synchronization event
• Programming: MPI, etc.
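The send/recv pairing above can be sketched outside MPI with two OS processes (private address spaces) and a pipe standing in for the network link. This is only an illustration of the model; the function names are mine:

```python
from multiprocessing import Process, Pipe

def process_q(conn):
    x = [1, 2, 3]              # buffer X in Q's local address space
    conn.send(x)               # "Send X, Q, t": transmit the buffer

def process_p(conn, out):
    y = conn.recv()            # "Receive Y, P, t": blocks until the send matches
    out.send(sum(y))           # report what arrived in Y

def run_demo():
    a, b = Pipe()              # the link between P and Q
    r_parent, r_child = Pipe() # channel to report P's result back
    q = Process(target=process_q, args=(a,))
    p = Process(target=process_p, args=(b, r_child))
    q.start(); p.start()
    result = r_parent.recv()
    q.join(); p.join()
    return result

if __name__ == "__main__":
    print(run_demo())          # 6: P received Q's buffer by memory-to-memory copy
```

The blocking recv is what realizes the pairwise synchronization event the slide mentions: P cannot proceed until Q's matching send arrives.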
Evolution of Message-Passing Machines
• Early machines: FIFO on each link
  - Hardware close to the programming model
  - Enabling non-blocking operations
  - Buffered by the system at the destination until recv
• Diminishing role of topology
  - Store-and-forward routing: topology important
  - The introduction of pipelined routing made it less so
  - Cost is in the node-to-network interface
  - Simplifies programming
[Diagram: a 3-cube of nodes 000..111]
Example: IBM SP-2
[Diagram: SP2 node: Power2 CPU with L2 cache on a memory bus; memory controller with 4-way interleaved DRAM; Micro Channel bus carrying I/O, an i860, DMA, and the NI]
64-node configuration. All pictures courtesy of IBM Research.
RS6000 / Power2 SuperChip: a superscalar RISC chip
Example: Intel Paragon
[Diagram: node with two i860 processors (each with L1 cache) on a 64-bit, 50 MHz memory bus; memory controller with 4-way interleaved DRAM; driver, DMA, and NI to a 2D mesh interconnect network, 8 bits wide at 175 MHz, bidirectional]
Sandia's Intel Paragon XP/S based supercomputer.
Specs:
Memory per node: 128 MiB
Max memory: 128 GiB
Number of nodes: 64 to 6,768
Cycle: 20 ns
Maximal performance: 300 GFLOPS
The MANNA Multiprocessor Testbed
[Diagram: crossbar hierarchies connecting 16 clusters; each cluster is 4 nodes on a crossbar; each node has two i860XP compute processors (CPs), a network interface, I/O, and 32 MByte of memory]
CM-5 Scalable Massively Parallel Supercomputer for 1990’s
• 10^12 floating-point operations per second (TeraFLOPS)
• 64,000 powerful RISC microprocessors working together
• Scalable: performance grows transparently
• Universal: supports a vast variety of application domains
• Highly reliable: sustained performance for large jobs requiring weeks/months to run
NAS Thinking Machines CM-5, 1993
Distributed Memory MIMD
• Advantages
  - Less contention
  - Highly scalable
  - Simplified synchronization
  - Message passing = synchronization + communication
• Disadvantages
  - Load balancing
  - Deadlock / livelock prone
  - Waste of bandwidth
  - Overhead of small messages
Shared Memory MIMD
• Logical or physical shared memory
• Low-cost interconnect
  - Shared fabric (bus)
  - Switched packet fabric (crossbar)
• Logical single address space
• PSM [Physical Shared Memory] and DSM [Distributed Shared Memory]
[Diagram: processors sharing one memory module over an interconnect vs. processors each paired with local memory over an interconnect]
Shared-Memory Execution & Architecture Models
• Uniform-memory-access model (UMA)
• Non-uniform-memory-access model (NUMA) without caches (BBN, Cedar, Sequent)
• COMA (Kendall Square KSR-1, DDM)
• CC-NUMA (DASH)
• Symmetric vs. asymmetric MPs: symmetric MPs (SMPs) vs. asymmetric MPs (some master, some slave)
• Programming: OpenMP, etc.
[Diagram: UMA: processors share memory modules across the interconnect; NUMA: each processor has its own memory on the interconnect]
Shared Address Space Model
(e.g. pthreads)
• Process: a virtual address space plus one or more threads of control
• Portions of the address spaces of processes are shared
• Writes to shared addresses are visible to other threads
• A natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization
[Diagram: virtual address spaces of processes P0..Pn; their shared portions map to common physical addresses, their private portions to private physical memory]
Shared Address Space Model
(Another View)
• System V IPC: a main structure contains all pages in the shared memory; memory segments are given contiguously.
• POSIX IPC: memory is defined as a file system and each page can be accessed as a file or a group of them.
[Diagram: a logical view of processes and threads, each with stacks and data segments, attached to a shared memory region]
Shared Address Space Architectures
• Any processor can directly reference any memory location (communication is implicit)
• Convenient:
  - Location transparency
  - Programming model similar to time-sharing on uniprocessors
• Popularly known as the shared memory machine or model
• Ambiguous: memory may be physically distributed among processors
Shared-Memory Parallel Computers
(late 90’s –early 2000’s)
• SMPs (Intel Quad, SUN SMPs)
• Supercomputers: Cray T3E, Convex 2000, SGI Origin/Onyx, Tera computers
Scaling Up
[Diagram: "dance hall" organization (processors with caches on one side of the network, memory modules on the other) vs. distributed memory (each node pairs a processor and cache with local memory)]
- Interconnect: cost (crossbar) or bandwidth (bus)
- Dance hall: bandwidth still scalable, but at lower cost
- Distributed memory, or non-uniform memory access (NUMA)
- Caching shared (particularly nonlocal) data?
Shared Memory Architecture Examples
(2000 - now)
• Sun's Wildfire architecture (Hennessy & Patterson, section 6.11, page 622)
• Intel Xeon multithreaded architecture
• SGI Onyx-3000
• IBM p690
• Others
Example: SUN Enterprise
[Diagram: CPU/mem cards, each with two processors ($ and $2 caches), a memory controller, and a bus interface/switch; Gigaplane bus (256-bit data, 41-bit address, 83 MHz); I/O cards with bus interfaces]
• 16 cards of either type: processors + memory, or I/O
• All memory is accessed over the bus, so the machine is symmetric
• Higher bandwidth, higher latency bus
SUN FIRE E25K
[Diagram: expander boards of processors connected to shared memory; I/O boards]
• 4 CPUs per board: 1.35 GHz UltraSPARC IV with 32 KB I-cache and 64 KB D-cache; 72 CPUs total
• 32 GB memory per board
• Crossbar switch: 115.2 (peak) / 43.2 (sustained) GB/s bandwidth
SUN FIRE 15K
[Diagram: expander boards of processors connected to shared memory; I/O boards]
• 4 CPUs per board: 900 MHz UltraSPARC with 32 KB I-cache and 64 KB D-cache
• 32 GB memory per board
• Crossbar switch: 43 GB/s bandwidth
Intel Xeon MP based server(new)
[Diagram: four Xeon processors on a northbridge (NB); memory attached through four XMB controllers; I/O hub with 8x PCI-X bridges]
• 4 x 64-bit, 3.6 GHz Xeons with 1 MB L2 cache and 8 MB L3 cache
• Each pair of processors shares a common bus of 5.3 GB/s bandwidth
• Memory connected via 4 controllers with 5.3 GB/s bandwidth each
Intel Xeon MP based server (old)
[Diagram: four Xeon processors on a shared bus to a memory control hub; memory; I/O hub with PCI-X bridges]
• 1.8 GHz Xeons with 512 KB L2 cache
• 4 processors share a common bus of 6.4 GB/s bandwidth
• Memory shares a common bus of 4.3 GB/s bandwidth
• Memory is accessed through a memory control hub
IBM P690
[Diagram: POWER4 chip with two 1 GHz CPU cores (I/D caches), shared L2 cache, L3 controller with off-chip L3 cache, distributed switch, memory, processor local bus, and I/O bus]
• Each POWER4 chip has two 1 GHz processor cores, a shared 1.5 MB L2, direct access to 32 MB/chip of L3, and chip-to-chip communication logic
• Each SMP building block has 4 POWER4 chips
• The base p690 has up to 4 SMP building blocks
SGI Onyx 3800
[Diagram: C-Bricks (each with 2-4 processors, caches, and shared memory) connected through an R-Brick crossbar]
• Each node, called a C-Brick, has 2-4 processors at 600 MHz
• An R-Brick is an 8x8 crossbar switch of 3.2 GB/s bandwidth: 4 ports for C-Bricks and 4 for other R-Bricks
• Each C-Brick has up to 8 GB of local memory that can be accessed by all processors by way of the NUMAlink interconnect
Example: Cray T3E
[Diagram: node with processor and cache, memory controller and NI, local memory, and an X/Y/Z switch; external I/O]
• Scales up to 1024 processors, with 480 MB/s links
• The memory controller generates communication requests for nonlocal references
• No hardware mechanism for coherence (SGI Origin etc. provide this)
Multithreaded Execution and Architecture Models
• Dataflow models
  - Execution is driven by the availability of data
• Control-flow based models
  - The logic of the application drives the use of resources
• Hybrid models
  - Concepts of tokens associated with both data and logic
Multithreaded Execution and Architecture Models
• "Time sharing" of one instruction processing unit, in a pipelined fashion, by all instruction streams
• Used by the Denelcor Heterogeneous Element Processor supercomputer
• Used in Hyper-Threading technology (Intel's HT) and simultaneous multithreading (IBM's SMT and in general)
• Key: logically or physically replicated and independent hardware resources (i.e. control units, register files, etc.)
The Denelcor HEP
1982
[Diagram: an 8-stage pipeline (S1..S8, PSW++) with 50 processes per PE; 16 PEs (PE 1..PE 16) connected through a packet switch network to 128 data memory modules (DMM 1..DMM 128)]
Denelcor HEP
• Many instruction streams, a single P-unit
• 16 PEMs + 128 DMMs: 64 bits/DMM
• Packet-switching network
• I-stream creation is under program control
• 50 I-streams
• Programmability: SISAL, Fortran
Tera MTA (1990)
• A shared memory LIW multiprocessor
• 128 fine-grained threads with 32 registers each, to tolerate FU, synchronization, and memory latency
• Explicit-dependence lookahead increases single-thread concurrency
• Synchronization uses full/empty bits
Shared Memory MIMD
• Advantages
  - No partitioning
  - No (explicit) data movement
  - Minor modifications (or none at all) of toolchains and compilers
• Disadvantages
  - Synchronization
  - Scalability
    • High-throughput, low-latency networks
    • Memory hierarchies
    • DSM
Future Trend of MIMD Computers
• Program execution models: beyond the SPMD model
• Hybrid architectures: provide both shared memory and message passing
• Efficient mechanisms for latency AND bandwidth management (the "memory wall" problem)
Side Note 1 SPMD Model
• Single Program, Multiple Data streams
• A piece of code is run by multiple processors which operate on different data streams
• Used as a programming model for shared memory machines
• Used by OpenMP, etc.
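The SPMD idea can be sketched directly: every worker runs the same program text, and only its rank decides which slice of the data it touches. A minimal sketch (names and the strided partition are my choices, not from the course):

```python
from multiprocessing import Pool

DATA = list(range(16))
NPROCS = 4

def spmd_body(rank):
    # Same code for every worker; behavior differs only through `rank`.
    chunk = DATA[rank::NPROCS]          # strided partition of the data stream
    return sum(x * x for x in chunk)    # local computation on the local slice

if __name__ == "__main__":
    with Pool(NPROCS) as pool:
        partials = pool.map(spmd_body, range(NPROCS))
    print(sum(partials))                # same answer as the serial sum of squares
```

Combining the partial results at the end plays the role of the reduction step that SPMD frameworks such as OpenMP or MPI provide.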
Recent High-End MIMD Parallel Architecture Projects
• ASCI Projects (USA) • ASCI Blue • ASCI Red • ASCI Blue Mountains • HTMT Project (USA) • The Earth Simulator (Japan) • IBM BG/L Architecture • HPCS architectures (USA) • IBM Cyclops-64 Architecture • Japan and others 9/6/2006 eleg652-010-06F 71
Topic 1c: Amdahl's Law
The Reason to Study Parallelism
Amdahl's Law
"When the fraction of serial work in a given problem is small, say s, the maximum speedup obtainable (even from an infinite number of processors) is only 1/s."
Amdahl, G.: "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS Conf. Proceedings 30, 1967, pp. 483-485.
Imagine a Fixed Problem Size…
Serial execution: Time = 1, split into a serial fraction S and a parallel fraction P, with S + P = 1.
Parallel execution on N processors: Time = S + P/N.

Speedup = (S + P) / (S + P/N) = 1 / (S + P/N)

For N = 1024, the speedup is only 91x, 48x, 31x, and 24x for serial fractions S of 1%, 2%, 3%, and 4%: an "unforgivingly deep" penalty (Gustafson).
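The fixed-size speedup is easy to check numerically; a minimal sketch (the function name is mine):

```python
def amdahl_speedup(s, n):
    """Speedup with serial fraction s on n processors: 1 / (s + (1 - s)/n)."""
    return 1.0 / (s + (1.0 - s) / n)

# The 1024-processor column from the slide, S = 1%..4%:
for s in (0.01, 0.02, 0.03, 0.04):
    print(f"S = {s:.0%}: {amdahl_speedup(s, 1024):.1f}x")

# No matter how large n gets, the speedup stays capped at 1/s:
print(amdahl_speedup(0.01, 10**9))   # just under 100
```

Even a billion processors cannot push a 1%-serial program past 100x, which is exactly the 1/s bound in Amdahl's statement.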
Revised Amdahl’s Law
Problem size grows with the number of processors.
As a first approximation, we have found that it is the parallel or vector part of a program that scales (grows) with the problem size. The time for vector start-up, program loading, serial bottlenecks, and I/O that makes up the S component of the run does not grow as rapidly with problem size.
Imagine a Growing Problem Size…
Parallel execution on N processors: Time = 1, with serial fraction S and parallel fraction P, S + P = 1.
Serial execution of the same scaled problem: Time = S + N * P.

Scaled Speedup = (S + N * P) / (S + P) = S + N * P
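The scaled (Gustafson) speedup can be checked the same way; again the function name is mine:

```python
def scaled_speedup(s, n):
    """Gustafson's scaled speedup: (s + n*(1 - s)) / (s + (1 - s)) = s + n*(1 - s)."""
    return s + n * (1.0 - s)

# With the same 1% serial fraction that capped Amdahl's speedup near 91x
# on 1024 processors, the scaled speedup grows almost linearly:
print(scaled_speedup(0.01, 1024))
```

Because the serial component does not grow with the problem, the speedup is linear in N rather than capped at 1/S, which is the whole point of the revised law.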
Topic 1d
Interconnection Networks: How do you talk to others?
A Digital Computer System
[Diagram: a computer system consists of logic, memory, and the movement of data (communication); the network is the limiting factor for performance]
Interconnect (I.C.) concerns: hardware technology, topology, flow control, bandwidth, latency, QoS, routing, and terminal count.
A Generic Computer The MIMD Example
[Diagram: nodes (processor + cache, memory, communication assist) on a scalable network]
• Small-size interconnect networks: buses and [one-level] crossbars
• Full-featured interconnect networks: packet switching fabrics
• Key: a scalable network
Objective: make efficient use of scarce communication resources, providing high-bandwidth, low-latency communication between nodes at minimum cost and energy.
I.C. Terminology
• Shared medium: only one message at any point in time
• Switched medium: point-to-point communication
• Message: a piece of data that needs to be moved from point A to point B
• Packet: "a sequence of data, with associated control information, that is switched and transmitted as a whole" [Black Box Network Services' Pocket Glossary of Computer Terms]
• Flits: flow control digits; divisions of packets. Store-and-forward vs. cut-through
• Phits: physical digits
I.C. Terminology
• Topology: the connection pattern between communication nodes (i.e. switches, NICs, routers, etc.)
  - Analogy: a roadmap
• Routing: the discovery of a communication path between two computing nodes
  - Analogy: a car's route
  - Factors: load balancing and path length
• Flow control: message access patterns to a particular network resource
  - Analogy: traffic lights, left-turn-only lanes, etc.
• Livelocks and deadlocks
  - [Car] analogy: getting lost, and gridlock
I.C. Terminology
• Direct topology: the ratio between switch nodes and processor nodes is 1:1
• Indirect topology: the ratio between switch nodes and processor elements is more than 1:1
• Self-throttling network: a network in which congestion is reduced because the actors "detect" such congestion and begin to slow down their rate of requests
I.C. Terminology
[Diagram: direct topologies: 2D mesh and 2D torus; indirect topology: binary tree]
I.C. Terminology
• Diameter: the largest distance between two switch nodes. Low preferred.
• Bisection width: the minimum number of edges that must be removed to divide the network into two equal halves. High preferred.
• Edges per switch node: ideally constant, independent of network size. Also called node degree.
Network Architecture Topics
• Topology [mesh interconnect]
• Routing [dimension order routing]
• Flow control [flits' store and forward]
• Router design [locking buffer]
• Performance analysis [Dally and Towle, 2003]
Mesh Type Interconnects
2D Mesh: 16 nodes, 16 switches. Diameter: 6. Bisection: 4. Edges per node: 4.
2D Torus: 16 nodes, 16 switches. Diameter: 4. Bisection: 8. Edges per node: 4.
Illiac Mesh: 16 nodes, 16 switches. Diameter: 3. Bisection: 8. Edges per node: 4.
4-Cube (Hypercube): 16 nodes, 16 switches. Diameter: 4. Bisection: 8. Edges per node: 4.
Tree Type Interconnects
Binary Tree: 8 nodes, 15 switches. Diameter: 6. Bisection: 1. Edges per node: 3.
Binary Fat Tree.
4-ary Hypertree with depth 2: 16 nodes, 28 switches. Diameter: 4. Bisection: 8. Edges per node: 4.
Other Types of Interconnect
Butterfly Networks
[Diagram: an 8-node butterfly with switch ranks R0..R3]
• Building a butterfly with n nodes (n a power of 2):
  - n[log(n) + 1] switches, arranged in log(n) + 1 ranks
  - Connect switch(i, j) to switch(i-1, j) and to switch(i-1, m), where i is the rank, j the node, and m is j with its i-th most significant bit inverted
Other Types of Interconnect
Perfect Shuffle Network
[Diagram: 8 nodes (000..111) connected by unidirectional shuffle links (a rotation of the node's address bits) and bidirectional exchange links (pairs of nodes differing in the last bit)]
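For n = 8 the shuffle connections can be written out directly. A small sketch, assuming the usual convention that the shuffle link is a left rotate of the address bits (helper names are mine):

```python
BITS = 3
N = 1 << BITS            # 8 nodes

def shuffle(i):
    """Left-rotate the BITS-bit address of node i (the shuffle link)."""
    return ((i << 1) | (i >> (BITS - 1))) & (N - 1)

def exchange(i):
    """Flip the lowest address bit (the exchange link)."""
    return i ^ 1

print([shuffle(i) for i in range(N)])   # [0, 2, 4, 6, 1, 3, 5, 7]
```

Repeatedly alternating shuffle and exchange steps can route between any two of the 8 nodes, which is what makes the shuffle-exchange network a complete interconnect despite its degree of only 2.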
Other Types of Interconnects
Star network, systolic array network, ring, and linear array.
Different types of rings: chordal rings of degree 3 and 4, the barrel shifter, and the completely connected ring.
Common Attributes

Network Type      Switch Nodes     Diameter        Bisection Width   Edges per Node
2D Mesh           n                2(n^0.5 - 1)    n^0.5             4
Binary Tree       2n - 1           2 log(n)        1                 3
4-ary Hypertree   2n - n^0.5       log(n)          n/2               6
Butterfly         n[log(n) + 1]    log(n)          n/2               4
Hypercube         n                log(n)          n/2               log(n)
Shuffle-Exchange  n                2 log(n) - 1    ~n/log(n)         2

Source: Quinn, Michael. "Chapter 2: Parallel Architectures," Parallel Programming in C with MPI and OpenMP.
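The 2D-mesh and hypercube rows of the table can be checked against the 16-node examples from the earlier slides; a minimal sketch (function names are mine):

```python
import math

def mesh_2d(n):
    """Metrics of a square 2D mesh with n nodes (n a perfect square)."""
    side = math.isqrt(n)
    return {"diameter": 2 * (side - 1),   # corner to opposite corner
            "bisection": side,            # one column of links cut
            "edges_per_node": 4}

def hypercube(n):
    """Metrics of a hypercube with n nodes (n a power of 2)."""
    d = int(math.log2(n))
    return {"diameter": d,                # flip every address bit once
            "bisection": n // 2,          # links across one dimension
            "edges_per_node": d}

print(mesh_2d(16))     # diameter 6, bisection 4, degree 4
print(hypercube(16))   # diameter 4, bisection 8, degree 4
```

These match the "Mesh Type Interconnects" slide: the 16-node 2D mesh has diameter 6 and bisection 4, and the 4-cube has diameter 4 and bisection 8.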
Routing: A 3-Cube Example
Routing by Gray code: nodes are named with a three-bit code C2 C1 C0.
[Diagram: 8 nodes (000..111); "routing by C0" uses the links between nodes differing in bit C0, "routing by C1" the links differing in bit C1, and "routing by C2" the links differing in bit C2]
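The dimension-by-dimension correction the figure illustrates can be sketched as bit-fixing ("e-cube") routing: correct the address one bit at a time, C0 first, then C1, then C2 (the function name is mine):

```python
def cube_route(src, dst, bits=3):
    """Path from src to dst in a hypercube, fixing bits in dimension order."""
    path = [src]
    cur = src
    for i in range(bits):              # dimensions C0, C1, C2 in order
        if (cur ^ dst) & (1 << i):     # addresses differ in dimension i
            cur ^= 1 << i              # traverse the Ci link
            path.append(cur)
    return path

print([format(n, "03b") for n in cube_route(0b000, 0b101)])
# ['000', '001', '101']
```

The path length equals the number of differing bits (the Hamming distance), so no route is ever longer than the cube's diameter of 3.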
Routing: Dimension Order Routing
[Diagram: 2D mesh with 16 nodes labeled (x, y); source (0,3), destination (3,0)]
• Used for meshes and tori
• Hop along one dimension until no more hops are needed in it
• Then change to the next dimension
For each dimension i, the offset is m_i = (d_i - s_i) mod k_i, where s_i and d_i are the source and destination coordinates and k_i is the radix of dimension i; on a torus, when m_i > k_i / 2 the message travels in the opposite (shorter) direction instead.
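On the mesh of the slide the rule reduces to X-then-Y routing; a minimal sketch (the function name is mine):

```python
def xy_route(src, dst):
    """Dimension-order (X then Y) path between two 2D-mesh nodes."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                       # exhaust all hops in dimension x
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                       # then all hops in dimension y
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 3), (3, 0)))
# [(0, 3), (1, 3), (2, 3), (3, 3), (3, 2), (3, 1), (3, 0)]
```

The source/destination pair from the slide takes 6 hops, exactly the diameter of the 16-node 2D mesh; fixing dimensions in a global order is also what makes this routing deadlock-free.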
Routing: Butterfly Routing
Routing from node 2 to node 5 (destination 101):
• Send from node 2 to switch 2 in rank R0; destination bits: 101.
• R0: check the most significant remaining bit (1): send right and erase it (01 remains).
• R1: check the next bit (0): send left and erase it (1 remains).
• R2: check the last bit (1): send right and erase it (arrived).
When "right" and "left" are not obvious, a 0 means "send straight down"; try sending from 0 to 2 (010). You can also XOR the source and destination and use the result, with 0 meaning go straight down and 1 meaning take the alternative route (which produces the same route). This is how butterfly routing algorithms are actually computed.
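The XOR formulation mentioned above can be sketched in a few lines: the bits of src ^ dst, read from the most significant, say at each rank whether to go straight down (0) or take the cross edge (1). The function name is mine:

```python
def butterfly_route(src, dst, bits=3):
    """Per-rank decisions for routing src -> dst in a 2^bits-node butterfly."""
    diff = src ^ dst                     # bits where the addresses differ
    return ["cross" if (diff >> (bits - 1 - r)) & 1 else "straight"
            for r in range(bits)]

print(butterfly_route(2, 5))   # 010 ^ 101 = 111 -> cross at every rank
print(butterfly_route(0, 2))   # 000 ^ 010 = 010 -> cross only at rank 1
```

The rank-r cross edge flips the r-th most significant address bit, so crossing exactly where the addresses differ is guaranteed to land on the destination.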
Interconnects & Super Computers
• Hypercube: Cosmic Cube
• Mesh: Illiac
• Torus: Cray T3D
• Fat tree: CM-5
• Butterfly: BBN Butterfly
Topic 1e
High Performance Parallel Systems
Class A and Class B Applications
• Class A
  - Applications that are highly parallelizable
  - Low in communication
  - Regular
  - Logic flow depends little on the input data
  - Examples: matrix multiply, dot product, image processing (i.e. JPEG compression and decompression), etc.
• Class B
  - Applications in which parallelism is hidden or nonexistent
  - High communication/synchronization needs
  - Logic flow depends greatly on the input data (while loops, conditional structures)
  - Examples: sorting methods, graph and tree searching
Science Grand Challenges
• Quantum chemistry, statistical mechanics, and relativistic physics
• Cosmology and astrophysics
• Computational fluid dynamics and turbulence
• Material design and superconductivity
• Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling
• Medicine, and modeling of human organs and bones
• Global weather and environmental modeling
A Growth-Factor of a Billion in Performance in Thirty Years
10^6 MegaOPS: 1976 Cray 1, 1982 Cray XMP, 1988 Cray YMP
10^9 GigaOPS: 1991 Intel Delta, 1996 T3E
10^12 TeraOPS: 1997 ASCI Red, 2002 Earth Simulator, 2003 Cray X1
10^15 PetaOPS: 2004 BlueGene/L, 2007 IBM Cyclops-64
10^18 ExaOPS
9/6/2006 eleg652-010-06F 99
What have we learned from the past decade?
• Building a general-purpose parallel machine is a difficult task
• Proof by contradiction: many companies have gone bankrupt or left the parallel market
9/6/2006 eleg652-010-06F 100
State of Parallel Architecture Innovations
“… researchers basked in parallel-computing glory. They developed an amazing variety of parallel algorithms for every applicable sequential operation. They proposed every possible structure to interconnect thousands of processors …” However, “… the market for massively parallel computers has collapsed, and many companies have gone out of business.”
[IEEE Computer, Nov. 1994, pp. 74-75]
Until now …
9/6/2006 eleg652-010-06F 101
State of Parallel Architecture Innovations
“… The term ‘proprietary architecture’ has become pejorative. For computer designers, the revolution is over and only ‘fine tuning’ remains …”
[“End of Architecture”, Burton Smith, 1990s]
Reason for this: lack of parallel-processing software
9/6/2006 eleg652-010-06F 102
Corporations Vanishing
ETA 1989 · Multiflow 1990 · ESCD 1990 · Myrias 1991 · Meiko Scientific 1994 · Convex Computer 1994 · Pyramid 1995 · Kendall Square Research 1996 · MasPar 1996 · Cray Research 1996 · Thinking Machines 1996 · BBN 1997 · DEC 1998 · Sequent 1999 · nCube 2005
9/6/2006 eleg652-010-06F 103
Cluster Computers
• After 20 years of false starts and dead ends in high-performance computer architecture, “the way” is now clear: Beowulf clusters are becoming the platform for many scientific, engineering, and commercial applications.
[Gordon Bell, 2002: A Brief History of Supercomputing]
9/6/2006 eleg652-010-06F 104
Challenges: The Killer Latency Problem
[Diagram: two nodes, each a processor (P), cache ($), and memory (M), joined by network interfaces (NI) across a network; the network is the bottleneck]
• Due to:
– Communication: data movement, reduction, broadcast, collection
– Synchronization: signals, locks, barriers, etc.
– Thread management: quantity, cost, and idle behavior
– Load management: starvation, saturation, etc.
eleg652-010-06F 105
Application Parallelism
• Taking advantage of different levels of parallelism
• Tree structures: wave-front parallelism
• Data dependencies: dataflow execution models
• Low-cost synchronization constructs
• TLP, ILP, and speculative execution
More info: Kevin B. Theobald, Guang R. Gao, and Laurie J. Hendren, “On the Limits of Program Parallelism and its Smoothability”, Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO '92), pp. 10-19, December 1992. (A longer version in ACAPS Technical Memo 40, School of Computer Science, McGill University, June 1992.)
9/6/2006 eleg652-010-06F 106
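The wave-front parallelism mentioned above can be sketched as follows (a minimal Python sketch under an assumed dependence pattern, where each cell depends on its upper and left neighbors): all cells on one anti-diagonal are independent of each other and could execute in parallel.

```python
def wavefront(n, f):
    """Fill an n x n grid where cell (i, j) depends on (i-1, j) and
    (i, j-1).  Sweeping anti-diagonals d = i + j in order satisfies
    the dependences; every cell on one diagonal is independent and
    could be computed in parallel."""
    g = [[0] * n for _ in range(n)]
    for d in range(2 * n - 1):                       # diagonals, in order
        for i in range(max(0, d - n + 1), min(n, d + 1)):
            j = d - i                                # parallel within d
            up = g[i - 1][j] if i > 0 else 0
            left = g[i][j - 1] if j > 0 else 0
            g[i][j] = f(up, left)
    return g

# Toy recurrence: each cell is up + left + 1.
print(wavefront(3, lambda up, left: up + left + 1))
```

The inner loop over `i` is the parallel region: a real implementation would hand each diagonal's cells to a pool of threads and barrier between diagonals.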
There is an “old network saying”:
• Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed - you cannot bribe God.
- David Clark, MIT 9/6/2006 eleg652-010-06F 107
Proposed Solutions
• Low round-trip latency on small messages is very important to many Class B applications
• Minimize synchronization and communication latency: PIM, hardware lock-free data structures, …
• Fully utilize available bandwidth to hide latency: helper / loader threads, double buffering
9/6/2006 eleg652-010-06F 108
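The double-buffering idea in the last bullet can be sketched like this (an illustrative Python sketch; the names `process_stream`, `fetch`, and `compute` are hypothetical stand-ins): a loader thread fills the next buffer while the main thread computes on the current one, so fetch latency is hidden behind useful work.

```python
import threading

def process_stream(chunks, compute, fetch):
    """Double buffering: overlap fetching chunk i+1 with computing
    on chunk i, hiding (part of) the fetch latency."""
    results = []
    buf = fetch(chunks[0])               # fill the first buffer up front
    for i in range(len(chunks)):
        nxt, t = {}, None
        if i + 1 < len(chunks):
            # Loader thread fills the second buffer in the background.
            t = threading.Thread(
                target=lambda: nxt.update(data=fetch(chunks[i + 1])))
            t.start()
        results.append(compute(buf))     # compute overlaps the fetch
        if t:
            t.join()                     # wait for the loader, then swap
            buf = nxt["data"]
    return results

# Toy stand-ins: 'fetch' would be a remote read, 'compute' the real work.
print(process_stream([1, 2, 3], compute=lambda x: x * x,
                     fetch=lambda c: c * 10))   # [100, 400, 900]
```

The same pattern appears in MPI codes as non-blocking receives posted one iteration ahead of the computation that consumes them.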
The “Killer Latency” Problem
It makes the “performance debugging” of parallel programs very hard!
[Graph: performance vs. effort + expert knowledge for applications A1, A2, …, An; each curve climbs from "reasonable performance" toward "best performance" only with increasing effort and expert knowledge]
9/6/2006 eleg652-010-06F 109
And Finally …
• Memory Coherency
– Ensures that a memory operation (a write) will become visible to all actors
– Imposes no restriction on when it becomes visible
• Memory Consistency
– Ensures that two or more memory operations have a certain order among them, even when those operations come from different actors
Example: P1 runs "A = 0; B = 1; Print A" while P2 runs "B = 0; A = 1; Print B"
9/6/2006 eleg652-010-06F 110
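The ordering question can be made concrete by enumerating every interleaving of P1 and P2 that respects each thread's program order (a small Python sketch, not from the slides): under sequential consistency the two prints can observe any combination of 0 and 1, and a weaker consistency model would have to rule outcomes in or out explicitly.

```python
from itertools import combinations

def possible_prints():
    """Enumerate all interleavings of
       P1: A = 0; B = 1; print A   and   P2: B = 0; A = 1; print B
    that preserve each thread's program order, and collect every
    (printed A, printed B) pair that can be observed."""
    p1 = [("A", 0), ("B", 1), ("print", "A")]
    p2 = [("B", 0), ("A", 1), ("print", "B")]
    observed = set()
    for p1_slots in combinations(range(6), 3):   # global slots P1 occupies
        mem = {"A": None, "B": None}
        printed = {}
        i1 = i2 = 0
        for s in range(6):
            if s in p1_slots:
                op, i1 = p1[i1], i1 + 1
            else:
                op, i2 = p2[i2], i2 + 1
            if op[0] == "print":
                printed[op[1]] = mem[op[1]]      # record observed value
            else:
                mem[op[0]] = op[1]               # perform the write
        observed.add((printed["A"], printed["B"]))
    return observed

print(sorted(possible_prints()))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```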
Parallel Computer Lingo
Massive Parallel Processor Computer Amdahl’s Law Flynn’s Taxonomy (cache coherent) Non Uniform Memory Space Dance Hall Memory Systems Parallel Programming Models Parallelism and Concurrency Distributed Memory Machines and Beowulfs Super Scalar Computers Symmetric Multi Processor Computer Vector Computers and Processing Class A and B applications (Memory) Consistency Very Long Instruction Word Computers (Memory) Coherency
Expertise Gap Constellation Super Computers Constant Edge Length
9/6/2006 eleg652-010-06F 111
Bibliography
• Quinn, Michael. Parallel Programming in C with MPI and OpenMP. McGraw-Hill. ISBN 0-07-282256-2
• Dally, William; Towles, Brian. Principles and Practices of Interconnection Networks. Morgan Kaufmann. ISBN 0-12-200751-4
• Black Box Network Services’ Pocket Glossary of Computer Terms
• Intel, Introduction to Multiprocessors: http://www.intel.com/cd/ids/developer/asmo na/eng/238663.htm?prn=y
9/6/2006 eleg652-010-06F 112