CS 258
Parallel Computer Architecture
CS 258, Spring 99
David E. Culler
Computer Science Division
U.C. Berkeley
Today’s Goal:
• Introduce you to Parallel Computer Architecture
• Answer your questions about CS 258
• Provide you a sense of the trends that shape the field
What will you get out of CS258?
• In-depth understanding of the design and
engineering of modern parallel computers
– technology forces
– fundamental architectural issues
» naming, replication, communication, synchronization
– basic design techniques
» cache coherence, protocols, networks, pipelining, …
– methods of evaluation
– underlying engineering trade-offs
• from moderate to very large scale
• across the hardware/software boundary
Will it be worthwhile?
• Absolutely!
– even though few of you will become parallel processor designers
• The fundamental issues and solutions translate
across a wide spectrum of systems.
– Crisp solutions in the context of parallel machines.
• Pioneered at the thin-end of the platform pyramid on the most-demanding applications
– migrate downward with time
• Understand implications for software
[Figure: the platform pyramid: SuperServers at the apex, then Departmental Servers, Workstations, and Personal Computers at the base.]
Am I going to read my book to you?
• NO!
• Book provides a framework and complete
background, so lectures can be more interactive.
– You do the reading
– We’ll discuss it
• Projects will go “beyond”
What is Parallel Architecture?
• A parallel computer is a collection of processing elements that cooperate to solve large problems fast
• Some broad issues:
– Resource Allocation:
» how large a collection?
» how powerful are the elements?
» how much memory?
– Data access, Communication and Synchronization
» how do the elements cooperate and communicate?
» how are data transmitted between processors?
» what are the abstractions and primitives for cooperation?
– Performance and Scalability
» how does it all translate into performance?
» how does it scale?
Why Study Parallel Architecture?
Role of a computer architect:
To design and engineer the various levels of a computer system
to maximize performance and programmability within limits of
technology and cost.
Parallelism:
• Provides alternative to faster clock for performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view architecture
• Is increasingly central in information processing
Why Study it Today?
• History: diverse and innovative organizational
structures, often tied to novel programming
models
• Rapidly maturing under strong technological
constraints
– The “killer micro” is ubiquitous
– Laptops and supercomputers are fundamentally similar!
– Technological trends cause diverse approaches to converge
• Technological trends make parallel computing
inevitable
• Need to understand fundamental principles and
design tradeoffs, not just taxonomies
– Naming, Ordering, Replication, Communication performance
Is Parallel Computing Inevitable?
• Application demands: Our insatiable need for
computing cycles
• Technology Trends
• Architecture Trends
• Economics
• Current trends:
– Today’s microprocessors have multiprocessor support
– Servers and workstations becoming MP: Sun, SGI, DEC,
COMPAQ!...
– Tomorrow’s microprocessors are multiprocessors
Application Trends
• Application demand for performance fuels advances in hardware, which enables new applications, which...
– Cycle drives exponential increase in microprocessor performance
– Drives parallel architecture harder
» most demanding applications
[Diagram: a cycle between New Applications and More Performance.]
• Range of performance demands
– Need range of system performance with progressively increasing cost
Speedup
• Speedup (p processors) = Performance (p processors) / Performance (1 processor)
• For a fixed problem size (input data set), performance = 1/time
• Speedup, fixed problem (p processors) = Time (1 processor) / Time (p processors)
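A minimal sketch in C of applying the fixed-problem definition, assuming hypothetical measured times (the numbers below are placeholders, not measurements):

/* Fixed-problem speedup and parallel efficiency from wall-clock times. */
#include <stdio.h>

int main(void) {
    double t1 = 120.0;  /* Time (1 processor), seconds: placeholder value */
    double tp = 10.0;   /* Time (p processors), seconds: placeholder value */
    int p = 16;

    double speedup = t1 / tp;        /* Time(1) / Time(p) */
    double efficiency = speedup / p; /* fraction of ideal linear speedup */

    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}

Efficiency (speedup divided by p) shows how far a run falls short of the ideal linear speedup of p.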
Commercial Computing
• Relies on parallelism for high end
– Computational power determines scale of business that can
be handled
• Databases, online-transaction processing,
decision support, data mining, data warehousing
...
• TPC benchmarks (TPC-C order entry, TPC-D
decision support)
– Explicit scaling criteria provided
– Size of enterprise scales with size of system
– Problem size not fixed as p increases
– Throughput is the performance measure (transactions per minute, or tpm)
TPC-C Results for March 1996
[Figure: TPC-C throughput (tpmC, up to 25,000) versus number of processors (0 to 120) for Tandem Himalaya, DEC Alpha, SGI PowerChallenge, HP PA, IBM PowerPC, and other systems; the largest configurations, around 100 processors, reach roughly 20,000 tpmC.]
• Parallelism is pervasive
• Small to moderate scale parallelism very important
• Difficult to obtain snapshot to compare across
vendor platforms
Scientific Computing Demand
Engineering Computing Demand
• Large parallel machines a mainstay in many
industries
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion
efficiency),
– Aeronautics (airflow analysis, engine efficiency, structural
mechanics, electromagnetism),
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization
» in all of the above
» entertainment (films like Toy Story)
» architecture (walk-throughs and rendering)
– Financial modeling (yield and derivative analysis)
– etc.
Applications: Speech and Image Processing
[Figure: processing requirements, 1980 to 1995, on a log scale from 1 MIPS to 10 GIPS: sub-band speech coding, 200-word isolated speech recognition, telephone number recognition, CELP speech coding, speaker verification, 1,000-word and 5,000-word continuous speech recognition, ISDN-CD stereo receiver, CIF video, and the HDTV receiver near the 10 GIPS top.]
• Also CAD, Databases, . . .
• 100 processors gets you 10 years, 1000 gets you 20!
Is better parallel arch enough?
• AMBER molecular dynamics simulation program
• Starting point was vector code for the Cray-1
• 145 MFLOPS on the Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D
Summary of Application Trends
• Transition to parallel computing has occurred for
scientific and engineering computing
• Rapid progress underway in commercial computing
– Database and transactions as well as financial
– Usually smaller-scale, but large-scale systems also used
• Desktop also uses multithreaded programs,
which are a lot like parallel programs
• Demand for improving throughput on sequential
workloads
– Greatest use of small-scale multiprocessors
• Solid application demand exists and will
increase
- - - Little break - - -
Technology Trends
[Figure: relative performance, 1965 to 1995, on a log scale from 0.1 to 100: supercomputers, mainframes, and minicomputers improve steadily, while microprocessors climb fastest and overtake them by the mid-1990s.]
• Today the natural building-block is also fastest!
Can’t we just wait for it to get faster?
• Microprocessor performance increases 50% - 100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
• Huge investment per generation is carried by huge commodity market
[Figure: integer and FP performance, 1987 to 1992, rising from the Sun 4/260 and MIPS M/120 through the MIPS M2000, IBM RS6000/540, and HP 9000/750 to the DEC Alpha near the top of a 0-to-180 scale.]
Technology: A Closer Look
• Basic advance is decreasing feature size (λ)
– Circuits become either faster or lower in power
• Die size is growing too
– Clock rate improves roughly proportional to improvement in λ
– Number of transistors improves like λ² (or faster)
• Performance > 100x per decade
– clock rate < 10x, rest is transistor count
• How to use more transistors?
– Parallelism in processing
» multiple operations per cycle reduces CPI
– Locality in data access
» avoids latency and reduces CPI
» also improves processor utilization
– Both need resources, so tradeoff
[Diagram: processor (“Proc”) with cache (“$”) attached to an interconnect.]
• Fundamental issue is resource distribution, as in uniprocessors
Growth Rates
[Figure: two log-scale plots, 1970 to 2005, tracking microprocessors from the i4004 and i8008 through the i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000. Clock rate (MHz) grows about 30% per year; transistor count grows about 40% per year, reaching tens of millions.]
Architectural Trends
• Architecture translates technology’s gifts into
performance and capability
• Resolves the tradeoff between parallelism and
locality
– Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip
connect
– Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
=> Helps build intuition about design issues of parallel machines
=> Shows fundamental role of parallelism even in “sequential” computers
Phases in “VLSI” Generation
Bit-level parallelism -> Instruction-level -> Thread-level (?)
[Figure: transistor counts per chip, 1970 to 2005, on a log scale from 1,000 to 100,000,000: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000, annotated with the three phases above.]
Architectural Trends
• Greatest trend in VLSI generation is increase in
parallelism
– Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
» slows after 32 bit
» adoption of 64-bit now under way, 128-bit far (not
performance issue)
» great inflection point when 32-bit micro and cache fit on a
chip
– Mid 80s to mid 90s: instruction level parallelism
» pipelining and simple instruction sets, + compiler
advances (RISC)
» on-chip caches and functional units => superscalar
execution
» greater sophistication: out of order execution,
speculation, prediction
• to deal with control transfer and latency problems
– Next step: thread level parallelism
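To make the instruction-level phase concrete, here is an illustrative C sketch (my example, not from the slides) of exposing ILP at the source level: a reduction rewritten with independent accumulators breaks the serial dependence chain, giving a superscalar, out-of-order core several independent adds to issue each cycle.

/* Sum with four independent accumulator chains to expose ILP. */
double sum_ilp(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];      /* these four adds are independent, */
        s1 += a[i + 1];  /* so they can issue in the same cycle */
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* handle any leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}

Note that with floating point this reorders the additions, so results may differ in the last bits from the naive loop.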
How far will ILP go?
[Figure: two plots from an idealized superscalar study: the fraction of total cycles (%) in which 0 through 6+ instructions are issued, and achievable speedup versus instructions issued per cycle, which saturates below 3 even as issue width grows toward 15.]
• Infinite resources and fetch bandwidth, perfect
branch prediction and renaming
– real caches and non-zero miss latencies
Thread-Level Parallelism “on board”
[Diagram: four processors (“Proc”) sharing a memory (“MEM”) over a bus.]
• Micro on a chip makes it natural to connect many to shared
memory
– dominates server and enterprise market, moving down to desktop
• Faster processors began to saturate bus, then bus
technology advanced
– today, range of sizes for bus-based systems, desktop to large servers
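A minimal sketch of this shared-memory style, assuming POSIX threads stand in for the processors on the bus (the data values and thread count are made-up for illustration):

/* Threads cooperating through one shared address space: parallel sum. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double a[N];              /* shared data */
static double partial[NTHREADS]; /* one result slot per thread, so no locking */

static void *worker(void *arg) {
    long id = (long)arg;
    double s = 0.0;
    for (long i = id; i < N; i += NTHREADS) /* interleaved assignment of work */
        s += a[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t th[NTHREADS];
    for (long i = 0; i < N; i++)
        a[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, worker, (void *)t);
    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(th[t], NULL);  /* wait, then combine partial sums */
        total += partial[t];
    }
    printf("sum = %f\n", total);
    return 0;
}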
What about Multiprocessor Trends?
[Figure: number of processors in fully configured commercial shared-memory systems, 1984 to 1998: from the Sequent B8000 and B2100 and Symmetry 21/81, through the SGI PowerSeries, Power, SGI Challenge and PowerChallenge/XL, Sun SC2000/SC2000E, SS690MP, SS1000/SS1000E, SS10/SS20, SE10 through SE70, AS2100, AS8400, HP K400, and P-Pro, up to the CRAY CS6400 and the 64-processor Sun E10000.]
Bus Bandwidth
[Figure: shared-bus bandwidth (MB/s), 1984 to 1998, on a log scale from 10 to 100,000: early Sequent B8000/B2100 and Symmetry buses near the bottom; the SGI PowerSeries, Power, SS690MP, SS10/SS20, SC2000/SC2000E, SS1000/SS1000E, SE10 through SE70, AS2100, HP K400, P-Pro, SGI Challenge and PowerChallenge XL, AS8400, CS6400, and Sun E6000 in between; the Sun E10000 at the top, above 10,000 MB/s.]
What about Storage Trends?
• Divergence between memory capacity and speed even more
pronounced
– Capacity increased by 1000x from 1980-95, speed only 2x
– Gigabit DRAM by c. 2000, but gap with processor speed much greater
• Larger memories are slower, while processors get faster
– Need to transfer more data in parallel
– Need deeper cache hierarchies
– How to organize caches?
• Parallelism increases effective size of each level of
hierarchy, without increasing access time
• Parallelism and locality within memory systems too
– New designs fetch many bits within memory chip; follow with fast
pipelined transfer across narrower interface
– Buffer caches most recently accessed data
• Disks too: Parallel disks plus caching
Economics
• Commodity microprocessors not only fast but CHEAP
– Development costs tens of millions of dollars
– BUT, many more are sold compared to supercomputers
– Crucial to take advantage of the investment, and use the commodity
building block
• Multiprocessors being pushed by software vendors (e.g.
database) as well as hardware vendors
• Standardization makes small, bus-based SMPs commodity
• Desktop: few smaller processors versus one larger one?
• Multiprocessor on a chip?
Can we see some hard evidence?
Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and
techniques
– Market smaller relative to commercial as MPs become mainstream
– Dominated by vector machines starting in 70s
– Microprocessors have made huge gains in floating-point
performance
» high clock rates
» pipelined floating point units (e.g., multiply-add every cycle)
» instruction-level parallelism
» effective use of caches (e.g., automatic blocking)
– Plus economics
• Large-scale multiprocessors replace vector supercomputers
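As one concrete instance of the cache techniques mentioned above, a sketch of blocked (tiled) matrix multiply in C; the tile size BS is an assumed, machine-dependent tuning parameter, and C is expected to be zeroed (or to hold a prior partial result) on entry:

/* C += A * B for n x n row-major matrices, tiled so a BS x BS block
 * of B is reused while it remains resident in cache. */
#define BS 32  /* tile size: assumed value, tune per machine */

void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int jj = 0; jj < n; jj += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jj + BS && j < n; j++) {
                    double s = C[i * n + j];
                    for (int k = kk; k < kk + BS && k < n; k++)
                        s += A[i * n + k] * B[k * n + j];
                    C[i * n + j] = s;
                }
}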
Raw Uniprocessor Performance: LINPACK
[Figure: LINPACK performance (MFLOPS, log scale, 1 to 10,000), 1975 to 2000, for n = 100 and n = 1,000: CRAY vector machines (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) versus microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, DEC Alpha and Alpha AXP, HP9000/735, MIPS R4400, IBM Power2/990, DEC 8200); the micros close most of the gap by the mid-1990s.]
Raw Parallel Performance: LINPACK
[Figure: LINPACK performance (GFLOPS, log scale, 0.1 to 10,000), 1985 to 1996: CRAY peak (Xmp/416(4), Ymp/832(8), C90(16), T932(32)) versus MPP peak (iPSC/860, nCUBE/2 (1024), CM-2, CM-200, Delta, CM-5, T3D, Paragon XP/S, Paragon XP/S MP at 1,024 and 6,768 processors, and ASCI Red at the top).]
• Even vector Crays became parallel
– X-MP (2-4), Y-MP (8), C-90 (16), T94 (32)
• Since 1993, Cray produces MPPs too (T3D, T3E)
500 Fastest Computers
[Figure: number of systems of each class among the 500 fastest computers, 11/93 to 11/96: MPPs grow from 187 to 319, PVPs fall from 313 to 106, and SMPs go from 63 up to about 106 and back down to 73.]
Summary: Why Parallel Architecture?
• Increasingly attractive
– Economics, technology, architecture, application demand
• Increasingly central and mainstream
• Parallelism exploited at many levels
– Instruction-level parallelism
– Multiprocessor servers
– Large-scale multiprocessors (“MPPs”)
• Focus of this class: multiprocessor level of
parallelism
• Same story from memory system perspective
– Increase bandwidth, reduce average latency with many local
memories
• A spectrum of parallel architectures makes sense
– Different cost, performance, and scalability
Where is Parallel Arch Going?
Old view: Divergent architectures, no predictable pattern of growth.
[Diagram: divergent architectures (Systolic Arrays, SIMD, Message Passing, Dataflow, Shared Memory), each with its own application software, system software, and architecture.]
• Uncertainty of direction paralyzed parallel software development!
Today
• Extension of “computer architecture” to support
communication and cooperation
– Instruction Set Architecture plus Communication Architecture
• Defines
– Critical abstractions, boundaries, and primitives (interfaces)
– Organizational structures that implement interfaces (hw or sw)
• Compilers, libraries and OS are important bridges
today
Modern Layered Framework
[Diagram: layered framework. Parallel applications (CAD, database, scientific modeling, multiprogramming) sit atop programming models (shared address, message passing, data parallel), realized by compilation or a library; below lie the communication abstraction (the user/system boundary), operating systems support, the hardware/software boundary, communication hardware, and the physical communication medium.]
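A minimal sketch of the message-passing model in this framework, using MPI as one concrete library realization (a hypothetical two-process example, my addition):

/* Rank 0 sends one integer through the communication abstraction;
 * rank 1 receives it. Run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;  /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}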
How will we spend our time?
http://www.cs.berkeley.edu/~culler/cs258-s99/schedule.html
How will grading work?
• 30% homeworks (6)
• 30% exam
• 30% project (teams of 2)
• 10% participation
Any other questions?