Distributed Memory Machines and Programming
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr09
CS267 Lecture 5
Recap of Last Lecture
• Shared memory multiprocessors
• Caches may be either shared or distributed.
• Multicore chips are likely to have shared caches
• Cache hit performance is better if they are distributed
(each cache is smaller/closer) but they must be kept
coherent -- multiple cached copies of same location must
be kept equal.
• Requires clever hardware (see CS258, CS252).
• Distant memory much more expensive to access.
• Machines scale to 10s or 100s of processors.
• Shared memory programming
• Starting, stopping threads.
• Communication by reading/writing shared variables.
• Synchronization with locks, barriers.
Outline
• Distributed Memory Architectures
• Properties of communication networks
• Topologies
• Performance models
• Programming Distributed Memory Machines
using Message Passing
• Overview of MPI
• Basic send/receive use
• Non-blocking communication
• Collectives
• (may continue into next lecture)
Historical Perspective
• Early machines were:
• Collection of microprocessors.
• Communication was performed using bi-directional queues
between nearest neighbors.
• Messages were forwarded by processors on path.
• “Store and forward” networking
• There was a strong emphasis on topology in algorithms,
in order to minimize the number of hops = minimize time
Network Analogy
• To have a large number of different transfers occurring at
once, you need a large number of distinct wires
• Not just a bus, as in shared memory
• Networks are like streets:
• Link = street.
• Switch = intersection.
• Distances (hops) = number of blocks traveled.
• Routing algorithm = travel plan.
• Properties:
• Latency: how long to get between nodes in the network.
• Bandwidth: how much data can be moved per unit time.
• Bandwidth is limited by the number of wires and the rate at
which each wire can accept data.
Design Characteristics of a Network
• Topology (how things are connected)
• Crossbar, ring, 2-D and 3-D mesh or torus,
hypercube, tree, butterfly, perfect shuffle ....
• Routing algorithm:
• Example in 2D torus: all east-west then all
north-south (avoids deadlock).
• Switching strategy:
• Circuit switching: full path reserved for entire
message, like the telephone.
• Packet switching: message broken into separately routed packets, like the post office.
• Flow control (what if there is congestion):
• Stall, store data temporarily in buffers, re-route data
to other nodes, tell source node to temporarily halt,
discard, etc.
Performance Properties of a Network: Latency
• Diameter: the maximum (over all pairs of nodes) of the
shortest path between a given pair of nodes.
• Latency: delay between send and receive times
• Latency tends to vary widely across architectures
• Vendors often report hardware latencies (wire time)
• Application programmers care about software
latencies (user program to user program)
• Observations:
• Hardware/software latencies often differ by 1-2
orders of magnitude
• Maximum hardware latency varies with diameter, but
the variation in software latency is usually negligible
• Latency is important for programs with many small
messages
Latency on Some Recent Machines/Networks
[Bar chart: 8-byte roundtrip MPI ping-pong latency (usec) on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; values range from about 6.6 to 24.2 usec.]
• Latencies shown are from a ping-pong test using MPI
• These are roundtrip numbers: many people use ½ of roundtrip time to approximate 1-way latency (which can't easily be measured)
End to End Latency (1/2 roundtrip) Over Time
[Scatter plot: one-way latency (usec, log scale, roughly 1 to 36 usec) vs. year, 1990-2005, for machines including nCube/2, CM5, CS2, SP1, SP2, Paragon, KSR, Cenju3, T3D, T3E, SPP, Myrinet, Quadrics, and SP-Power3.]
• Latency has not improved significantly, unlike Moore's Law
• T3E (shmem) was lowest point – in 1997
Data from Kathy Yelick, UCB and NERSC
Performance Properties of a Network: Bandwidth
• The bandwidth of a link = # wires / time per bit
• Bandwidth typically in Gigabytes/sec (GB/s), i.e., 8*2^30 bits per second
• Effective bandwidth is usually lower than physical link bandwidth due to packet overhead: each packet carries a routing-and-control header and an error-code trailer in addition to the data payload.
• Bandwidth is important for applications with mostly large messages
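To make the packet overhead concrete, here is a minimal sketch (not from the slides) that computes effective bandwidth from assumed packet parameters; the link rate, payload, and header/trailer sizes below are invented for illustration only.

#include <stdio.h>

int main(void)
{
    /* Illustration values only (assumed), not any particular network. */
    double link_bw  = 1.0e9;   /* physical link bandwidth, bytes/sec */
    double payload  = 2048.0;  /* data payload per packet, bytes */
    double overhead = 64.0;    /* routing/control header + error-code trailer, bytes */

    /* Only the payload fraction of each packet carries user data. */
    double effective_bw = link_bw * payload / (payload + overhead);

    printf("effective bandwidth = %.3g bytes/sec (%.1f%% of link)\n",
           effective_bw, 100.0 * payload / (payload + overhead));
    return 0;
}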
Bandwidth on Existing Networks
[Bar chart: MPI flood bandwidth for 2MB messages, as a percent of hardware peak, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; absolute bandwidths range from roughly 225 to 1504 MB/s.]
• Flood bandwidth (throughput of back-to-back 2MB messages)
Bandwidth Chart
Note: bandwidth depends on SW, not just HW
[Line chart: bandwidth (MB/sec, up to about 400) vs. message size (2048 to 131072 bytes) for T3E/MPI, T3E/Shmem, IBM/MPI, IBM/LAPI, Compaq/Put, Compaq/Get, M2K/MPI, M2K/GM, Dolphin/MPI, Giganet/VIPL, and SysKonnect.]
Data from Mike Welcome, NERSC
Performance Properties of a Network: Bisection Bandwidth
• Bisection bandwidth: bandwidth across smallest cut that
divides network into two equal halves
• Bandwidth across “narrowest” part of the network
[Diagram: example cuts through two networks; one has bisection bandwidth = link bandwidth, the other has bisection bandwidth = sqrt(n) * link bandwidth. A cut that does not divide the network into two equal halves is not a bisection cut.]
• Bisection bandwidth is important for algorithms in which
all processors need to communicate with all others
Network Topology
• In the past, there was considerable research in network
topology and in mapping algorithms to topology.
• Key cost to be minimized: number of “hops” between
nodes (e.g. “store and forward”)
• Modern networks hide hop cost (i.e., “wormhole
routing”), so topology is no longer a major factor in
algorithm performance.
• Example: On IBM SP system, hardware latency varies
from 0.5 usec to 1.5 usec, but user-level message
passing latency is roughly 36 usec.
• Need some background in network topology
• Algorithms may have a communication topology
• Topology affects bisection bandwidth.
Linear and Ring Topologies
• Linear array
• Diameter = n-1; average distance ~n/3.
• Bisection bandwidth = 1 (in units of link bandwidth).
• Torus or Ring
• Diameter = n/2; average distance ~ n/4.
• Bisection bandwidth = 2.
• Natural for algorithms that work with 1D arrays.
Meshes and Tori
• Two-dimensional mesh
  • Diameter = 2 * (sqrt(n) – 1)
  • Bisection bandwidth = sqrt(n)
• Two-dimensional torus
  • Diameter = sqrt(n)
  • Bisection bandwidth = 2 * sqrt(n)
• Generalizes to higher dimensions (Cray T3D used 3D Torus).
• Natural for algorithms that work with 2D and/or 3D arrays (matmul)
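As a quick sanity check of the mesh/torus formulas above, here is a minimal sketch (not from the slides) that evaluates them for an assumed node count n = 1024, counting bisection bandwidth in units of link bandwidth.

#include <math.h>
#include <stdio.h>

int main(void)
{
    int n = 1024;                 /* example node count (assumed) */
    double p = sqrt((double)n);   /* nodes per side */

    printf("2D mesh : diameter = %.0f, bisection = %.0f links\n",
           2.0 * (p - 1.0), p);
    printf("2D torus: diameter = %.0f, bisection = %.0f links\n",
           p, 2.0 * p);           /* wraparound halves the diameter, doubles the cut */
    return 0;
}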
Hypercubes
• Number of nodes n = 2d for dimension d.
• Diameter = d.
• Bisection bandwidth = n/2.
[Figure: hypercubes of dimension 0d, 1d, 2d, 3d, and 4d.]
• Popular in early machines (Intel iPSC, NCUBE).
• Lots of clever algorithms.
• See 1996 online CS267 notes.
• Gray code addressing:
  • Each node connected to d others with 1 bit different.
  [Figure: 3-cube with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111.]
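A minimal sketch (my illustration, not from the notes) of the "1 bit different" rule: each node's d neighbors are found by flipping one address bit at a time.

#include <stdio.h>

static void print_neighbors(unsigned k, int d)
{
    printf("node %u:", k);
    for (int bit = 0; bit < d; bit++)
        printf(" %u", k ^ (1u << bit));   /* flip one bit to get a neighbor */
    printf("\n");
}

int main(void)
{
    int d = 3;                            /* 3-cube: n = 2^d = 8 nodes */
    for (unsigned k = 0; k < (1u << d); k++)
        print_neighbors(k, d);
    return 0;
}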
Trees
• Diameter = log n.
• Bisection bandwidth = 1.
• Easy layout as planar graph.
• Many tree algorithms (e.g., summation).
• Fat trees avoid bisection bandwidth problem:
  • More (or wider) links near top.
  • Example: Thinking Machines CM-5.
Butterflies
• Diameter = log n.
• Bisection bandwidth = n.
• Cost: lots of wires.
• Used in BBN Butterfly.
• Natural for FFT.
[Figure: a butterfly switch and a multistage butterfly network. Example: to get from proc 101 to 110, compare the addresses bit by bit and switch at a stage only if the bits disagree.]
Topologies in Real Machines (listed roughly newest to oldest)
  Cray XT3 and XT4                        3D Torus (approx)
  Blue Gene/L                             3D Torus
  SGI Altix                               Fat tree
  Cray X1                                 4D Hypercube*
  Myricom (Millennium)                    Arbitrary
  Quadrics (in HP Alpha server clusters)  Fat tree
  IBM SP                                  Fat tree (approx)
  SGI Origin                              Hypercube
  Intel Paragon (old)                     2D Mesh
  BBN Butterfly (really old)              Butterfly
* Many of these are approximations: e.g., the X1 is really a "quad bristled hypercube" and some of the fat trees are not as fat as they should be at the top.
Evolution of Distributed Memory Machines
• Special queue connections are being replaced by direct
memory access (DMA):
• Processor packs or copies messages.
• Initiates transfer, goes on computing.
• Wormhole routing in hardware:
• Special message processors do not interrupt main processors along
path.
• Message sends are pipelined.
• Processors don’t wait for complete message before forwarding
• Message passing libraries provide store-and-forward
abstraction:
• Can send/receive between any pair of nodes, not just along one wire.
• Time depends on distance since each processor along path must
participate.
Performance Models
Shared Memory Performance Models
• Parallel Random Access Memory (PRAM)
• All memory access operations complete in one clock
period -- no concept of memory hierarchy (“too good to
be true”).
• OK for understanding whether an algorithm has enough
parallelism at all (see CS273).
• Parallel algorithm design strategy: first do a PRAM algorithm,
then worry about memory/communication time (sometimes
works)
• Slightly more realistic versions exist
• E.g., Concurrent Read Exclusive Write (CREW) PRAM.
• Still missing the memory hierarchy
Measured Message Time
[Line chart (Excel pivot chart, "Sum of gap"): measured time per message (usec, log scale 1 to 10000) vs. message size (8 bytes to 128 KB) for T3E/Shm, T3E/MPI, IBM/LAPI, IBM/MPI, Quadrics/Shm, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, and GigE/MPI.]
Latency and Bandwidth Model
• Time to send message of length n is roughly
Time = latency + n*cost_per_word
= latency + n/bandwidth
• Topology is assumed irrelevant.
• Often called the "α-β model" and written
    Time = α + n*β
• Usually α >> β >> time per flop.
• One long message is cheaper than many short ones (see the sketch after this slide):
    α + n*β << n*(α + 1*β)
• Can do hundreds or thousands of flops for cost of one message.
• Lesson: Need large computation-to-communication ratio
to be efficient.
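A small sketch (not from the slides) that plugs assumed values of α and β into the model to show why one long message beats many short ones; the parameter values are invented but are roughly the scale of the table on the next slide.

#include <stdio.h>

int main(void)
{
    double alpha = 10.0;      /* latency per message, usec (assumed) */
    double beta  = 0.003;     /* time per byte, usec (assumed) */
    double n     = 100000.0;  /* total bytes to move */
    int    k     = 1000;      /* number of messages if the data is split up */

    double one_big    = alpha + n * beta;                /* one long message */
    double many_small = k * (alpha + (n / k) * beta);    /* k short messages */

    printf("one message of %.0f bytes: %8.1f usec\n", n, one_big);
    printf("%d messages of %.0f bytes: %8.1f usec\n", k, n / (double)k, many_small);
    return 0;
}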
Alpha-Beta Parameters on Current Machines
• These numbers were obtained empirically
  machine         α (latency, usec)   β (usec per Byte)
  T3E/Shm               1.2               0.003
  T3E/MPI               6.7               0.003
  IBM/LAPI              9.4               0.003
  IBM/MPI               7.6               0.004
  Quadrics/Get          3.267             0.00498
  Quadrics/Shm          1.3               0.005
  Quadrics/MPI          7.3               0.005
  Myrinet/GM            7.7               0.005
  Myrinet/MPI           7.2               0.006
  Dolphin/MPI           7.767             0.00529
  Giganet/VIPL          3.0               0.010
  GigE/VIPL             4.6               0.008
  GigE/MPI              5.854             0.00872
• How well does the model Time = α + n*β predict actual performance?
Model Time Varying Message Size & Machines
[Line chart (Excel pivot chart, "Sum of model"): modeled time α + n*β (usec, log scale 1 to 10000) vs. message size (8 bytes to 128 KB) for T3E/Shm, T3E/MPI, IBM/LAPI, IBM/MPI, Quadrics/Shm, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, and GigE/MPI.]
Measured Message Time
[The measured-time chart above is repeated here for comparison with the model.]
LogP Parameters: Overhead & Latency
• Non-overlapping overhead:
  [Diagram: P0 pays send overhead osend, the message spends latency L in the network, then P1 pays receive overhead orecv.]
  EEL = End-to-End Latency = osend + L + orecv
• Send and recv overhead can overlap:
  EEL = f(osend, L, orecv) ≥ max(osend, L, orecv)
LogP Parameters: gap
• The gap is the delay between sending messages
• Gap could be greater than send overhead
  • NIC may be busy finishing the processing of the last message and cannot accept a new one.
  • Flow control or backpressure on the network may prevent the NIC from accepting the next message to send.
• No overlap ⇒ time to send n messages (pipelined) =
    (osend + L + orecv − gap) + n*gap = α + n*β   (see the sketch below)
  [Diagram: P0 issues back-to-back sends to P1, each separated by the gap.]
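A minimal sketch (my illustration) of the pipelined-send formula above; the LogP parameter values are invented for illustration only.

#include <stdio.h>

int main(void)
{
    double o_send = 2.0, o_recv = 2.0;  /* send/receive overheads, usec (assumed) */
    double L      = 5.0;                /* network latency, usec (assumed) */
    double gap    = 4.0;                /* minimum spacing between sends, usec (assumed) */
    int    n      = 100;                /* number of small, fixed-size messages */

    /* The first message pays the full end-to-end latency; each later one
       only adds the gap, so the gap plays the role of the per-message β. */
    double t = (o_send + L + o_recv - gap) + n * gap;

    printf("alpha-like term = %.1f usec, per-message term (gap) = %.1f usec\n",
           o_send + L + o_recv - gap, gap);
    printf("time for %d pipelined messages = %.1f usec\n", n, t);
    return 0;
}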
Results: EEL and Overhead
[Bar chart (usec, 0 to 25): send overhead (alone), send & receive overhead, receive overhead (alone), and added latency for T3E/MPI, T3E/Shmem, T3E/E-reg, IBM/MPI, IBM/LAPI, Quadrics/MPI, Quadrics/Put, Quadrics/Get, M2K/MPI, M2K/GM, Dolphin/MPI, and Giganet/VIPL.]
Data from Mike Welcome, NERSC
Send Overhead Over Time
[Scatter plot: send overhead (usec, 0 to 14) vs. year, roughly 1990-2002, for machines including NCube/2, CM5, Meiko, Paragon, T3D, T3E, SCI, Cenju4, SP3, Myrinet, Myrinet2K, Dolphin, and Compaq.]
• Overhead has not improved significantly; T3D was best
• Lack of integration; lack of attention in software
Data from Kathy Yelick, UCB and NERSC
Limitations of the LogP Model
• The LogP model has a fixed cost for each message
  • This is useful in showing how to quickly broadcast a single word
  • Other examples also in the LogP papers
• For larger messages, there is a variation, LogGP
  • Two gap parameters, one for small and one for large messages
  • The large-message gap is the β in our previous model
• No topology considerations (including no limits for
bisection bandwidth)
• Assumes a fully connected network
• OK for some algorithms with nearest neighbor communication,
but with “all-to-all” communication we need to refine this further
• This is a flat model, i.e., each processor is connected to
the network
• Clusters of SMPs are not accurately modeled
Programming Distributed Memory Machines with Message Passing
Message Passing Libraries (1)
• Many "message passing libraries" were once available
  • Chameleon, from ANL.
  • CMMD, from Thinking Machines.
  • Express, commercial.
  • MPL, native library on IBM SP-2.
  • NX, native library on Intel Paragon.
  • Zipcode, from LLL.
  • PVM, Parallel Virtual Machine, public, from ORNL/UTK.
  • Others...
  • MPI, Message Passing Interface, now the industry standard.
• Need standards to write portable code.
• Rest of this discussion independent of which library.
• MPI details later
Message Passing Libraries (2)
• All communication and synchronization require subroutine calls
  • No shared variables
  • Program runs on a single processor just like any uniprocessor program, except for calls to the message passing library
• Subroutines for
  • Communication
    • Pairwise or point-to-point: Send and Receive
    • Collectives: all processors get together to
      – Move data: Broadcast, Scatter/Gather
      – Compute and move: sum, product, max, … of data on many processors (see the sketch below)
  • Synchronization
    • Barrier
    • No locks because there are no shared variables to protect
  • Enquiries
    • How many processes? Which one am I? Any messages waiting?
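A hedged sketch (not the lecture's code) of the collective calls named above: MPI_Reduce to "compute and move" a sum onto one process, then MPI_Bcast to move the result back to everyone.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, mine, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    mine = rank + 1;
    /* Compute and move: sum everyone's value onto rank 0. */
    MPI_Reduce(&mine, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    /* Move data: broadcast the result from rank 0 to every process. */
    MPI_Bcast(&total, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d of %d: total = %d\n", rank, size, total);
    MPI_Finalize();
    return 0;
}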
Novel Features of MPI
• Communicators encapsulate communication spaces for
library safety
• Datatypes reduce copying costs and permit
heterogeneity
• Multiple communication modes allow precise buffer
management
• Extensive collective operations for scalable global
communication
• Process topologies permit efficient process placement,
user views of process layout
• Profiling interface encourages portable tools
Slide source: Bill Gropp, ANL
MPI References
• The Standard itself:
• at http://www.mpi-forum.org
• All MPI official releases, in both postscript and HTML
• Other information on Web:
• at http://www.mcs.anl.gov/mpi
• pointers to lots of stuff, including other talks and
tutorials, a FAQ, other MPI pages
Slide source: Bill Gropp, ANL
Books on MPI
• Using MPI: Portable Parallel Programming
with the Message-Passing Interface (2nd edition),
by Gropp, Lusk, and Skjellum, MIT Press,
1999.
• Using MPI-2: Portable Parallel Programming
with the Message-Passing Interface, by Gropp,
Lusk, and Thakur, MIT Press, 1999.
• MPI: The Complete Reference - Vol 1 The MPI Core, by
Snir, Otto, Huss-Lederman, Walker, and Dongarra, MIT
Press, 1998.
• MPI: The Complete Reference - Vol 2 The MPI Extensions,
by Gropp, Huss-Lederman, Lumsdaine, Lusk, Nitzberg,
Saphir, and Snir, MIT Press, 1998.
• Designing and Building Parallel Programs, by Ian Foster,
Addison-Wesley, 1995.
• Parallel Programming with MPI, by Peter Pacheco, Morgan Kaufmann, 1997.
Slide source: Bill Gropp, ANL
Programming With MPI
• MPI is a library
• All operations are performed with routine calls
• Basic definitions in
• mpi.h for C
• mpif.h for Fortran 77 and 90
• MPI module for Fortran 90 (optional)
• First Program:
• Create 4 processes in a simple MPI job
• Write out process number
• Write out some variables (illustrate separate name
space)
Slide source: Bill Gropp, ANL
Finding Out About the Environment
• Two important questions that arise early in a
parallel program are:
• How many processes are participating in this
computation?
• Which one am I?
• MPI provides functions to answer these
questions:
•MPI_Comm_size reports the number of processes.
•MPI_Comm_rank reports the rank, a number between
0 and size-1, identifying the calling process
Slide source: Bill Gropp, ANL
Hello (C)
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
int rank, size;
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
printf( "I am %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}
Slide source: Bill Gropp, ANL
Hello (Fortran)
      program main
      include 'mpif.h'
      integer ierr, rank, size
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
      print *, 'I am ', rank, ' of ', size
      call MPI_FINALIZE( ierr )
      end
Slide source: Bill Gropp, ANL
Hello (C++)
#include "mpi.h"
#include <iostream>
int main( int argc, char *argv[] )
{
int rank, size;
MPI::Init(argc, argv);
rank = MPI::COMM_WORLD.Get_rank();
size = MPI::COMM_WORLD.Get_size();
std::cout << "I am " << rank << " of " << size <<
"\n";
MPI::Finalize();
return 0;
}
Slide source: Bill Gropp, ANL
Notes on Hello World
• All MPI programs begin with MPI_Init and end with
MPI_Finalize
• MPI_COMM_WORLD is defined by mpi.h (in C) or
mpif.h (in Fortran) and designates all processes in the
MPI “job”
• Each statement executes independently in each process
• including the printf/print statements
• I/O is not part of MPI-1 but is in MPI-2
  • print and write to standard output or error are not part of either MPI-1 or MPI-2
• output order is undefined (may be interleaved by character, line,
or blocks of characters),
• The MPI-1 Standard does not specify how to run an MPI
program, but many implementations provide
mpirun –np 4 a.out
Slide source: Bill Gropp, ANL
MPI Basic Send/Receive
• We need to fill in the details in
    Process 0: Send(data)   →   Process 1: Receive(data)
• Things that need specifying:
• How will “data” be described?
• How will processes be identified?
• How will the receiver recognize/screen messages?
• What will it mean for these operations to complete?
Slide source: Bill Gropp, ANL
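A minimal sketch (not the lecture's code, which comes later) of how those details get filled in for one integer: the data is described by (buffer, count, datatype), processes are identified by rank in MPI_COMM_WORLD, the receiver screens on a tag (99 here, arbitrary), and the receive completes once the data has arrived.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, data;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);   /* to rank 1, tag 99 */
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d from rank 0\n", data);     /* completes when data is here */
    }

    MPI_Finalize();
    return 0;
}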
Some Basic Concepts
• Processes can be collected into groups
• Each message is sent in a context, and must be
received in the same context
• Provides necessary support for libraries
• A group and context together form a
communicator
• A process is identified by its rank in the group
associated with a communicator
• There is a default communicator whose group
contains all initial processes, called
MPI_COMM_WORLD
Slide source: Bill Gropp, ANL
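A hedged sketch (my illustration, not from the slides) of building a new communicator: MPI_Comm_split divides MPI_COMM_WORLD into even-rank and odd-rank groups, and each process gets a new rank within its own group.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int world_rank, sub_rank;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* color selects the group; key (world_rank) orders ranks within it */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);
    MPI_Comm_rank(sub_comm, &sub_rank);

    printf("world rank %d is rank %d in the %s communicator\n",
           world_rank, sub_rank, (world_rank % 2) ? "odd" : "even");

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}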
MPI Datatypes
• The data in a message to send or receive is
described by a triple (address, count, datatype),
where
• An MPI datatype is recursively defined as:
• predefined, corresponding to a data type from the
language (e.g., MPI_INT, MPI_DOUBLE)
• a contiguous array of MPI datatypes
• a strided block of datatypes
• an indexed array of blocks of datatypes
• an arbitrary structure of datatypes
• There are MPI functions to construct custom
datatypes, in particular ones for subarrays
Slide source: Bill Gropp, ANL
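A hedged sketch (my illustration) of one of the constructors listed above: MPI_Type_vector describes a strided block, here one column of a 4x4 row-major matrix, which can then be sent or received as a single unit.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    double a[4][4];
    MPI_Datatype column;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            a[i][j] = (rank == 0) ? 10.0 * i + j : -1.0;

    /* 4 blocks of 1 double each, stride of 4 doubles: column 0 of a[][] */
    MPI_Type_vector(4, 1, 4, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(&a[0][0], 1, column, 1, 0, MPI_COMM_WORLD);   /* send column 0 */
    else if (rank == 1) {
        MPI_Recv(&a[0][0], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 got column: %g %g %g %g\n", a[0][0], a[1][0], a[2][0], a[3][0]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}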
MPI Tags
• Messages are sent with an accompanying user-defined integer tag, to assist the receiving
process in identifying the message
• Messages can be screened at the receiving end
by specifying a specific tag, or not screened by
specifying MPI_ANY_TAG as the tag in a
receive
• Some non-MPI message-passing systems have
called tags “message types”. MPI calls them
tags to avoid confusion with datatypes
Slide source: Bill Gropp, ANL
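A hedged sketch (my illustration) of screening by tag: rank 1 first asks specifically for the tag-2 message, then accepts whatever remains with MPI_ANY_TAG. (The messages are tiny and assumed to be buffered eagerly, so receiving them out of order is safe here.)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, x = 1, y = 2, got;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&x, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);   /* tag 1 */
        MPI_Send(&y, 1, MPI_INT, 1, 2, MPI_COMM_WORLD);   /* tag 2 */
    } else if (rank == 1) {
        MPI_Recv(&got, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);           /* only tag 2 */
        printf("first received %d (tag %d)\n", got, status.MPI_TAG);
        MPI_Recv(&got, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status); /* any tag */
        printf("then received %d (tag %d)\n", got, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}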
Extra Slides
Challenge 2010 - scaling to Petaflops level
• Applications will face (at least) three challenges
• Scaling to 100,000s of processors
• Interconnect topology
• Memory access
• We have yet to scale to the 100,000 processor level
  • Algorithms
  • Tools
  • System software
Challenge 2010 - 2018: Developing a New Ecosystem for HPC
From the NRC Report on “The Future of Supercomputing”:
• Platforms, software, institutions, applications, and people who solve
supercomputing applications can be thought of collectively as an ecosystem
• Research investment in HPC should be informed by the ecosystem point of
view - progress must come on a broad front of interrelated technologies,
rather than in the form of individual breakthroughs.
Pond ecosystem image from http://www.tpwd.state.tx.us/expltx/eft/txwild/pond.htm
Supercomputing Ecosystem (until about 1988)
Cold War and Big Oil spending in the 1980s
Powerful Vector Supercomputers
20 years of Fortran applications base in physics codes and third party apps
Supercomputing Ecosystem (2005)
Commercial Off The Shelf technology (COTS)
"Clusters"
12 years of legacy MPI applications base
How Did We Make the Change?
• Massive R&D Investment
• HPCC in the US
• Vigorous computer science experimentation in languages,
tools, system software
• Development of Grand Challenge applications
• External Driver
• Industry transition to CMOS micros
• All changes happened virtually at once
• Ecosystem change
Observations on the 2005 Ecosystem
• It is very stable
• attempts of re-introducing old species failed (X1)
• attempts of introducing new species failed (mutation of
Blue Gene 1999 to BG/L 2005)
• It works well
• just look around the room
• So why isn’t everybody happy and content?
Limits to Cluster Based Systems for HPC
• Memory Bandwidth
• Commodity memory interfaces [SDRAM, RDRAM, DDRAM]
• Separation of memory and CPU implementations limits
performance
• Communications fabric/CPU/Memory Integration
• Current networks are attached via I/O devices
• Limits bandwidth and latency and communication semantics
• Node and system packaging density
• Commodity components and cooling technologies limit
densities
• Blade based servers moving in right direction but are not
High Performance
• Ad Hoc Large-scale Systems Architecture
• Little functionality for RAS
• Lack of systems software for production environment
• … but departmental and single applications clusters
will be highly successful
After Rick Stevens, Argonne
Comparison Between Architectures (2001)
                        Alvarez         Seaborg       Mcurie
  Processor             Pentium III     Power 3       EV-5
  Clock speed (MHz)     867             375           450
  # nodes               80              184           644
  # processors/node     2               16            -
  Peak (GF/s)           139             4416          579.6
  Memory (GB/node)      1               16-64         0.256
  Interconnect          Myrinet 2000    Colony        T3E
  Disk (TB)             1.5             20            2.5
Source: Tammy Welcome, NERSC
Performance Comparison (2)
[Table: Class C NAS Parallel Benchmark results (BT, CG, EP, FT, IS, LU, MG, SP) and per-processor SSP (Gflops/s) for Alvarez, Seaborg, and Mcurie on 64 and 128 processors.]
Source: Tammy Welcome, NERSC
Summary – Wrap-up
• Network structure and concepts
• Switching, routing, flow control
• Topology, bandwidth, latency, bisection bandwidth, diameter
• Performance models
• PRAM, α-β, and LogP
• Workstation/PC clusters
• Programming environment, hardware
• Challenges
• Message passing implementation
Wednesday Lecture Ended Here
Effectiveness of Commodity PC Clusters
• Dollars/performance based on peak
• SP and Alvarez are comparable $/TF
• Get lower % of peak on Alvarez than SP
• Based on SSP, 4.5% versus 7.2% for FP intensive
applications
• Based on sequential NPBs, 5-13.8% versus 6.3-21.6% for
FP intensive applications
• x86 known not to perform well on FP intensive applications
• $/Performance and cost of ownership need to be
examined much more closely
• Above numbers do not take into account differences in
system balance or configuration
• SP was aggressively priced
• Alvarez was vendor-integrated, not self-integrated
Source: Tammy Welcome, NERSC
Workstation/PC Clusters
• Reaction to commercial MPPs:
• build parallel machines out of commodity components
• Inexpensive workstations or PCs as computing nodes
• Fast (gigabit) switched network between nodes
• Benefits:
  • 10x - 100x cheaper for comparable performance
  • Standard OS on each node
  • Follow commodity tech trends
  • Incrementally upgradable and scalable
  • Fault tolerance
• Trends:
  • Berkeley NOW (1994): 100 UltraSPARCs, Myrinet
  • ASCI RED (1997): 4510 dual Pentium II nodes, custom network
  • Millennium (1999): 100+ dual/quad Pentium IIIs, Myrinet
  • Google (2001): 8000+ node Linux cluster, ??? network