CS 267:
Introduction to Parallel Machines
and Programming Models
James Demmel
[email protected]
www.cs.berkeley.edu/~demmel/cs267_Spr05
01/26/2005
CS267 Lecture 3
Outline
• Overview of parallel machines and programming models
  • Shared memory
  • Shared address space
  • Message passing
  • Data parallel
  • Clusters of SMPs
  • Grid
• Trends in real machines
A generic parallel architecture
[Figure: processors (P) with local memories (M) connected to each other and to additional memory through an interconnection network]
° Where is the memory physically located?
Parallel Programming Models
• Control
  • How is parallelism created?
  • What orderings exist between operations?
  • How do different threads of control synchronize?
• Data
  • What data is private vs. shared?
  • How is logically shared data accessed or communicated?
• Operations
  • What are the atomic (indivisible) operations?
• Cost
  • How do we account for the cost of each of the above?
Simple Example
Consider computing the sum of a function applied to each element of an array:
    s = Σ_{i=0}^{n−1} f(A[i])
• Parallel Decomposition:
• Each evaluation and each partial sum is a task.
• Assign n/p numbers to each of p procs
• Each computes independent “private” results and partial sum.
• One (or all) collects the p partial sums and computes the global sum.
Two Classes of Data:
• Logically Shared
• The original n numbers, the global sum.
• Logically Private
• The individual function evaluations.
• What about the individual partial sums?
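To make the decomposition concrete, here is a minimal C sketch (names such as block_range, my_first, and my_count are made up for illustration) of how the n elements might be split into blocks of roughly n/p elements, one per process; each process then forms a private partial sum over its block.

/* Sketch only: block decomposition of n elements over p processes.
 * "me" is this process's rank (0 .. p-1); all names are illustrative. */
void block_range(int n, int p, int me, int *my_first, int *my_count) {
    int base = n / p;   /* every process gets at least n/p elements */
    int rem  = n % p;   /* the first n%p processes get one extra    */
    *my_count = base + (me < rem ? 1 : 0);
    *my_first = me * base + (me < rem ? me : rem);
}

/* Each process would compute a private partial sum over
 * A[my_first .. my_first + my_count - 1], and one (or all) processes
 * would then combine the p partial sums into the global sum. */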
Programming Model 1: Shared Memory
• Program is a collection of threads of control.
• Can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
• Also a set of shared variables, e.g., static variables, shared common blocks, or global heap.
• Threads communicate implicitly by writing and reading shared variables.
• Threads coordinate by synchronizing on shared variables
[Figure: threads P0, P1, …, Pn, each with private memory (e.g., i: 2, i: 5, i: 8), reading and writing a variable s in shared memory (s = ..., y = ..s ...)]
Shared Memory Code for Computing a Sum
static int s = 0;
Thread 1
  for i = 0, n/2-1
    s = s + f(A[i])

Thread 2
  for i = n/2, n-1
    s = s + f(A[i])
• Problem is a race condition on variable s in the program
• A race condition or data race occurs when:
  - two processors (or two threads) access the same variable, and at least one does a write.
  - the accesses are concurrent (not synchronized), so they could happen simultaneously
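To make the pseudocode above concrete, here is a minimal POSIX-threads sketch of the unsynchronized sum; the array contents, f, and N are made up for illustration. Because both threads perform an unprotected read-modify-write of s, the final value is unpredictable and can change from run to run.

/* Sketch of the racy sum: two threads update the shared s with no
 * synchronization. Names (A, f, N, worker) are illustrative. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static int A[N];
static long s = 0;                  /* shared and unprotected: a data race */

static long f(int x) { return x; }  /* stand-in for the real f */

static void *worker(void *arg) {
    long lo = (long)arg * (N / 2), hi = lo + N / 2;
    for (long i = lo; i < hi; i++)
        s = s + f(A[i]);            /* unsynchronized read-modify-write of s */
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1;
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0);
    pthread_create(&t1, NULL, worker, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("s = %ld (intended: %d)\n", s, N);  /* result is unpredictable */
    return 0;
}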
Shared Memory Code for Computing a Sum
static int s = 0;

Thread 1                                     Thread 2
…                                            …
compute f(A[i]) and put in reg0        7     compute f(A[i]) and put in reg0        9
reg1 = s                              27     reg1 = s                              27
reg1 = reg1 + reg0                    34     reg1 = reg1 + reg0                    36
s = reg1                              34     s = reg1                              36
…                                            …

• Assume s = 27 initially, f(A[i]) = 7 on Thread 1 and = 9 on Thread 2
• For this program to work, s should be 43 at the end
  • but it may be 43, 34, or 36, depending on how the two threads interleave
• The atomic operations are reads and writes
  • Never see ½ of one number
• All computations happen in (private) registers
Improved Code for Computing a Sum
static int s = 0;
static lock lk;
Thread 1
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i])
  lock(lk);
  s = s + local_s1
  unlock(lk);

Thread 2
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + f(A[i])
  lock(lk);
  s = s + local_s2
  unlock(lk);
• Since addition is associative, it's OK to rearrange order
• Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - But there is still a race condition on the update of shared s
  - The race condition can be fixed by adding locks (only one thread can hold a lock at a time; others wait for it)
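A hedged pthreads rendering of this improved version follows: each thread accumulates into a private local sum, and only the final update of s is protected by a lock (a pthread_mutex_t standing in for the slide's lock lk). The array, f, and N are again illustrative.

/* Sketch of the lock-protected sum: private partial sums, one short
 * critical section per thread. Names (A, f, N, worker) are illustrative. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static int A[N];
static long s = 0;
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

static long f(int x) { return x; }

static void *worker(void *arg) {
    long lo = (long)arg * (N / 2), hi = lo + N / 2;
    long local_s = 0;
    for (long i = lo; i < hi; i++)
        local_s += f(A[i]);   /* all of the work is on private data */
    pthread_mutex_lock(&lk);
    s += local_s;             /* the only shared update, now serialized */
    pthread_mutex_unlock(&lk);
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1;
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0);
    pthread_create(&t1, NULL, worker, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("s = %ld (expected %d)\n", s, N);  /* now always N */
    return 0;
}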
Machine Model 1a: Shared Memory
• Processors all connected to a large shared memory.
• Typically called Symmetric Multiprocessors (SMPs)
• Sun, HP, Intel, IBM SMPs (nodes of Millennium, SP)
• Difficulty scaling to large numbers of processors
• <32 processors typical
• Advantage: uniform memory access (UMA)
• Cost: much cheaper to access data in cache than main memory.
[Figure: processors P1, P2, …, Pn, each with its own cache ($), connected by a bus to a single shared memory]
Problems Scaling Shared Memory
• Why not put more processors on (with a larger memory)?
  • The memory bus becomes a bottleneck
• Example from a Parallel Spectral Transform Shallow Water Model (PSTSWM) demonstrates the problem
  • Experimental results (and slide) from Pat Worley at ORNL
• This is an important kernel in atmospheric models
  • 99% of the floating point operations are multiplies or adds, which generally run well on all processors
  • But it does sweeps through memory with little reuse of operands, which exercises the memory system
• These experiments show serial performance, with one "copy" of the code running independently on varying numbers of procs
  • The best case for shared memory: no sharing
  • But the data doesn't all fit in the registers/cache
Example: Problem in Scaling Shared Memory
• Performance degradation is a "smooth" function of the number of processes.
• No shared data between them, so there should be perfect parallelism.
• (Code was run for 18 vertical levels with a range of horizontal sizes.)
[Figure: PSTSWM serial performance vs. number of processes, from Pat Worley, ORNL]
Machine Model 1b: Distributed Shared Memory
• Memory is logically shared, but physically distributed
• Any processor can access any address in memory
• Cache lines (or pages) are passed around machine
• SGI Origin is canonical example (+ research machines)
• Scales to 100s (512 have been built)
• Limitation is the cache coherence protocol – need to keep cached copies of the same address consistent
[Figure: processors P1, P2, …, Pn, each with a cache ($), connected by a network to physically distributed memories]
Programming Model 2: Message Passing
• Program consists of a collection of named processes.
• Usually fixed at program startup time
• Thread of control plus local address space -- NO shared data.
• Logically shared data is partitioned over local processes.
• Processes communicate by explicit send/receive pairs
• Coordination is implicit in every communication event.
• MPI is the most common example
[Figure: processes P0, P1, …, Pn, each with private memory holding its own copy of s (s: 12, s: 14, s: 11) and i (i: 2, i: 3, i: 1), connected by a network; one process executes send P1,s while another executes receive Pn,s and then y = ..s ...]
Computing s = A[1]+A[2] on each processor
° First possible solution – what could go wrong?
Processor 1
xlocal = A[1]
send xlocal, proc2
receive xremote, proc2
s = xlocal + xremote
Processor 2
xlocal = A[2]
send xlocal, proc1
receive xremote, proc1
s = xlocal + xremote
° If send/receive acts like the telephone system? The post office?
° Second possible solution
Processor 1
xlocal = A[1]
send xlocal, proc2
receive xremote, proc2
s = xlocal + xremote

Processor 2
xlocal = A[2]
receive xremote, proc1
send xlocal, proc1
s = xlocal + xremote
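A hedged C/MPI rendering of this exchange is sketched below. It uses MPI_Sendrecv, which pairs the send and the receive so that neither of the orderings discussed above can deadlock; the array values are made up for illustration, and the program assumes exactly two processes.

/* Sketch: two MPI ranks each own one element of A; both compute
 * s = A[0] + A[1] by exchanging elements with MPI_Sendrecv. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {                     /* this sketch assumes 2 ranks */
        if (rank == 0) fprintf(stderr, "run with exactly 2 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    double A[2]   = {3.0, 4.0};          /* illustrative data            */
    double xlocal = A[rank];             /* the element this rank "owns" */
    double xremote;
    int other = 1 - rank;                /* the partner rank             */

    MPI_Sendrecv(&xlocal,  1, MPI_DOUBLE, other, 0,
                 &xremote, 1, MPI_DOUBLE, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    double s = xlocal + xremote;         /* both ranks now hold the sum  */
    printf("rank %d: s = %g\n", rank, s);

    MPI_Finalize();
    return 0;
}

Launched with two processes (e.g., mpirun -np 2), both ranks print s = 7.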
MPI – the de facto standard
By 2002, MPI had become the de facto standard for parallel computing.
The software challenge: overcoming the MPI barrier
• MPI finally created a standard for applications development in the HPC community
• Standards are always a barrier to further development
• The MPI standard is a least common denominator building on mid-80s technology
Programming Model reflects hardware!
"I am not sure how I will program a Petaflops computer, but I am sure that I will need MPI somewhere" – HDS 2001
Machine Model 2a: Distributed Memory
• Cray T3E, IBM SP2
• PC Clusters (Berkeley NOW, Beowulf)
• IBM SP-3, Millennium, CITRIS are distributed memory machines, but the nodes are SMPs.
• Each processor has its own memory and cache but cannot directly access another processor's memory.
• Each "node" has a network interface (NI) for all communication and synchronization.
[Figure: nodes P0, P1, …, Pn, each with its own memory and network interface (NI), connected by an interconnect]
Tflop/s Clusters
The following are examples of clusters configured out of separate network and processor components:
• Barcelona: 4th fastest in world (20 Tflop on Top500 Nov 2004; 4,536 2.2GHz IBM PowerPC 970s + Myrinet)
• Shell: largest commercial engineering/scientific cluster
• NCSA: 1024 processor cluster (IA64)
• Univ. Heidelberg cluster
• PNNL: announced 8 Tflops (peak) IA64 cluster from HP with Quadrics interconnect
• DTF in US: announced 4 clusters for a total of 13 Teraflops (peak)
Machine Model 2b: Internet/Grid Computing
• SETI@Home: Running on 500,000 PCs
• ~1000 CPU Years per Day
• 485,821 CPU Years so far
• Sophisticated Data & Signal Processing Analysis
• Distributes datasets from the Arecibo Radio Telescope
• Next step: Allen Telescope Array
Programming Model 2b: Global Addr Space
• Program consists of a collection of named threads.
  • Usually fixed at program startup time
  • Local and shared data, as in shared memory model
  • But, shared data is partitioned over local processes
  • Cost model says remote data is expensive
• Examples: UPC, Titanium, Co-Array Fortran
• Global Address Space programming is an intermediate point between message passing and shared memory
[Figure: threads P0, P1, …, Pn, each with private memory (i: 2, i: 5, i: 8) and a partitioned shared array s (s[0]: 27, s[1]: 27, …, s[n]: 27); one thread executes y = ..s[i] ... and another executes s[myThread] = ...]
Machine Model 2c: Global Address Space
• Cray T3D, T3E, X1, and HP Alphaserver cluster
• Clusters built with Quadrics, Myrinet, or Infiniband
• The network interface supports RDMA (Remote Direct Memory Access)
  • NI can directly access memory without interrupting the CPU
  • One processor can read/write memory with one-sided operations (put/get)
  • Not just a load/store as on a shared memory machine
  • Remote data is typically not cached locally
• Global address space may be supported in varying degrees
[Figure: nodes P0, P1, …, Pn, each with memory and an NI, connected by an interconnect]
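Languages such as UPC and Co-Array Fortran build put/get into the language. As a hedged illustration in plain C, MPI-2's one-sided operations express the same idea: one rank writes directly into another rank's exposed memory window, and the target posts no matching receive. The window size and values below are made up, and the sketch assumes at least two processes.

/* Sketch: rank 0 "puts" a value directly into rank 1's memory window. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf = -1.0;                     /* each rank exposes one double  */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open an access epoch          */
    if (rank == 0) {
        double x = 42.0;
        MPI_Put(&x, 1, MPI_DOUBLE,         /* origin data                   */
                1, 0, 1, MPI_DOUBLE,       /* target rank 1, displacement 0 */
                win);
    }
    MPI_Win_fence(0, win);                 /* complete the epoch            */

    if (rank == 1)
        printf("rank 1 received %g via a one-sided put\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}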
Programming Model 3: Data Parallel
• Single thread of control consisting of parallel operations.
• Parallel operations applied to all (or a defined subset) of a data structure, usually an array
  • Communication is implicit in parallel operators
  • Elegant and easy to understand and reason about
  • Coordination is implicit – statements executed synchronously
  • Similar to Matlab language for array operations
• Drawbacks:
  • Not all problems fit this model
  • Difficult to map onto coarse-grained machines

A = array of all data
fA = f(A)
s = sum(fA)

[Figure: array A mapped element-wise by f to fA, then reduced by sum to the scalar s]
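For concreteness, here is what those three whole-array statements mean operationally, spelled out as explicit loops in plain C; the array size and f are illustrative. In a data-parallel language (or Matlab) each loop is a single array statement, and the compiler or runtime distributes the elements and inserts any needed communication.

/* Sketch: the element-wise apply and the reduction that a data-parallel
 * language writes as  fA = f(A); s = sum(fA).  Names are illustrative. */
#include <stdio.h>

#define N 8

static double f(double x) { return x * x; }   /* stand-in for the real f */

int main(void) {
    double A[N] = {1, 2, 3, 4, 5, 6, 7, 8};   /* A  = array of all data  */
    double fA[N];
    double s = 0.0;

    for (int i = 0; i < N; i++)   /* fA = f(A): same op on every element */
        fA[i] = f(A[i]);

    for (int i = 0; i < N; i++)   /* s = sum(fA): a reduction            */
        s += fA[i];

    printf("s = %g\n", s);        /* 204 for this data                   */
    return 0;
}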
Machine Model 3a: SIMD System
• A large number of (usually) small processors.
• A single “control processor” issues each instruction.
• Each processor executes the same instruction.
• Some processors may be turned off on some instructions.
• Machines are very specialized to scientific computing, so they are not popular with vendors (CM2, Maspar)
• Programming model can be implemented in the compiler
  • mapping n-fold parallelism to p processors, n >> p, but it's hard (e.g., HPF)
[Figure: a control processor broadcasting each instruction to many processor + memory + NI nodes over an interconnect]
Machine Model 3b: Vector Machines
• Vector architectures are based on a single processor
  • Multiple functional units
  • All performing the same operation
  • Instructions may specify large amounts of parallelism (e.g., 64-way) but hardware executes only a subset in parallel
• Historically important
  • Overtaken by MPPs in the 90s
• Re-emerging in recent years
  • At a large scale in the Earth Simulator (NEC SX6) and Cray X1
  • At a small scale in SIMD media extensions to microprocessors (see the SSE sketch after this slide)
    • SSE, SSE2 (Intel: Pentium/IA64)
    • Altivec (IBM/Motorola/Apple: PowerPC)
    • VIS (Sun: Sparc)
• Key idea: Compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to
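As a small hedged example of the SIMD media extensions just mentioned, the C snippet below uses Intel SSE intrinsics to add four single-precision floats with one vector instruction; the array contents are made up, and in practice a vectorizing compiler often generates such instructions automatically.

/* Sketch: one 4-wide SSE addition, the "small scale" vector idea.
 * Compile with an SSE-capable compiler (e.g., gcc -msse). */
#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);     /* load 4 floats into a vector register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* 4 additions in one instruction       */
    _mm_storeu_ps(c, vc);

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
    return 0;
}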
Vector Processors
• Vector instructions operate on a vector of elements
  • These are specified as operations on vector registers
[Figure: a scalar add r3 = r1 + r2 vs. a vector add vr3 = vr1 + vr2, which logically performs #elts additions in parallel]
• A supercomputer vector register holds ~32-64 elements
• The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4
• The hardware performs a full vector operation in #elements-per-vector-register / #pipes steps (e.g., 64 elements / 2 pipes = 32 steps)
[Figure: with 2 pipes, the hardware actually performs #pipes additions in parallel per step]
Cray X1 Node
• Cray X1 builds a larger “virtual vector”, called an MSP
• 4 SSPs (each a 2-pipe vector processor) make up an MSP
• Compiler will (try to) vectorize/parallelize across the MSP
[Figure: Cray X1 MSP node built from custom blocks: four SSPs, each with a scalar unit (S) and two vector pipes (V), sharing a 2 MB Ecache made of four 0.5 MB caches; 12.8 Gflops (64 bit) and 25.6 Gflops (32 bit); 51 GB/s and 25-41 GB/s internal bandwidths; 25.6 GB/s and 12.8-20.5 GB/s to local memory and network; at a frequency of 400/800 MHz. Figure source: J. Levesque, Cray]
Cray X1: Parallel Vector Architecture
Cray combines several technologies in the X1
• 12.8 Gflop/s vector processors (MSP)
• Shared caches (unusual on earlier vector machines)
• 4 processor nodes sharing up to 64 GB of memory
• Single System Image to 4096 processors
• Remote put/get between nodes (faster than MPI)
Earth Simulator Architecture
Parallel Vector Architecture
• High speed (vector) processors
• High memory bandwidth (vector architecture)
• Fast network (new crossbar switch)

Rearranging commodity parts can't match this performance.
Machine Model 4: Clusters of SMPs
• SMPs are the fastest commodity machine, so use them as a building block for a larger machine with a network
• Common names:
  • CLUMP = Cluster of SMPs
  • Hierarchical machines, constellations
• Most modern machines look like this:
  • Millennium, IBM SPs, ASCI machines
• What is an appropriate programming model #4?
  • Treat machine as "flat", always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy).
  • Shared memory within one SMP, but message passing outside of an SMP (see the hybrid sketch below).
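One hedged sketch of the second option is the now-common hybrid style: OpenMP threads share memory within each SMP node, and MPI carries messages between nodes. The decomposition, f, and sizes below are illustrative.

/* Sketch: hybrid sum over a cluster of SMPs.
 * MPI between nodes, OpenMP threads (shared memory) within each node. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N_PER_RANK 1000000

static double f(double x) { return x * x; }   /* illustrative function */

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;

    /* Shared-memory parallelism within the SMP node. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N_PER_RANK; i++)
        local += f((double)(rank * N_PER_RANK + i));

    /* Message passing between nodes combines the per-node sums. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g\n", global);

    MPI_Finalize();
    return 0;
}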
Outline
• Overview of parallel machines and programming models
  • Shared memory
  • Shared address space
  • Message passing
  • Data parallel
  • Clusters of SMPs
• Trends in real machines
TOP500
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK (Ax=b, dense problem)
- Updated twice a year:
    ISC'xy in Germany, June xy
    SC'xy in USA, November xy
- All data available from www.top500.org
[Figure: LINPACK performance (Rate, TPP performance) as a function of problem Size]
TOP500 list - Data shown
• Manufacturer: Manufacturer or vendor
• Computer Type: Type indicated by manufacturer or vendor
• Installation Site: Customer
• Location: Location and country
• Year: Year of installation/last major update
• Customer Segment: Academic, Research, Industry, Vendor, Class.
• # Processors: Number of processors
• Rmax: Maximal LINPACK performance achieved
• Rpeak: Theoretical peak performance
• Nmax: Problem size for achieving Rmax
• N1/2: Problem size for achieving half of Rmax
• Nworld: Position within the TOP500 ranking
22nd List: The TOP10 (2003)
(Rank. Manufacturer, Computer: Rmax, Installation Site, Country, Year, Area of Installation, # Processors)
1. NEC, Earth-Simulator: 35.86 TF/s, Earth Simulator Center, Japan, 2002, Research, 5120 procs
2. HP, ASCI Q AlphaServer SC: 13.88 TF/s, Los Alamos National Laboratory, USA, 2002, Research, 8192 procs
3. Self-Made, X (Apple G5, Mellanox): 10.28 TF/s, Virginia Tech, USA, 2003, Academic, 2200 procs
4. Dell, Tungsten (PowerEdge, Myrinet): 9.82 TF/s, NCSA, USA, 2003, Academic, 2500 procs
5. HP, Mpp2 (Integrity rx2600 Itanium2, Quadrics): 8.63 TF/s, Pacific Northwest National Laboratory, USA, 2003, Research, 1936 procs
6. Linux Networx, Lightning (Opteron, Myrinet): 8.05 TF/s, Los Alamos National Laboratory, USA, 2003, Research, 2816 procs
7. Linux Networx/Quadrics, MCR Cluster: 7.63 TF/s, Lawrence Livermore National Laboratory, USA, 2002, Research, 2304 procs
8. IBM, ASCI White SP Power3: 7.3 TF/s, Lawrence Livermore National Laboratory, USA, 2000, Research, 8192 procs
9. IBM, Seaborg SP Power 3: 7.3 TF/s, NERSC/Lawrence Berkeley Nat. Lab., USA, 2002, Research, 6656 procs
10. IBM/Quadrics, xSeries Cluster Xeon 2.4 GHz: 6.59 TF/s, Lawrence Livermore National Laboratory, USA, 2003, Research, 1920 procs
[Charts from the TOP500 data, not reproduced here: Continents Performance (two charts), Customer Types, Manufacturers, Manufacturers Performance, Processor Types, Architectures, NOW – Clusters]
Analysis of TOP500 Data
• Annual performance growth is about a factor of 1.82
• Two factors contribute almost equally to the annual total performance growth:
  • Processor count grows per year on average by a factor of 1.30, and
  • Processor performance grows by 1.40 (compared to 1.58 for Moore's Law)
  • (1.30 × 1.40 ≈ 1.82)
Strohmaier, Dongarra, Meuer, and Simon, Parallel Computing 25, 1999, pp. 1517-1544.
Summary
• Historically, each parallel machine was unique, along with its programming model and programming language.
• It was necessary to throw away software and start over with each new kind of machine.
• Now we distinguish the programming model from the underlying machine, so we can write portably correct codes that run on many machines.
  • MPI is now the most portable option, but can be tedious.
• Writing portably fast code requires tuning for the architecture.
  • Algorithm design challenge is to make this process easy.
  • Example: picking a block size, not rewriting the whole algorithm.
Reading Assignment
• Extra reading for today
  • Cray X1: http://www.sc-conference.org/sc2003/paperpdfs/pap183.pdf
  • Clusters: http://www.mirror.ac.uk/sites/www.beowulf.org/papers/ICPP95/
  • "Parallel Computer Architecture: A Hardware/Software Approach" by Culler, Singh, and Gupta, Chapter 1.
• Next week: current high performance architectures
  • Shared memory (for Monday)
    • Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors, Gharachorloo et al., Proceedings of the International Symposium on Computer Architecture, 1990.
    • Or read about the Altix system on the web (www.sgi.com)
  • Blue Gene/L (for Wednesday)
    • http://sc-2002.org/paperpdfs/pap.pap207.pdf
Extra Slides
PC Clusters: Contributions of Beowulf
• An experiment in parallel computing systems
• Established vision of low cost, high end computing
• Demonstrated effectiveness of PC clusters for some (not all) classes of applications
• Provided networking software
• Conveyed findings to broad community (great PR)
  • Tutorials and book
• Design standard to rally community!
• Standards beget: books, trained people, software … virtuous cycle
Adapted from Gordon Bell, presentation at Salishan 2000
Open Source Software Model for HPC
• Linus's law, named after Linus Torvalds, the creator of Linux, states that "given enough eyeballs, all bugs are shallow".
  • All source code is "open"
  • Everyone is a tester
  • Everything proceeds a lot faster when everyone works on one code (HPC: nothing gets done if resources are scattered)
• Software is or should be free (Stallman)
• Anyone can support and market the code for any price
• Zero cost software attracts users!
• Prevents community from losing HPC software (CM5, T3E)
Cluster of SMP Approach
• A supercomputer is a stretched high-end server
• Parallel system is built by assembling nodes that are modest size, commercial SMP servers – just put more of them together
[Image from LLNL]