Intro to Parallel Computing


Lecture 1: Parallel Processing for Scientific Applications
1
Parallel Computing
Multiple processes cooperating
to solve a single problem
2
Why Parallel Computing ?
 Easy to get huge computational
problems
physical simulation in 3D: 100 x 100 x
100 = 10^6 grid points
oceanography example: 48 M cells,
several variables per cell, one time step =
30 Gflop (30,000,000,000 floating point
operations)
3
Why Parallel Computing ?
 Numerical Prototyping:
real phenomena are too complicated to model
real experiments are too hard, too expensive,
or too dangerous for a laboratory:
 Examples: simulate aging effects on nuclear
weapons (ASCI Project), oil reservoir simulation,
large wind tunnels, galactic evolution, whole
factory or product life cycle design and
optimization, DNA matching (bioinformatics)
4
An Example -- Climate Prediction
[Figure: the discretized domain; each intersection is a grid point]
5
An Example -- Climate Prediction
 What is Climate?
Climate(longitude, latitude, height, time)
returns a vector of 6 values:
• temperature, pressure, humidity, and wind velocity
(3 components)
 Discretize: only evaluate at grid points:
Climate(i, j, k, n), where
t = n*dt, dt is a fixed time step, n is an integer, and
i, j, k are integers indexing the grid cells.
6
An Example -- Climate Prediction
 Area: 3000 x 3000 miles, Height: 11 miles ---
a 3000 x 3000 x 11 cubic-mile domain
 Segment size: 0.1 x 0.1 x 0.1 cubic miles -- about 10^11
different segments
 Two-day period, dt = 0.5 hours (2 x 24 x 2 = 96 time steps)
 About 100 instructions per segment:
the computation of parameters inside a segment uses
the initial values and the values from neighboring
segments (see the stencil sketch after this slide)
7
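To make "about 100 instructions per segment" concrete, here is a minimal C sketch of one update over the discretized domain. The grid sizes NX, NY, NZ, the 6-variable layout, and the averaging formula are assumptions for illustration only, not the actual climate model: each segment is updated from its own old values and those of its six face neighbors.

#include <stddef.h>

/* Hypothetical grid dimensions and variable count (assumed for illustration). */
#define NX 100
#define NY 100
#define NZ 100
#define NVARS 6   /* temperature, pressure, humidity, 3 wind components */

/* Index helper for a flattened 4-D array: segment (i, j, k), variable v. */
static size_t idx(int i, int j, int k, int v) {
    return (((size_t)i * NY + j) * NZ + k) * NVARS + v;
}

/* One time step: every interior segment is updated from its old values
 * and those of its six face neighbors (a simple stencil). */
void climate_step(const double *old_vals, double *new_vals) {
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            for (int k = 1; k < NZ - 1; k++)
                for (int v = 0; v < NVARS; v++)
                    /* Placeholder physics: average of self and neighbors. */
                    new_vals[idx(i, j, k, v)] =
                        (old_vals[idx(i, j, k, v)] +
                         old_vals[idx(i - 1, j, k, v)] + old_vals[idx(i + 1, j, k, v)] +
                         old_vals[idx(i, j - 1, k, v)] + old_vals[idx(i, j + 1, k, v)] +
                         old_vals[idx(i, j, k - 1, v)] + old_vals[idx(i, j, k + 1, v)]) / 7.0;
}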
An Example -- Climate Prediction
 A single update of the parameters in the
entire domain requires 10^11 x 100, or 10^13
instructions (10 trillion instructions). Updating
96 times gives about 10^15 instructions.
 Single-CPU supercomputer:
1000 MHz RISC CPU (about 10^9 instructions per second),
so execution time is about 10^6 seconds, roughly 280 hours.
 ??? Taking 280 hours to predict the weather
for the next 48 hours? (The arithmetic is checked in the short program below.)
8
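A quick check of the slide's arithmetic, assuming roughly one instruction per cycle on the 1000 MHz CPU. With the unrounded numbers the result is about 264 hours; the slide's 280 hours comes from rounding the total to 10^15 instructions.

#include <stdio.h>

int main(void) {
    double segments   = 3000.0 * 3000.0 * 11.0 / (0.1 * 0.1 * 0.1); /* ~1e11 segments    */
    double per_update = segments * 100.0;   /* instructions for one full update, ~1e13    */
    double steps      = 2 * 24 * 2;         /* 96 half-hour steps over two days           */
    double total      = per_update * steps; /* ~1e15 instructions                         */
    double ips        = 1.0e9;              /* 1000 MHz, assuming ~1 instruction per cycle */

    printf("segments: %.3g, total instructions: %.3g, hours: %.0f\n",
           segments, total, total / ips / 3600.0);
    return 0;
}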
Issues in Parallel Computing
 Design of Parallel Computers
 Design of Efficient Algorithms
 Methods for Evaluating Parallel Algorithms
 Parallel Programming Languages
 Parallel Programming Tools
 Portable Parallel Programs
 Automatic Programming of Parallel
Computers
9
Some Basic Studies
10
Design of Parallel Computers
 Parallel computing is information
processing that emphasizes the concurrent
manipulation of data elements belonging to
one or more processes solving a single
problem [Quinn:1994]
 Parallel computer: a multiple-processor
computer capable of parallel computing.
11
Efficient Algorithms
 Throughput: the number of results per
second
 Speedup: S = T1 / Tp (T1 = time on one processor, Tp = time on p processors)
 Efficiency: E = S / P (P = number of processors); a small numeric example follows below.
12
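A small numeric illustration of these definitions; the timings (120 s sequentially, 40 s on 4 processors) are made up for the example.

#include <stdio.h>

int main(void) {
    double t1 = 120.0;  /* made-up time of the sequential run, in seconds */
    double tp = 40.0;   /* made-up time on p processors                   */
    int    p  = 4;

    double speedup    = t1 / tp;      /* S = T1 / Tp -> 3.0  */
    double efficiency = speedup / p;  /* E = S / P   -> 0.75 */

    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}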
Scalability
 Algorithmic scalability: an algorithm is scalable if
the available parallelism increases at least linearly
with problem size.
 Architectural scalability: an architecture is
scalable if it continues to yield the same
performance per processor as both the number of
processors and the problem size are increased.
Solve larger problems in the same amount of time by
buying a parallel computer with more processors.
($$$$$ ??)
13
Parallel Architectures
 SMP: Symmetric Multiprocessor (SGI Power
Challenge, SUN Enterprise 6000)
 MPP: Massively Parallel Processors
INTEL ASCI Red: 9152 processors (1997)
SGI/Cray T3E 1200 LC1080-512: 1080 nodes (1998)
 Cluster: true distributed systems -- tightly-
coupled software on loosely-coupled (LAN-based) hardware.
NOW: Network of Workstations, COW: Cluster of
Workstations, Pile-of-PCs (PoPC)
14
Levels of Abstraction
[Diagram: three layers, top to bottom]
Applications (sequential? parallel?)
Programming Models (shared memory? message passing?)
Hardware Architecture / Addressing Space (shared memory? distributed memory?)
15
Is Parallel Computing Simple ?
16
A Simple Example
 Take a piece of paper and a pen.
 Algorithm:
 Step 1: Write a number on your paper
 Step 2: Compute the sum of your neighbors' values
 Step 3: Write the sum on the paper
17
** Question 1
How do you get values
from your neighbors?
18
Shared Memory Model
[Figure: the numbers 5, 0, 4 sit in shared memory, where every process can read its neighbors' values directly]
19
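A minimal POSIX-threads sketch of the paper-and-pen game in the shared memory model. The player count and the ring-of-neighbors layout are assumptions for illustration; note that Step 1 is done before the threads start here, so the synchronization issue raised two slides below does not yet arise.

#include <pthread.h>
#include <stdio.h>

#define N 3   /* number of players (threads); assumed for illustration */

/* Step 1 already happened: the numbers sit in shared memory. */
static int value[N] = {5, 0, 4};

static void *worker(void *arg) {
    int me = *(int *)arg;
    /* Step 2: a neighbor's value is obtained simply by reading shared memory. */
    int left  = value[(me + N - 1) % N];
    int right = value[(me + 1) % N];
    /* Step 3: "write the sum on the paper". */
    printf("process %d: %d + %d + %d = %d\n",
           me, left, value[me], right, left + value[me] + right);
    return NULL;
}

int main(void) {
    pthread_t tid[N];
    int id[N];
    for (int i = 0; i < N; i++) {
        id[i] = i;
        pthread_create(&tid[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
    return 0;
}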
Message Passing Model
[Figure: each process keeps its number in its own private memory; to get a neighbor's value it must send a message: "Hey !! What's your number ?"]
20
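The same game sketched in the message passing model with MPI. Every rank keeps its number in private memory, so neighbor values must be exchanged explicitly ("Hey !! What's your number ?"). The ring layout and the made-up local numbers are assumptions for illustration.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int my_value = rank * 2 + 1;          /* Step 1: made-up local number        */
    int left  = (rank + size - 1) % size; /* neighbors in a ring (assumed layout) */
    int right = (rank + 1) % size;
    int from_left, from_right;

    /* Step 2: values live in private memory, so they must be exchanged
     * explicitly as messages. */
    MPI_Sendrecv(&my_value, 1, MPI_INT, right, 0,
                 &from_left, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&my_value, 1, MPI_INT, left, 1,
                 &from_right, 1, MPI_INT, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Step 3: write down the sum. */
    printf("rank %d: %d + %d + %d = %d\n", rank, from_left, my_value,
           from_right, from_left + my_value + from_right);

    MPI_Finalize();
    return 0;
}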
** Question 2
Are you sure the sum is
correct ?
21
Some processor starts earlier
[Figure: one process has already computed its sum: 5 + 0 + 4 = 9]
22
Synchronization Problem !!
[Figure: the fast process has already done Step 3 and written its sum, 9, over its original number; a neighbor still at Step 2 now reads the 9 instead of the original value and computes 9 + 5 + 0 = 14, which is wrong. A barrier-based fix is sketched below.]
23
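One way to fix the synchronization problem in the shared memory version is a barrier: no process may start reading neighbors (Step 2) until every process has finished writing its own number (Step 1). A minimal sketch with a POSIX barrier, again with made-up numbers and an assumed ring of three players.

#include <pthread.h>
#include <stdio.h>

#define N 3   /* assumed ring of three players */

static int value[N];
static pthread_barrier_t step1_done;   /* all Step 1 writes must finish first */

static void *worker(void *arg) {
    int me = *(int *)arg;

    value[me] = me * 3 + 5;             /* Step 1: write my (made-up) number        */
    pthread_barrier_wait(&step1_done);  /* wait until EVERY number has been written */

    /* Step 2: only now is it safe to read the neighbors' values. */
    int left  = value[(me + N - 1) % N];
    int right = value[(me + 1) % N];
    printf("thread %d: sum = %d\n", me, left + value[me] + right);  /* Step 3 */
    return NULL;
}

int main(void) {
    pthread_t tid[N];
    int id[N];

    pthread_barrier_init(&step1_done, NULL, N);
    for (int i = 0; i < N; i++) {
        id[i] = i;
        pthread_create(&tid[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&step1_done);
    return 0;
}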
** Question 3
How do you decide when you are
done? (throw away the paper)
24
Some processor finished earlier
[Figure: one process has computed its sum, 5 + 0 + 4 = 9]
25
Some processor finished earlier
[Figure: it writes down the result 9 and considers itself done]
26
Some processor finished earlier
[Figure: the finished process throws away its paper and leaves: "Sorry !! We closed !!"]
27
Some processor finished earlier
[Figure: a neighbor still at Step 2 asks the departed process for its value and gets no answer: ? + 5 + 0 = ?]
28
Classification of Parallel
Architectures
29
1. Based on Control Mechanism
 Flynn's Classification: based on the instruction
and data streams:
SISD: single instruction stream, single data stream
SIMD: single instruction stream, multiple data streams
MIMD: multiple instruction streams, multiple data
streams
MISD: multiple instruction streams, single data stream
30
SIMD
 Examples:
Thinking Machines: CM-1, CM-2
MasPar MP-1 and MP-2
 Simple processor: e.g., 1- or 4-bit CPU
 Fast global synchronization (global clock)
 Fast neighborhood communication
 Applications: image/signal processing, numerical
analysis, data compression,...
31
2. Based on Address-space
organization
 Bell's Classification of MIMD architectures
Message-passing architecture
• local or private memory
• multicomputer = MIMD message-passing
computer (or distributed-memory computer)
Shared-address-space architecture
• hardware support for one-sided communication
(read/write)
• multiprocessor = MIMD shared-address-space
computer
32
Address Space
 A region of a computer's total memory within
which addresses are contiguous and may refer to
one another directly by hardware.
 A shared-memory computer has only one user-visible address space.
 A disjoint-memory computer can have several.
 Disjoint memory is more commonly called
distributed memory, but the memory of many
shared-memory computers (multiprocessors) is
physically distributed.
33
Multiprocessors vs. Multicomputers
 Shared-Memory Multiprocessor Models
UMA: uniform memory access (all SMP servers)
NUMA: nonuniform-memory-access (DASH, T3E)
COMA: cache-only memory architecture (KSR)
 Distributed-Memory Multicomputer Model
message-passing network
NORMA model (no-remote-memory-access)
IBM SP2, Intel Paragon, TMC CM-5, INTEL ASCI
Red, cluster
34
Parallel Computers at HKU
 Symmetric Multiprocessors (SMPs)
 SGI PowerChallenge (CYC 807)
 Cluster:
 IBM PowerPC Clusters (CYC LG 102)
 Distributed Memory Machine
 IBM SP2 (Computer Center)
35
Symmetric Multiprocessors (SMPs)
 Processors are connected to a shared
memory module through a shared bus
 Each processor has equal rights to access:
• the shared memory
• all I/O devices
 A single copy of the OS
36
[Figure: a four-processor Pentium Pro SMP node (CYC414 SRG Lab). Four P6 processors share the Pentium Pro processor bus (32-bit address, 64-bit data, 533 MB/s). The bus connects to a memory controller (DRAM controller, data path, and four MIC chips; memory data is 72 bits per MIC, 288 bits interleaved) and to a PCI bridge; a NIC (32-bit address, 32-bit data, 132 MB/s) and other PCI devices sit on the PCI bus and link the node to the network. MIC: memory interface controller]
37
SMP Machine
SGI POWER CHALLENGE
 POWER CHALLENGE XL
– 2-36 CPUs
– 16 GB memory (for 36 CPUs)
– The bus performance: up to 1.2GB/sec
 Runs a 64-bit OS (IRIX 6.2)
 Memory is shared, which makes it suitable for
single-address-space programming
38
Distributed Memory Machine
 Consists of multiple computers (nodes)
 Nodes communicate by message
passing
 Each node is an autonomous computer
• Processor(s) (may be an SMP)
• Local memory
• Disks, network adapter, and other I/O
peripherals
 No-remote-memory-access
(NORMA)
Distributed Memory Machine
IBM SP2
SP2 => Scalable POWERparallel System
Developed based on the RISC System/6000
workstation
POWER2 processor, 66.6 MHz, 266 MFLOPS
40
SP2 - Message Passing
41
SP2 - High Performance Switch
8x8 Switch
 Switches among the nodes simultaneously and quickly
 Maximum 40 MB/s point-to-point bandwidth
42
SP2 - Nodes (POWER 2 processor)
 Two types of nodes:
– Thin node (smaller capacity, used to run
individual jobs): 4 micro-channel slots,
96 KB cache, 64-512 MB memory, 1-4 GB disk
– Wide node (larger capacity, used as
servers of the system): 8 micro-channel slots,
288 KB cache, 64-2048 MB memory, 1-8 GB
disk
43
SP2
– The largest SP (P2SC, 120 MHz) machine:
Pacific Northwest National Lab. U.S., 512
processors, TOP 26, 1998.
44
What’s a Cluster ?
 A cluster is a group of whole computers that
work cooperatively as a single system to
provide fast and efficient computing
services.
45
Switched Ethernet
[Figure: four nodes (Node 1 to Node 4) connected by a switched Ethernet. Node 1: "I need variable A from you, Node 2!" Node 2: "OK!" Node 1: "Thanks!"]
46
Clusters
 Advantages
Cheaper
Easy to scale
Coarse-grain parallelism
 Disadvantages
Poor communication performance
(typically in latency) compared with
other parallel systems
47
TOP 500 (1997)
 TOP 1 INTEL: ASCI Red at Sandia Nat’l
Lab. USA, June 1997
 TOP 2 Hitachi/Tsukuba: CP-PACS (2048
processors), 0.368 Tflops at Univ. Tsukuba
Japan, 1996
 TOP 3 SGI/Cray: T3E 900 LC696-128 (696
processors), 0.264 Tflops, UK
Meteorological Office, UK, 1997
48
TOP 500 (June, 1998)
 TOP 1 INTEL: ASCI Red (9152 Pentium
Pro processors, 200 MHz), 1.3 Teraflops at
Sandia Nat’l Lab. U.S., since June 1997
 TOP 2 SGI/Cray: T3E 1200 LC1080-512, 1080
processors, 0.891 Tflops, U.S. government,
installed 1998
 TOP 3. SGI/Cray: T3E900 LC1248-128, 1248
processors, 0.634 Tflops, U.S. government
49
INTEL ASCI Red
Compute nodes: 4,536
(dual Pentium Pro 200 MHz
sharing a 533 MB/s bus)
•Peak speed: 1.8 Teraflops (Trillion: 10^12)
•1,600 square feet, 85 cabinets
50
INTEL ASCI Red (Network)
Node-to-node bidirectional bandwidth: 800 MB/s
10 times faster than the SP2 (one-way: 40 MB/s)
Split 2-D Mesh Interconnect
51
Cray T3E 1200
 Processor performance: 600 MHz, 1200 Mflops
 Overall system peak performance: 7.2 gigaflops
to 2.5 teraflops, scaling to thousands of processors
 Interconnect: a three-dimensional bidirectional
torus (peak interconnect speed of 650 MB/s)
 Cray UNICOS/mk distributed OS
 Scalable GigaRing I/O system
52
Cray T3E Interconnect 3-D Torus
53
CP-PACS/2048, Japan
CPU: PA-RISC 1.1, 150 MHz
Peak Perf. 0.614 TFLOPS
54
CP-PACS Interconnect
Comm. Bandwidth: 300 MB/s per link
55
TOP 500 (Asia)
 1996:
Japan:
• (1) SR2201/1024 (1996)
Taiwan:
• (76) SP2/80
Korea:
• (97) Cray Y-MP/16
China:
• (231) SP2/32
Hong Kong:
• (232) -- SP2/32 (HKU/CC)
 1997:
Japan:
• (2)CP-PACS/2048 (1996)
• (5) SR2201/1024 (1996)
Korea:
• (34)T3E 900 LC128-128
• (154) Ultra HPC 1000
Taiwan:
• (167) SP2/80
Hong Kong:
• (426) SGI Origin 2000 (CU)
(500): SP2/38 (UCLA)
56
TOP500 Asia 1998
 Japan:
TOP 6: CP-PACS/2048 (1997, TOP 2)
TOP 12: NEC SX-4/128M4
TOP 13: NEC SX-4/128H4
TOP 14: Hitachi SR2201/1024
…more
 Korea: SGI/Cray T3E900 LC128-128 (TOP 52),...
 Taiwan: IBM SP2/110, 1998 (TOP 241)
57
More Information
 TOP500: http://www.top500.org/
 ASCI Red:
http://www.sandia.gov/ASCI/Red.htm
 Cray T3E 1200:
http://www.cray.com/products/systems/crayt3e/1200/
 Reading: Sections 1.2-1.4, 2.1-2.4.1
58