CMPE 478 Parallel Processing
[Title-slide figure: Tianhe, the most powerful computer in the world in Nov-2010]
Von Neumann Architecture
[Diagram: CPU, RAM, and I/O devices connected by a single BUS]
• sequential computer
Memory Hierarchy
[Diagram: memory hierarchy, fastest to slowest: Registers → Cache → Real Memory → Disk → CD]
History of Computer Architecture
• 4 Generations (identified by logic technology)
1. Tubes
2. Transistors
3. Integrated Circuits
4. VLSI (very large scale integration)
PERFORMANCE TRENDS
• Traditional mainframe/supercomputer performance: 25% increase per year
• But … microprocessor performance: 50% increase per year since mid 80’s.
Moore’s Law
• “Transistor density doubles every 18 months”
• Moore is co-founder of Intel.
• 60% increase per year
• Exponential growth
• PC costs decline.
• PCs are building bricks of all future systems.
[Figure: Intel 62-core Xeon Phi (2012), 5 billion transistors]
VLSI Generation
Bit Level Parallelism
(up to mid 80’s)
• 4-bit microprocessors replaced by 8-bit, 16-bit, 32-bit etc.
• doubling the width of the datapath reduces the number of cycles required to perform a full 32-bit operation
• mid 80’s: reap benefits of this kind of parallelism (full 32-bit word operations combined with the use of caches)
Instruction Level Parallelism
(mid 80’s to mid 90’s)
• Basic steps in instruction processing (instruction decode, integer arithmetic, address calculations) could be performed in a single cycle
• Pipelined instruction processing
• Reduced instruction set (RISC)
• Superscalar execution
• Branch prediction
Thread/Process Level Parallelism
(mid 90’s to present)
• On average control transfers occur roughly once in five
instructions, so exploiting instruction level parallelism at a
larger scale is not possible
• Use multiple independent “threads” or processes
• Concurrently running threads, processes
Evolution of the Infrastructure
• Electronic Accounting Machine Era: 1930 – 1950
• General Purpose Mainframe and Minicomputer Era: 1959 – Present
• Personal Computer Era: 1981 – Present
• Client/Server Era: 1983 – Present
• Enterprise Internet Computing Era: 1992- Present
Sequential vs Parallel Processing
• Sequential:
 - physical limits reached
 - easy to program
 - expensive supercomputers
• Parallel:
 - “raw” power unlimited
 - more memory, multiple cache
 - made up of COTS, so cheap
 - difficult to program
What is Multi-Core Programming ?
• Answer: It is basically parallel programming on a single computer box (e.g. a desktop, a notebook, a blade)
Another Important Benefit of Multi-Core: Reduced Energy Consumption

Single core (2 GHz): executes a workload of N clock cycles
  Energy per cycle: Ec = C * Vdd^2
  Energy = Ec * N

Dual core (2 x 1 GHz): each core executes a workload of N/2 clock cycles; running at half the frequency allows the supply voltage to be roughly halved
  Energy per cycle: E'c = C * (0.5*Vdd)^2 = 0.25 * C * Vdd^2
  Energy' = 2 * (E'c * 0.5 * N) = E'c * N = 0.25 * (Ec * N) = 0.25 * Energy
SPMD Model
(Single Program Multiple Data)
• Each processor executes the same program asynchronously
• Synchronization takes place only when processors need to
exchange data
• SPMD is extension of SIMD (relax synchronized instruction
execution)
• SPMD is restriction of MIMD (use only one source/object)
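A minimal SPMD sketch in C using MPI (an assumed example, not from the slides; MPI is one standard way to run one program on every processor). Every process executes the same source; the rank selects its share of the work, and synchronization happens only at explicit communication points. Compile with mpicc and run with mpirun.

/* spmd_hello.c : every process runs this same program asynchronously */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* unique id of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    /* each process uses its rank to pick its own part of the work */
    printf("process %d of %d doing its own part of the work\n", rank, size);

    /* synchronization only when data must be exchanged, e.g. a barrier */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}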
Parallel Processing Terminology
• Embarrassingly Parallel:
 - applications which are trivial to parallelize
 - large amounts of independent computation
 - little communication
• Data Parallelism:
 - model of parallel computing in which a single operation can be applied to all data elements simultaneously
 - amenable to SIMD or SPMD style of computation
• Control Parallelism:
 - many different operations may be executed concurrently
 - require MIMD/SPMD style of computation
Parallel Processing Terminology
• Scalability:
- If the size of problem is increased, number of processors that can be
effectively used can be increased (i.e. there is no limit on
parallelism).
- Cost of scalable algorithm grows slowly as input size and the
number of processors are increased.
- Data parallel algorithms are more scalable than control parallel algorithms
• Granularity:
- fine grain machines: employ massive number of weak processors
each with small memory
- coarse grain machines: smaller number of powerful processors each
with large amounts of memory
Models of Parallel Computers
1. Message Passing Model
   - Distributed memory
   - Multicomputer
2. Shared Memory Model
   - Multiprocessor
   - Multi-core
3. Theoretical Model
   - PRAM
• New architectures: combination of 1 and 2.
Theoretical PRAM Model
• Used by parallel algorithm designers
• Algorithm designers do not want to worry about low level details: they want to concentrate on algorithmic details
• Extends the classic RAM model
• Consists of:
   – Control unit (common clock), synchronous
   – Global shared memory
   – Unbounded set of processors, each with its own private memory
Theoretical PRAM Model
• Some characteristics
   – Each processor has a unique identifier, mypid = 0,1,2,…
   – All processors operate synchronously under the control of a common clock
   – In each unit of time, each processor is allowed to execute an instruction or stay idle
Various PRAM Models
(how write conflicts to the same memory location are handled)

weakest
  EREW (exclusive read / exclusive write)
  CREW (concurrent read / exclusive write)
  CRCW (concurrent read / concurrent write)
    - Common (must write the same value)
    - Arbitrary (one processor is chosen arbitrarily)
    - Priority (processor with the lowest index writes)
strongest
Flynn’s Taxonomy
• classifies computer architectures according to:
1. Number of instruction streams it can process at a time
2. Number of data elements on which it can operate
simultaneously
                              Data Streams
                         Single        Multiple
Instruction   Single      SISD          SIMD
Streams       Multiple    MISD          MIMD
Shared Memory Machines
[Diagram: several processes (threads) accessing a single Shared Address Space]
• Memory is globally shared, therefore processes (threads) see a single address space
• Coordination of accesses to locations is done by use of locks provided by thread libraries
• Example Machines: Sequent, Alliant, SUN Ultra, Dual/Quad Board Pentium PC
• Example Thread Libraries: POSIX threads, Linux threads.
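A minimal shared-memory sketch in C using POSIX threads, one of the libraries named above (the counter, thread count, and loop bound are illustrative assumptions): all threads see the same address space, and a lock coordinates accesses to the shared location.

/* shared_counter.c : compile with -pthread */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

static long counter = 0;                          /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&lock);                /* coordinate access with a lock */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    printf("thread %ld done\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);           /* 4000 if coordination is correct */
    return 0;
}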
Shared Memory Machines
• can be classified as:
   - UMA: uniform memory access
   - NUMA: nonuniform memory access
  based on the amount of time a processor takes to access local and global memory.
[Diagrams (a), (b), (c): processors (P) and memory modules (M) connected by a bus or an interconnection network, illustrating uniform and nonuniform memory access organizations]
Distributed Memory Machines
[Diagram: several processes, each with its own local memory (M), connected by a network]
• Each processor has its own local memory (not directly accessible by others)
• Processors communicate by passing messages to each other
• Example Machines: IBM SP2, Intel Paragon, COWs (cluster of workstations)
• Example Message Passing Libraries: PVM, MPI
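A minimal message-passing sketch in C with MPI, one of the libraries named above (the value and message tag are illustrative): each process keeps its data in its own local memory and communicates only by sending and receiving messages.

/* send_recv.c : process 0 sends an integer to process 1 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                               /* data in process 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}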
Beowulf Clusters
• Use COTS, ordinary PCs and networking equipment
• Has the best price/performance ratio
[Figure: a PC cluster]
Multi-Core Computing
• A multi-core microprocessor is one which combines two or more
independent processors into a single package, often a single integrated
circuit.
• A dual-core device contains only two independent microprocessors.
Comparison of Different Architectures
[Diagram: one CPU state, one execution unit, one cache]
Single Core Architecture
Comparison of Different Architectures
[Diagram: two separate processors, each with its own CPU state, execution unit and cache]
Multiprocessor
Comparison of Different Architectures
[Diagram: two CPU states sharing one execution unit and one cache]
Hyper-Threading Technology
Comparison of Different Architectures
[Diagram: two cores in one package, each with its own CPU state, execution unit and cache]
Multi-Core Architecture
Comparison of Different Architectures
[Diagram: two cores, each with its own CPU state and execution unit, sharing a single cache]
Multi-Core Architecture with Shared Cache
Comparison of Different Architectures
[Diagram: two cores, each with two CPU states (hardware threads) sharing that core's execution unit and cache]
Multi-Core with Hyper-Threading Technology
Top 500 Most Powerful Supercomputer Lists
• http://www.top500.org/
Grid Computing
• provide access to computing power and various resources
just like accessing electrical power from electrical grid
• Allows coupling of geographically distributed resources
• Provide inexpensive access to resources irrespective of their
physical location or access point
• Internet & dedicated networks can be used to interconnect
distributed computational resources and present them as a
single unified resource
• Resources: supercomputers, clusters, storage systems, data
resources, special devices
Grid Computing
• the GRID is, in effect, a set of software tools which, when combined with hardware, would let users tap processing power off the Internet as easily as electrical power can be drawn from the electricity grid.
• Examples of Grids:
-TeraGrid (USA)
-EGEE Grid (Europe)
- TR-Grid (Turkey)
GRID COMPUTING
[Figure: analogy between the Power Grid and the Compute Grid]

Application domains: Archeology, Astronomy, Astrophysics, Civil Protection, Comp. Chemistry, Earth Sciences, Finance, Fusion, Geophysics, High Energy Physics, Life Sciences, Multimedia, Material Sciences, …

Scale: >250 sites, 48 countries, >50,000 CPUs, >20 PetaBytes, >10,000 users, >150 VOs, >150,000 jobs/day
Virtualization
• Virtualization is abstraction of computer resources.
• Make a single physical resource such as a server, an operating system, an application, or a storage device appear to function as multiple logical resources
• It may also mean making multiple physical resources such as storage devices or servers appear as a single logical resource
• Server virtualization enables companies to run more than one operating system at the same time on a single machine
Advantages of Virtualization
• Most servers run at just 10-15% capacity – virtualization can increase server utilization to 70% or higher.
• Higher utilization means fewer computers are required to process the same amount of work. Fewer machines means less power consumption.
• Legacy applications can also be run on older versions of an operating system.
• Other advantages: easier administration, fault tolerance, security
VMware Virtual Platform
[Diagram: two virtual machines (Apps 1 on OS 1, Apps 2 on OS 2), each seeing its own virtual x86 hardware (motherboard, disks, display, net), running on the VMware Virtual Platform, which itself runs on one real x86 machine]
• VMware is now a tens-of-billions-of-dollars company!
Cloud Computing
• Style of computing in which IT-related capabilities are provided “as a service”, allowing users to access technology-enabled services from the Internet ("in the cloud") without knowledge of, expertise with, or control over the technology infrastructure that supports them.
• General concept that incorporates software as a service (SaaS), Web 2.0 and other recent, well-known technology trends, in which the common theme is reliance on the Internet for satisfying the computing needs of the users.
Cloud Computing
• Virtualisation provides separation between infrastructure and user runtime environment
• Users specify virtual images as their deployment building blocks
• Pay-as-you-go allows users to use the service when they want and only pay for what they use
• Elasticity of the cloud allows users to start simple and explore more complex deployment over time
• Simple interface allows easy integration with existing systems
Cloud: Unique Features
• Ease of use
   – REST and HTTP(S)
• Runtime environment
   – Hardware virtualisation
   – Gives users full control
• Elasticity
   – Pay-as-you-go
   – Cloud providers can buy hardware faster than you!
Example Cloud: Amazon Web Services
• EC2 (Elastic Compute Cloud) is the computing service of Amazon
– Based on hardware virtualisation
– Users request virtual machine instances, pointing to
an image (public or private) stored in S3
– Users have full control over each instance (e.g.
access as root, if required)
– Requests can be issued via SOAP and REST
Example Cloud: Amazon Web Services
• Pricing information: http://aws.amazon.com/ec2/
PARALLEL PERFORMANCE MODELS
and
ALGORITHMS
Amdahl’s Law
• The serial percentage of a program is fixed, so the speed-up obtained by employing parallel processing is bounded.
• Led to pessimism in the parallel processing community and prevented development of parallel machines for a long time.

   Speedup = 1 / (s + (1-s)/P)

  where s is the serial fraction of the program and P is the number of processors.

• In the limit (P → ∞): Speedup = 1/s
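A quick worked example (numbers chosen for illustration, not from the slide): with serial fraction s = 0.1 and P = 16 processors,

   Speedup = 1 / (0.1 + 0.9/16) = 1 / 0.15625 ≈ 6.4

and no matter how many processors are used, the speedup can never exceed 1/s = 10.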
Gustafson’s Law
• Serial percentage is dependent on the number of
processors/input.
• Demonstrated achieving more than 1000 fold speedup using
1024 processors.
• Justified parallel processing
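For completeness, the usual statement of Gustafson's law (standard form, added here since the slide gives no formula): if s is the serial fraction of the time spent on the parallel machine, the scaled speedup is

   Scaled Speedup = s + (1-s)*P = P - (P-1)*s

so when the problem size is scaled with the machine, the achievable speedup grows almost linearly in P.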
Algorithmic Performance Parameters
• Notation:
   – Input size
   – Time complexity of the best sequential algorithm
   – Number of processors
   – Time complexity of the parallel algorithm when run on P processors
   – Time complexity of the parallel algorithm when run on 1 processor
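A common set of symbols for these quantities, assumed here and used in the formulas below:

   n   = input size
   T*  = time complexity of the best sequential algorithm
   P   = number of processors
   T_P = time of the parallel algorithm on P processors
   T_1 = time of the parallel algorithm on 1 processor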
Algorithmic Performance Parameters
• Speed-Up
• Efficiency
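The standard definitions, stated with the assumed notation above:

   Speedup    S = T* / T_P
   Efficiency E = S / P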
Algorithmic Performance Parameters
• Work = Processors × Time
   – Informally: how much time a parallel algorithm will take to simulate on a serial machine
   – Formally: see below
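With the assumed notation, the formal definition referred to above is:

   Work W = P × T_P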
Algorithmic Performance Parameters
• Work Efficient:
   – Informally: a work efficient parallel algorithm does no more work than the best serial algorithm
   – Formally: a work efficient algorithm satisfies the condition given below
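With the assumed notation, the usual condition is:

   W = P × T_P = O(T*)

i.e. the processor-time product is within a constant factor of the best sequential time.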
Algorithmic Performance Parameters
• Scalability:
   – Informally, scalability implies that if the size of the problem is increased, the number of processors effectively used can be increased (i.e. there is no limit on parallelism)
   – Formally, scalability means (see the note below):
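One common way to state this formally (an assumption here, since formal definitions vary): the number of processors that can be used efficiently grows with the input size, e.g. efficiency E(n, P) can be kept bounded away from zero by increasing n as P grows (the isoefficiency idea).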
Algorithmic Performance Parameters
• Some remarks:
   – Cost of a scalable algorithm grows slowly as input size and the number of processors are increased
   – Level of ‘control parallelism’ is usually a constant, independent of problem size
   – Level of ‘data parallelism’ is an increasing function of problem size
   – Data parallel algorithms are more scalable than control parallel algorithms
Goals in Designing Parallel Algorithms
• Scalability:
   – Algorithm cost grows slowly, preferably in a polylogarithmic manner
• Work Efficient:
   – We do not want to waste CPU cycles
   – May be an important point when we are worried about power consumption or ‘money’ paid for CPU usage
Summing N numbers in Parallel
initial:  x1        x2   x3     x4   x5        x6   x7     x8
step 1:   x1+x2     x2   x3+x4  x4   x5+x6     x6   x7+x8  x8
step 2:   x1+..+x4  x2   x3+x4  x4   x5+..+x8  x6   x7+x8  x8
step 3:   x1+..+x8  x2   x3+x4  x4   x5+..+x8  x6   x7+x8  x8   (result in the first position)
• Array of N numbers can be summed in log(N) steps using N/2 processors
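A minimal shared-memory sketch of the same idea in C with OpenMP (an assumed example; compile with -fopenmp): the reduction clause lets the runtime combine per-thread partial sums in a tree, much like the log(N)-step scheme above.

/* parallel_sum.c */
#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 8 };
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double sum = 0.0;

    /* each thread sums a chunk; partial sums are combined by the runtime */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %g\n", sum);   /* 36 for this example */
    return 0;
}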
Prefix Summing N numbers in Parallel
initial:  x1        x2        x3        x4        x5        x6        x7     x8
step 1:   x1+x2     x2+x3     x3+x4     x4+x5     x5+x6     x6+x7     x7+x8  x8
step 2:   x1+..+x4  x2+..+x5  x3+..+x6  x4+..+x7  x5+..+x8  x6+..+x8  x7+x8  x8
step 3:   x1+..+x8  x2+..+x8  x3+..+x8  x4+..+x8  x5+..+x8  x6+..+x8  x7+x8  x8
• Computing partial sums of an array of N numbers can be done in log(N) steps using N processors
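A minimal sketch in plain C of the log(N)-round computation above (the rounds that all N processors would execute simultaneously on a PRAM are simulated one after another here; as in the figure, the sums run toward the right, i.e. suffix sums).

/* log_step_prefix.c */
#include <stdio.h>
#include <string.h>

#define N 8

int main(void) {
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double tmp[N];

    for (int d = 1; d < N; d *= 2) {          /* offsets 1, 2, 4: log2(N) rounds */
        for (int i = 0; i < N; i++)           /* one "processor" per element */
            tmp[i] = (i + d < N) ? x[i] + x[i + d] : x[i];
        memcpy(x, tmp, sizeof x);
    }

    for (int i = 0; i < N; i++)               /* x[i] now holds x_i + ... + x_N */
        printf("%g ", x[i]);
    printf("\n");
    return 0;
}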
Prefix Paradigm for Parallel Algorithm
Design
• Prefix computation forms a paradigm for parallel algorithm development, just like other well known paradigms such as:
   – divide and conquer, dynamic programming, etc.
• Prefix Paradigm:
   – If possible, transform your problem to prefix type computation
   – Apply the efficient logarithmic prefix computation
• Examples of Problems solved by Prefix Paradigm:
   – Solving linear recurrence equations
   – Tridiagonal Solver
   – Problems on trees
   – Adaptive triangular mesh refinement
Solving Linear Recurrence Equations
• Given the linear recurrence equation:

     z_i = a_i * z_{i-1} + b_i * z_{i-2}

• we can rewrite it as:

     [ z_i     ]   [ a_i  b_i ] [ z_{i-1} ]
     [ z_{i-1} ] = [  1    0  ] [ z_{i-2} ]

• if we expand it, we get the solution in terms of partial products of the coefficient matrices and the initial values z_1 and z_0:

     [ z_i     ]   [ a_i  b_i ] [ a_{i-1}  b_{i-1} ]     [ a_2  b_2 ] [ z_1 ]
     [ z_{i-1} ] = [  1    0  ] [    1        0    ] ... [  1    0  ] [ z_0 ]

• use prefix computation to compute the partial products
Pointer Jumping Technique
initial:  x1        x2        x3        x4        x5        x6        x7     x8
step 1:   x1+x2     x2+x3     x3+x4     x4+x5     x5+x6     x6+x7     x7+x8  x8
step 2:   x1+..+x4  x2+..+x5  x3+..+x6  x4+..+x7  x5+..+x8  x6+x7     x7+x8  x8
step 3:   x1+..+x8  x2+..+x8  x3+..+x8  x4+..+x8  x5+..+x8  x6+..+x8  x7+x8  x8

• A linked list of N numbers can be prefix-summed in log(N) steps using N processors
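A minimal sketch in plain C of pointer jumping on an array-encoded linked list (an assumed example: next[i] holds the successor index, -1 marks the end of the list). Intermediate values can differ slightly from the figure depending on when a node's pointer reaches the end, but the final suffix sums are the same.

/* pointer_jumping.c */
#include <stdio.h>

#define N 8

int main(void) {
    double val[N]  = {1, 2, 3, 4, 5, 6, 7, 8};
    int    next[N] = {1, 2, 3, 4, 5, 6, 7, -1};
    double nval[N];
    int    nnext[N];

    for (int step = 0; step < 3; step++) {          /* log2(N) synchronous rounds */
        for (int i = 0; i < N; i++) {               /* one "processor" per node */
            if (next[i] != -1) {
                nval[i]  = val[i] + val[next[i]];   /* add successor's value */
                nnext[i] = next[next[i]];           /* jump the pointer */
            } else {
                nval[i]  = val[i];
                nnext[i] = -1;
            }
        }
        for (int i = 0; i < N; i++) { val[i] = nval[i]; next[i] = nnext[i]; }
    }

    for (int i = 0; i < N; i++)
        printf("%g ", val[i]);                      /* suffix sums of the list */
    printf("\n");
    return 0;
}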
Euler Tour Technique
Tree Problems:
• Preorder numbering
• Postorder numbering
• Number of Descendants
• Level of each node

[Figure: an example tree with nodes a–i, rooted at a]

• To solve such problems, first transform the tree by linearizing it into a linked list and then apply the prefix computation
Computing Level of Each Node by Euler Tour Technique

[Figure: the example tree with +1/-1 weights on the tour edges; below it, the initial weights w(<u,v>) along the Euler tour and their prefix sums pw(<u,v>)]

weight assignment such that:
  level(v) = pw(<v,parent(v)>)
  level(root) = 0
Computing Number of Descendants by Euler Tour Technique

[Figure: the example tree with 0/1 weights on the tour edges; below it, the initial weights w(<u,v>) along the Euler tour and their prefix sums pw(<u,v>)]

weight assignment such that:
  # of descendants(v) = pw(<parent(v),v>) - pw(<v,parent(v)>)
  # of descendants(root) = n
Preorder Numbering by Euler Tour Technique

[Figure: the example tree with 0/1 weights on the tour edges and the resulting preorder number at each node; below it, the initial weights w(<u,v>) along the Euler tour and their prefix sums pw(<u,v>)]

weight assignment such that:
  preorder(v) = 1 + pw(<v,parent(v)>)
  preorder(root) = 1
Postorder Numbering by Euler Tour Technique

[Figure: the example tree with 0/1 weights on the tour edges and the resulting postorder number at each node; below it, the initial weights w(<u,v>) along the Euler tour and their prefix sums pw(<u,v>)]

weight assignment such that:
  postorder(v) = pw(<parent(v),v>)
  postorder(root) = n
Binary Tree Traversal
• Preorder
• Inorder
• Postorder
Brent’s Theorem
• Given a parallel algorithm with computation time (depth) D, if the parallel algorithm performs W operations, then P processors can execute the algorithm in time D + (W-D)/P
For proof: consider DAG representation of computation
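A quick illustration with the summation algorithm from earlier (standard numbers, not from the slide): summing N numbers has W = N - 1 operations and depth D = log N, so P processors need time log N + (N - 1 - log N)/P; choosing P = N / log N keeps the running time O(log N) while making the algorithm work efficient.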
Work Efficiency
• Parallel Summation
• Parallel Prefix Summation
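A sketch of the comparison these bullets point to, using the definitions given earlier (the slide's own formulas are assumed): parallel summation with P = N/2 processors and log N steps has W = (N/2) log N, more than the O(N) sequential work, but with Brent's theorem and P = N / log N processors it becomes work efficient. The log-step prefix algorithm as presented uses N processors for log N steps, so W = N log N and it is not work efficient; work-efficient prefix (scan) algorithms doing O(N) work exist.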