Transcript: TI1400 Computer Organization at TU Delft

Advanced computer systems (Chapter 12)
http://www.pds.ewi.tudelft.nl/~iosup/Courses/2011_ti1400_12.ppt
Large-Scale Computer Systems Today
• Low-energy defibrillation
  - Saves lives
  - Affects >2M people/year
• Studies involving both laboratory experiments and computational simulation
Source: TeraGrid science highlights 2010, https://www.teragrid.org/c/document_library/get_file?uuid=e950f0a1-abb6-4de5-a509-46e535040ecf&groupId=14002
Large-Scale Computer Systems Today
• Genome sequencing
  - May save lives
  - The $1,000 barrier
  - Large-scale molecular dynamics simulations
• Tectonic plate movement
  - May save lives
  - Adaptive fine-mesh simulations
  - Using 200,000 processors
Source: TeraGrid science highlights 2010, https://www.teragrid.org/c/document_library/get_file?uuid=e950f0a1-abb6-4de5-a509-46e535040ecf&groupId=14002
Large-Scale Computer Systems Today
• Public Content Generation
  - Wikipedia
  - Affects how we think about collaborations
• “The distribution of effort has increasingly become more uneven, unequal”
  (Sorin Adam Matei, Purdue University)
Source: TeraGrid science highlights 2010, https://www.teragrid.org/c/document_library/get_file?uuid=e950f0a1-abb6-4de5-a509-46e535040ecf&groupId=14002
Large-Scale Computer Systems Today
• Online Gaming
  - World of Warcraft, Zynga
  - Affects >250M people
• “As an organization, World of Warcraft utilizes 20,000 computer systems, 1.3 petabytes of storage, and more than 4600 people.”
  - 75,000 cores
  - Upkeep: >$135,000/day (?)
Source: http://www.gamasutra.com/php-bin/news_index.php?story=25307 and http://spectrum.ieee.org/consumer-electronics/gaming/engineering-everquest/0
and http://35yards.wordpress.com/2011/03/01/world-of-warcraft-by-the-numbers/
Why parallelism (1/4)
• Fundamental laws of nature:
  - example: channel widths are becoming so small that quantum properties are going to determine device behaviour
  - signal propagation time increases when channel widths shrink
Why parallelism (2/4)
• Engineering constraints:
  - The phase transition time of a component is a good measure for the maximum obtainable computing speed
    • example: optical or superconducting devices can switch in 10^-12 seconds
    • optimistic suggestion: 1 TIPS (Tera Instructions Per Second, 10^12) is possible
  - However, we must calculate something
    • assume we need 10 phase transitions per instruction: 0.1 TIPS
Why parallelism (3/4)
But what about memory?
• It takes light approximately 16 picoseconds to cross 0.5 cm, yielding a possible execution rate of 60 GIPS
• However, in silicon, signals travel about 10 times slower, resulting in 6 GIPS
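These numbers follow from simple arithmetic; here is a back-of-the-envelope check in Python (a sketch; all constants are taken from the two slides above):

  switch_time = 1e-12                      # optical/superconducting switch: 10^-12 s
  print(1 / switch_time / 1e12)            # 1.0 -> 1 TIPS at one transition per instruction
  print(1 / (10 * switch_time) / 1e12)     # 0.1 -> 0.1 TIPS at 10 transitions per instruction

  c = 3e8                                  # speed of light in vacuum, m/s
  t_cross = 0.005 / c                      # ~16.7 ps to cross 0.5 cm
  print(1 / t_cross / 1e9)                 # ~60 GIPS if each instruction needs one crossing
  print(1 / (10 * t_cross) / 1e9)          # ~6 GIPS in silicon (~10x slower)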
Why parallelism (4/4)
• The speed of sequential computers is limited to a few GIPS
• Improvements by using parallelism:
  - multiple functional units (instruction-level parallelism)
  - multiple CPUs (parallel processing)
Quantum Computing?
• “Qubits are quantum bits that can be in an ‘on’, ‘off’, or ‘both’ state due to fuzzy physics at the atomic level.”
• Does surrounding noise matter?
  - Wim van Dam, Nature Physics 2007
• May 25, 2011
  - Lockheed Martin ($10M)
  - D-Wave One, 128 qubits
Source: http://www.engadget.com/2011/05/29/d-wave-sells-first-commercial-quantum-computer-to-lockheed-marti/
Agenda
1. Introduction
2. The Flynn Classification of Computers
3. Types of Multi-Processors
4. Interconnection Networks
5. Memory Organization in Multi-Processors
6. Program Parallelism and Shared Variables
7. Multi-Computers
8. A Programmer’s View
9. Performance Considerations
Classification of computers (Flynn Taxonomy)
• Single Instruction, Single Data (SISD)
  - conventional system
• Single Instruction, Multiple Data (SIMD)
  - one instruction on multiple data objects
• Multiple Instruction, Multiple Data (MIMD)
  - multiple instruction streams on multiple data streams
• Multiple Instruction, Single Data (MISD)
  - ?????
Agenda
1. Introduction
2. The Flynn Classification of Computers
3. Types of Multi-Processors
4. Interconnection Networks
5. Memory Organization in Multi-Processors
6. Program Parallelism and Shared Variables
7. Multi-Computers
8. A Programmer’s View
9. Performance Considerations
SIMD (Array) Processors
[Figure: an Instruction Issuing Unit broadcasting to an array of PEs (PE = Processing Element); machines pictured: CM-2 (’87) and CM-5 (’91). Peak: 28 GFLOPS; sustainable: 5-10% of peak.]
Sources: http://cs.adelaide.edu.au/~sacpc/hardware.html#cm5 and http://www.paulos.net/other/cm2.html and http://boards.straightdope.com/sdmb/archive/index.php/t-515675.html (about the blinking leds)
MIMD
Uniform Memory Access (UMA) architecture
Any processor can access directly any memory.
[Figure: Uniform Memory Access (UMA) computer: processors P1, P2, ..., Pm connected through an interconnection network to memory modules M1, M2, ..., Mk that together hold addresses 0..N.]
MIMD
NUMA architecture
Any processor can access directly any memory.
[Figure: Non-Uniform Memory Access (NUMA) computer: processors P1, P2, ..., Pm, each paired with a local memory module M1, M2, ..., Mm, connected by an interconnection network; the address space 0..N is distributed over the modules.]
Realization in hardware or in software (distributed shared memory).
MIMD
Distributed memory architecture
Any processor can access any memory, but sometimes through another processor (via messages).
[Figure: processors P1, P2, ..., Pm with private memories M1, M2, ..., Mm, each locally addressed from 0, connected by an interconnection network.]
Example 1: Graphics Processing Units (GPUs)
CPU versus GPU:
• CPU: much cache and control logic
• GPU: much compute logic
GPU Architecture
SIMD architecture:
• Multiple SIMD units
• SIMD pipelining
• Simple processors
• High branch penalty
• Efficient operation on
  - parallel data
  - regular streaming
Example 2: Cell B.E.
Distributed memory architecture
[Figure: the Cell Broadband Engine: 8 identical cores plus a PowerPC core.]
Example 3: Intel Quad-core
Shared Memory MIMD
Example 4: Large MIMD Clusters
BlueGene/L
Supercomputers Over Time
Source: http://www.top500.org
Agenda
1. Introduction
2. The Flynn Classification of Computers
3. Types of Multi-Processors
4. Interconnection Networks (I/O)
5. Memory Organization in Multi-Processors
6. Program Parallelism and Shared Variables
7. Multi-Computers
8. A Programmer’s View
9. Performance Considerations
Interconnection networks (I/O between processors)
• Difficulty in building systems with many processors: the interconnections
• Important parameters:
  1. Diameter: maximal distance between any two processors
  2. Degree: maximal number of connections per processor
  3. Total number of connections (cost)
  4. Bisection width: largest number of simultaneous messages
Multiple bus
[Figure: (multiple) bus structures: processors and memories attached to two buses, Bus 1 and Bus 2.]
Cross bar
[Figure: cross-bar interconnection network: N inputs, N outputs, N^2 switches. Example machine: Sun E10000.]
Source: http://www.cray-cyber.org/systems/E10k_detail.php
Multi-stage networks (1/4)
[Figure: a three-stage network connecting 8 modules with 3-bit ids (P0..P7); the path from P5 to P3 is highlighted.]
Multi-stage networks (2/4)
[Figure: a shuffle network (“shuffle”: split the deck into two halves and interleave, as with cards); the connections P4-P0 and P5-P3 both use the same link, so they cannot be realized simultaneously.]
Multi-stage network (3/4)
• Multi-stage networks: multiple steps
• Example: the Shuffle or Omega network
• Every processor is identified by a three-bit number (in general, an n-bit number)
• A message from one processor to another contains the identifier of the destination
• Routing algorithm (see the sketch below): in every stage,
  - inspect one bit of the destination
  - if 0: use the upper output
  - if 1: use the lower output
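A minimal sketch of this routing rule in Python (the function name and the most-significant-bit-first order are my assumptions; the slide does not fix the bit order):

  def omega_route(dst: int, n_stages: int = 3) -> list:
      """Output port chosen at each stage while routing toward dst."""
      ports = []
      for stage in range(n_stages):
          bit = (dst >> (n_stages - 1 - stage)) & 1   # inspect one destination bit
          ports.append("lower" if bit else "upper")   # 0 -> upper, 1 -> lower
      return ports

  print(omega_route(0b011))   # to P3: ['upper', 'lower', 'lower']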
Multi-stage network (4/4)
• Properties:
  - Let N = 2^n be the number of processing elements
  - Number of stages: n = log2(N)
  - Number of switches per stage: N/2
  - Total number of (2x2) switches: N*log2(N)/2
• Not every pair of connections can be simultaneously realized
  - Blocking
Hypercubes (1/3)
Non-uniform delay, so suited for NUMA architectures.
[Figure: hypercubes for n=2 (nodes 00..11) and n=3 (nodes 000..111), with the route 000 -> 111 highlighted.]
• Connected PEs differ in exactly 1 bit
• n*2^(n-1) connections; maximum distance n hops
• Routing (sketched below):
  - scan the bits from right to left
  - if a bit differs, send to the neighbor whose id differs in that bit
  - repeat until the end
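A minimal sketch of this routing rule in Python (the function name is mine; node ids are plain integers):

  def hypercube_route(src: int, dst: int, n: int) -> list:
      """Return the nodes visited when routing from src to dst."""
      path, cur = [src], src
      for bit in range(n):                 # scan bits from right to left
          if (cur ^ dst) & (1 << bit):     # does this bit differ from the destination?
              cur ^= (1 << bit)            # hop to the neighbor differing in that bit
              path.append(cur)
      return path

  print(hypercube_route(0b000, 0b111, 3))  # [0, 1, 3, 7]: 000 -> 001 -> 011 -> 111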
Hypercubes (2/3)
• Question: what is the average distance between two nodes in a hypercube?
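A brute-force check in Python (my own sketch: the distance between two nodes is the number of bits in which their ids differ, averaged here over all ordered pairs):

  def avg_distance(n: int) -> float:
      nodes = range(2 ** n)
      total = sum(bin(u ^ v).count("1") for u in nodes for v in nodes)
      return total / (2 ** n) ** 2         # average over all ordered pairs

  for n in (2, 3, 4):
      print(n, avg_distance(n))            # 1.0, 1.5, 2.0: the average distance is n/2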
Mesh
Constant number of connections per node.
Torus
A mesh with wrap-around connections.
Tree
Fat tree
Nodes have multiple parents.
Local networks
• Ethernet
  - based on collision detection
  - upon collision, back off and randomly retry later
  - speeds up to 100 Gb/s (Terabit Ethernet?)
• Token ring
  - based on a token circulating on a ring
  - possession of the token allows putting a message on the ring
Agenda
1. Introduction
2. The Flynn Classification of Computers
3. Types of Multi-Processors
4. Interconnection Networks
5. Memory Organization in Multi-Processors
6. Program Parallelism and Shared Variables
7. Multi-Computers
8. A Programmer’s View
9. Performance Considerations
Memory organization (1/2)
UMA architectures.
[Figure: each node consists of a processor, a secondary cache, and a network interface; all memory accesses go over the network.]
Memory organization (2/2)
NUMA architectures.
[Figure: each node consists of a processor, a secondary cache, a local memory, and a network interface; remote memory is reached over the network.]
Cache coherence
• Problem: caches in multiprocessors may hold copies of the same variable
  - Copies must be kept identical
• Cache coherence: all copies of a shared variable have the same value
• Solutions:
  - write through to shared memory and all caches
  - invalidate the entries in all other caches on a write
• Snoopy caches:
  - processing elements sense (snoop) writes on the bus and update or invalidate their own copies
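A toy model of the write-through-plus-invalidate variant, as a sketch (the classes and methods are hypothetical illustrations, not a real protocol specification):

  class Bus:
      def __init__(self):
          self.caches, self.memory = [], {}

  class Cache:
      def __init__(self, bus):
          self.bus = bus
          self.lines = {}                      # address -> cached value
          bus.caches.append(self)

      def read(self, addr):
          if addr not in self.lines:           # miss: fetch from shared memory
              self.lines[addr] = self.bus.memory[addr]
          return self.lines[addr]

      def write(self, addr, value):
          self.bus.memory[addr] = value        # write through to shared memory
          self.lines[addr] = value
          for other in self.bus.caches:        # all caches snoop the bus...
              if other is not self:
                  other.lines.pop(addr, None)  # ...and invalidate their copy

  bus = Bus()
  c1, c2 = Cache(bus), Cache(bus)
  bus.memory["x"] = 1
  print(c1.read("x"), c2.read("x"))   # 1 1: both caches hold a copy
  c1.write("x", 2)                    # c2's copy is invalidated
  print(c2.read("x"))                 # 2: re-fetched from shared memory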
Agenda
1. Introduction
2. The Flynn Classification of Computers
3. Types of Multi-Processors
4. Interconnection Networks
5. Memory Organization in Multi-Processors
6. Program Parallelism and Shared Variables
7. Multi-Computers
8. A Programmer’s View
9. Performance Considerations
Parallelism
Language construct:

  PARBEGIN
    task_1;
    task_2;
    ...
    task_n;
  PAREND

[Figure: tasks task_1 .. task_n run in parallel between PARBEGIN and PAREND.]
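In a language without PARBEGIN/PAREND, the construct can be mimicked with threads; a minimal sketch in Python:

  import threading

  def task(i):
      print("task", i)

  # PARBEGIN: start task_1 .. task_n concurrently
  threads = [threading.Thread(target=task, args=(i,)) for i in range(1, 4)]
  for t in threads:
      t.start()
  # PAREND: wait until all tasks have completed
  for t in threads:
      t.join()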
Shared variables (1/4)
[Figure: tasks on processors T1 and T2 both store into the shared variable SUM in shared memory.]

  Task_1:
    .....
    STW R2, SUM(0)
    .....

  Task_2:
    .....
    STW R2, SUM(0)
    .....
Shared variables (2/4)
• Suppose processors 1 and 2 both execute:

    LW  A,R0     /* A is a variable in main memory */
    ADD R1,R0
    STW R0,A

• Initially:
  - A = 100
  - R1 in processor 1 is 20
  - R1 in processor 2 is 40
• What is the final value of A? 120, 140, 160?
  Now consider that the final value of A is your bank account balance.
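The possible outcomes can be found by enumerating all interleavings of the two load/store sequences; a small sketch in Python (the encoding of the operations is mine):

  from itertools import permutations

  def outcomes():
      finals = set()
      for order in set(permutations(["L1", "S1", "L2", "S2"])):
          if order.index("L1") > order.index("S1"):    # keep each task's
              continue                                 # program order:
          if order.index("L2") > order.index("S2"):    # load before store
              continue
          A, regs = 100, {}
          for op in order:
              task = op[1]
              if op[0] == "L":
                  regs[task] = A                       # LW  A,R0
              else:                                    # ADD R1,R0; STW R0,A
                  A = regs[task] + (20 if task == "1" else 40)
          finals.add(A)
      return finals

  print(sorted(outcomes()))   # [120, 140, 160]: all three answers are possible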
Shared variables (3/4)
• So there is a need for mutual exclusion:
  - different components of the same program need exclusive access to a data structure to ensure consistent values
• Occurs in many situations:
  - access to shared variables
  - access to a printer
• A solution: a single instruction (Test&Set) that
  - tests whether somebody else is accessing the variable
  - if so, continues testing (busy waiting)
  - if not, indicates that the variable is now being accessed
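A sketch of the Test&Set idea in Python (the SpinLock class is mine; an inner lock stands in for the hardware's atomicity guarantee):

  import threading

  class SpinLock:
      def __init__(self):
          self._flag = False
          self._guard = threading.Lock()   # stands in for hardware atomicity

      def test_and_set(self):
          """Atomically set the flag and return its previous value."""
          with self._guard:
              old, self._flag = self._flag, True
              return old

      def acquire(self):
          while self.test_and_set():       # busy waiting
              pass

      def release(self):
          self._flag = False               # like CLR LOCK on the next slide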
Shared variables (4/4)
[Figure: T1 and T2 guard the shared variable SUM with the shared variable LOCK, both in shared memory.]

  Task_1:
  crit: T&S LOCK,crit
        ......
        STW R2, SUM(0)
        .....
        CLR LOCK

  Task_2:
  crit: T&S LOCK,crit
        ......
        STW R2, SUM(0)
        .....
        CLR LOCK
Agenda
1. Introduction
2. The Flynn Classification of Computers
3. Types of Multi-Processors
4. Interconnection Networks
5. Memory Organization in Multi-Processors
6. Program Parallelism and Shared Variables
7. Multi-Computers [earlier, see Token Ring et al.]
8. A Programmer’s View
9. Performance Considerations
Example program
• Compute the dot product of two vectors with:
  1. a sequential program
  2. two tasks with shared memory
  3. two tasks with distributed memory using messages
• Primitives in parallel programs:
  - create_thread() (create a (sub)process)
  - mypid() (who am I?)
Sequential program
  integer array a[1..N], b[1..N]
  integer dot_product

  dot_product := 0;
  do_dot(a,b)
  print dot_product

  procedure do_dot(integer array x[1..N], integer array y[1..N])
    for k := 1 to N
      dot_product := dot_product + x[k]*y[k]
    end
  end
Shared memory program 1 (1/2)
  shared integer array a[1..N], b[1..N]
  shared integer dot_product
  shared lock dot_product_lock
  shared barrier done

  dot_product := 0;
  create_thread(do_dot,a,b)
  do_dot(a,b)
  print dot_product

[Figure: threads id=0 and id=1 both add into dot_product and then synchronize at the barrier.]
Shared memory program 1 (2/2)
  procedure do_dot(integer array x[1..N], integer array y[1..N])
    private integer id
    id := mypid();                        /* who am I? */
    for k := (id*N/2)+1 to (id+1)*N/2     /* k in [1..N/2] or [N/2+1..N] */
      lock(dot_product_lock)
      dot_product := dot_product + x[k]*y[k]    /* critical section */
      unlock(dot_product_lock)
    end
    barrier(done)
  end
Shared memory program 2 (1/2)
  procedure do_dot(integer array x[1..N], integer array y[1..N])
    private integer id, local_dot_product
    id := mypid();
    local_dot_product := 0;
    for k := (id*N/2)+1 to (id+1)*N/2     /* local computation (can run in parallel) */
      local_dot_product := local_dot_product + x[k]*y[k]
    end
    lock(dot_product_lock)                /* access shared variable (mutex) */
    dot_product := dot_product + local_dot_product
    unlock(dot_product_lock)
    barrier(done)
  end
Shared memory program 2 (2/2)
[Figure: id=0 and id=1 each compute a local_dot_product, add it into the shared dot_product, and meet at the barrier.]
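A runnable Python counterpart of shared memory program 2, as a sketch (thread join() stands in for barrier(done); the names mirror the pseudocode):

  import threading

  N = 8
  a = list(range(1, N + 1))
  b = list(range(1, N + 1))
  dot_product = 0
  dot_product_lock = threading.Lock()

  def do_dot(pid):
      global dot_product
      local_dot_product = 0
      for k in range(pid * N // 2, (pid + 1) * N // 2):   # local computation
          local_dot_product += a[k] * b[k]
      with dot_product_lock:                              # critical section
          dot_product += local_dot_product

  t = threading.Thread(target=do_dot, args=(1,))
  t.start()
  do_dot(0)            # the main thread plays id=0
  t.join()             # stands in for barrier(done)
  print(dot_product)   # 204 = 1*1 + 2*2 + ... + 8*8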
Message passing program (1/3)
  integer array a[1..N/2], temp_a[1..N/2], b[1..N/2], temp_b[1..N/2]
  integer dot_product, id, temp

  id := mypid();
  if (id = 0) then
    send(temp_a[1..N/2], 1);      /* send second halves */
    send(temp_b[1..N/2], 1);      /* of the two arrays to 1 */
  else
    receive(a[1..N/2], 0);        /* receive second halves of */
    receive(b[1..N/2], 0);        /* the two arrays from proc. 0 */
  end
Message passing program (2/3)
  dot_product := 0;
  do_dot(a,b)                     /* arrays of length N/2 */
  if (id = 1) then
    send(dot_product, 0)
  else
    receive(temp, 1)
    dot_product := dot_product + temp
    print dot_product
  end
Message passing program (3/3)
[Figure: id=0 sends the data (temp_a/b) to id=1; each computes a local dot product; id=1 sends its result back, and id=0 adds it into dot_product.]
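A runnable Python counterpart of the message passing version, as a sketch (processes connected by a Pipe stand in for send()/receive(); the numbering follows the slides):

  from multiprocessing import Pipe, Process

  N = 8

  def proc1(conn):                       # plays processor id=1
      a = conn.recv()                    # receive(a[1..N/2], 0)
      b = conn.recv()                    # receive(b[1..N/2], 0)
      conn.send(sum(x * y for x, y in zip(a, b)))   # send(dot_product, 0)

  if __name__ == "__main__":
      a = list(range(1, N + 1))
      b = list(range(1, N + 1))
      here, there = Pipe()
      p = Process(target=proc1, args=(there,))
      p.start()
      here.send(a[N // 2:])              # send second halves of
      here.send(b[N // 2:])              # the two arrays to proc. 1
      dot_product = sum(x * y for x, y in zip(a[:N // 2], b[:N // 2]))
      dot_product += here.recv()         # receive(temp, 1)
      print(dot_product)                 # 204, as in the shared memory version
      p.join()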
Agenda
1. Introduction
2. The Flynn Classification of Computers
3. Types of Multi-Processors
4. Interconnection Networks
5. Memory Organization in Multi-Processors
6. Program Parallelism and Shared Variables
7. Multi-Computers [Book]
8. A Programmer’s View
9. Performance Considerations (Amdahl’s Law)
Speedup
• Let TP be the time needed to execute a program on P processors
• Speedup: SP = T1/TP
• Ideal: SP = P (linear speedup)
• Usually: sublinear speedup, due to communication, the algorithm, etc.
• Sometimes: superlinear speedup
Amdahl’s law
• Suppose a program has:
  - a parallelizable fraction f
  - and so a sequential fraction 1-f
• Then (Amdahl’s law):
  - SP = T1/TP = 1 / (1-f + f/P) = P / (P - f(P-1))
• If f = 0.95 (95%), then:
  - S16 = 16 / (16 - 0.95 x 15) = 9.14
  - S64 = 64 / (64 - 0.95 x 63) = 15.42
  - S1k ~ S1M ~ S100M < 20
• Consequence:
  - If 1-f is significant, we cannot have TP→0 even when P→∞
[Figure: execution time split into a sequential part of length 1-f and a parallel part of length f/P.]
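A quick check of these numbers in Python (a sketch; f = 0.95 as on the slide):

  def speedup(f, p):
      return 1.0 / ((1.0 - f) + f / p)   # SP = 1 / (1-f + f/P)

  for p in (16, 64, 1000, 10**6):
      print(p, round(speedup(0.95, p), 4))
  # prints 9.1429 and 15.4217 as on the slide; for large P the speedup
  # approaches, but never reaches, 1/(1-f) = 20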