CS 267: Applications of Parallel Computers
Lecture 4:
Shared Memory Multiprocessors
Kathy Yelick
http://www-inst.eecs.berkeley.edu/~cs267
11/7/2015
CS267, Yelick
Basic Shared Memory Architecture
• Processors all connected to a large shared memory
• Local caches for each processor
• Cost: a cache access is much cheaper than a main memory access
[Figure: processors P1, P2, …, Pn, each with a cache ($), connected through a network to shared memory]
• Simple to program, but hard to scale
• Now take a closer look at structure, costs, limits
Programming Shared Memory (review)
• Program is a collection of threads of control.
• Each thread has a set of private variables
  • e.g., local variables on the stack.
• Collectively, the threads share a set of shared variables
  • e.g., static variables, shared common blocks, global heap.
• Communication and synchronization happen through shared variables
[Figure: the shared portion of the address space holds x and y (e.g., y = ..x ...); each processor P also has a private portion holding its own i, res, and s]
Outline
• Historical perspective
• Bus-based machines
• Pentium SMP
• IBM SP node
• Directory-based (CC-NUMA) machine
• Origin 2000
• Global address space machines
• Cray T3D and (sort of) T3E
60s Mainframe Multiprocessors
• Enhance memory capacity or I/O capabilities by adding
memory modules or I/O devices
[Figure: processors (Proc) and I/O channels (IOC) with I/O devices, connected through an interconnect to multiple memory modules (Mem)]
• How do you enhance processing capacity?
• Add processors
• Already need an interconnect between slow memory banks and processor + I/O channels
  • cross-bar or multistage interconnection network
[Figure: cross-bar connecting memory modules (M) to processors (P) and I/O channels (IO)]
70s Breakthrough: Caches
• Memory system scaled by adding memory modules
• Both bandwidth and capacity
• Memory was still a bottleneck
• Enter… Caches!
[Figure: a cache holding block A sits between the fast processor (P) and the slow memory, across the interconnect; the other port may serve an I/O device or another processor]
• Cache does two things:
• Reduces average access time (latency)
• Reduces bandwidth requirements to memory
Technology Perspective
         Capacity          Speed
Logic:   2x in 3 years     2x in 3 years
DRAM:    4x in 3 years     1.4x in 10 years
Disk:    2x in 3 years     1.4x in 10 years

DRAM generations (1000:1 in size, 2:1 in cycle time):
Year    Size     Cycle Time
1980    64 Kb    250 ns
1983    256 Kb   220 ns
1986    1 Mb     190 ns
1989    4 Mb     165 ns
1992    16 Mb    145 ns
1995    64 Mb    120 ns

[Plot: SpecInt and SpecFP performance vs. year, 1986-1996]
Approaches to Building Parallel Machines
[Figure: three organizations, in order of increasing scale —
  Shared Cache: P1…Pn through a switch to an interleaved first-level $ and interleaved main memory;
  Centralized Memory ("Dance Hall", UMA): P1…Pn, each with its own $, through an interconnection network to shared memory modules;
  Distributed Memory (NUMA): P1…Pn, each with its own $ and local memory, connected by an interconnection network]
80s Shared Memory: Shared Cache
• Alliant FX-8
  • early 80's
  • eight 68020s with x-bar to 512 KB interleaved cache
• Encore & Sequent
  • first 32-bit micros (N32032)
  • two to a board with a shared cache
[Figure: P1…Pn through a switch to an interleaved first-level $ and interleaved main memory]
[Plot: transistors per chip vs. year, 1965-2005, for the i4004, M68K, i80x86 line (i8086, i80286, i80386, i80486, Pentium) and MIPS line (SU MIPS, R3010, R4400, R10000)]
Shared Cache: Advantages and Disadvantages
Advantages
• Cache placement identical to single cache
  • only one copy of any cached block
• Fine-grain sharing is possible
• Interference
  • One processor may prefetch data for another
• Can share data within a line without moving the line
Disadvantages
• Bandwidth limitation
• Interference
  • One processor may flush another processor's data
Limits of Shared Cache Approach
[Figure: PROCs sharing interleaved caches, connected over a 5.2 GB/s path to memory modules (MEM) and I/O]
Assume a 1 GHz processor w/o cache:
  => 4 GB/s inst BW per processor (32-bit)
  => 1.2 GB/s data BW at 30% load-store
Need 5.2 GB/s of bus bandwidth per processor!
• Typical bus bandwidth is closer to 1 GB/s
Approaches to Building Parallel Machines
[Figure: three organizations, in order of increasing scale —
  Shared Cache: P1…Pn through a switch to an interleaved first-level $ and interleaved main memory;
  Centralized Memory ("Dance Hall", UMA): P1…Pn, each with its own $, through an interconnection network to shared memory modules;
  Distributed Memory (NUMA): P1…Pn, each with its own $ and local memory, connected by an interconnection network]
Intuitive Memory Model
• Reading an address should return the last value written
to that address
• Easy in uniprocessors
• except for I/O
• Cache coherence problem in MPs is more pervasive
and more performance critical
• More formally, this is called sequential consistency:
“A multiprocessor is sequentially consistent if the result of any
execution is the same as if the operations of all the processors
were executed in some sequential order, and the operations of
each individual processor appear in this sequence in the order
specified by its program.” [Lamport, 1979]
Cache Coherence: Semantic Problem
• p1 and p2 both have cached copies of x (as 0)
• p1 writes x=1
• May “write through” to memory
• p2 reads x, but gets the “stale” cached copy
[Figure: memory holds x=0; after the write, p1's cache holds x=1 while p2's cache still holds the stale x=0]
Cache Coherence: Semantic Problem
What does this imply about program behavior?
• No process ever sees "garbage" values, i.e., half of one value mixed with half of another
• Processors always see values written by some processor
• The value seen is constrained by program order on all processors
  • Time always moves forward
• Example:
  • P1 writes x=1, then writes y=1
  • P2 reads y, then reads x

  Initially x=0, y=0
  P1:  x = 1         P2:  ... = y
       y = 1              ... = x

  If P2 sees the new value of y, it must also see the new value of x.
Snoopy Cache-Coherence Protocols
[Figure: P1…Pn, each with a cache ($) holding state, address, and data per block, on a shared bus with memory and I/O devices; each cache both initiates cache-memory transactions and snoops the bus]
• Bus is a broadcast medium & caches know what they have
• Cache controller "snoops" all transactions on the shared bus
  • A transaction is relevant if it involves a cache block currently contained in this cache
  • take action to ensure coherence
    • invalidate, update, or supply value
    • depends on state of the block and the protocol
Basic Choices in Cache Coherence
• Cache may keep information such as:
• Valid/invalid
• Dirty (inconsistent with memory)
• Shared (present in other caches)
• When a processor executes a write operation to shared
data, basic design choices are:
• Write thru: do the write in memory as well as cache
• Write back: wait and do the write later, when the item is flushed
• Update: give all other processors the new value
• Invalidate: all other processors remove from cache
Example: Write-thru Invalidate
[Figure: write-thru invalidate example — (1) P1 reads u, getting u:5 from memory; (2) P3 reads u, getting u:5; (3) P3 writes u=7, which is written through to memory and invalidates P1's cached copy; (4) P1 reads u again; (5) P2 reads u]
• Update and write-thru both use more memory
bandwidth if there are writes to the same address
• Update to the other caches
• Write-thru to memory
Write-Back/Ownership Schemes
• When a single cache has ownership of a block, processor
writes do not result in bus writes, thus conserving
bandwidth.
• reads by others cause it to return to “shared” state
• Most bus-based multiprocessors today use such
schemes.
• Many variants of ownership-based protocols
Sharing: A Performance Problem
• True sharing
• Frequent writes to a variable can create a bottleneck
• OK for read-only or infrequently written data
• Technique: make copies of the value, one per processor, if this
is possible in the algorithm
• Example problem: the data structure that stores the
freelist/heap for malloc/free
• False sharing
• Cache blocks may also introduce artifacts
• Two distinct variables in the same cache block
• Technique: allocate data used by each processor contiguously,
or at least avoid interleaving
• Example problem: an array of ints, one written frequently by
each processor
Limits of Bus-Based Shared Memory
[Figure: PROCs with caches on a shared bus to memory modules (MEM) and I/O; each processor now demands only 140 MB/s of bus bandwidth]
Assume: 1 GHz processor w/o cache
  => 4 GB/s inst BW per processor (32-bit)
  => 1.2 GB/s data BW at 30% load-store
Suppose 98% inst hit rate and 95% data hit rate
  => 80 MB/s inst BW per processor
  => 60 MB/s data BW per processor
  => 140 MB/s combined BW per processor
Assuming 1 GB/s bus bandwidth
  ∴ 8 processors will saturate the bus
Engineering: Intel Pentium Pro Quad
[Figure: four P-Pro modules (CPU with 256-KB L2 $, interrupt controller, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller with MIU drives 1-, 2-, or 4-way interleaved DRAM, and PCI bridges connect PCI buses with PCI I/O cards]
SMP for the masses:
• All coherence and multiprocessing glue in processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth
Engineering: SUN Enterprise
[Figure: Gigaplane bus (256-bit data, 41-bit address, 83 MHz) connecting CPU/mem cards (two processors, each with $ and $2, plus memory controller and bus interface/switch) and I/O cards (bus interface, three SBUSes, 100bT, SCSI, two FiberChannel)]
• Proc + mem card vs. I/O card
• 16 cards of either type
• All memory accessed over bus, so symmetric
• Higher bandwidth, higher latency bus
Directory-Based Cache-Coherence
90s Scalable, Cache-Coherent Multiprocessors
[Figure: P1…Pn, each with a cache, on an interconnection network; each memory block has a directory entry holding a dirty bit and per-processor presence bits]
SGI Origin 2000
[Figure: two Origin nodes — each Hub chip connects two processors (1-4 MB L2 caches each), main memory (1-4 GB) with its directory, and an Xbow I/O crossbar — joined by the interconnection network]
• Single 16"-by-11" PCB
• Directory state in same or separate DRAMs, accessed in parallel
• Up to 512 nodes (2 processors per node)
• With 195 MHz R10K processor, peak 390 MFLOPS or 780 MIPS per proc
• Peak SysAD bus bw is 780 MB/s, so also Hub-Mem
• Hub to router chip and to Xbow is 1.56 GB/s (both are off-board)
Caches and Scientific Computing
• Caches tend to perform worst on demanding
applications that operate on large data sets
• transaction processing
• operating systems
• sparse matrices
• Modern scientific codes use tiling/blocking to become
cache friendly
• easier for dense codes than for sparse
• tiling and parallelism are similar transformations
Approaches to Building Parallel Machines
[Figure: three organizations, in order of increasing scale —
  Shared Cache: P1…Pn through a switch to an interleaved first-level $ and interleaved main memory;
  Centralized Memory ("Dance Hall", UMA): P1…Pn, each with its own $, through an interconnection network to shared memory modules;
  Distributed Memory (NUMA): P1…Pn, each with its own $ and local memory, connected by an interconnection network]
Scalable Global Address Space
Global Address Space: Structured Memory
[Figure: a scalable network carries request messages (tag, src, addr, read, dest) and response messages (src, tag, data) between nodes; each node has a processor P with cache $ and mmu, memory M, and pseudo-proc/pseudo-memory controllers; the processor simply issues Ld R <- Addr]
• Processor performs load
• Pseudo-memory controller turns it into a message
transaction with a remote controller, which performs the
memory operation and replies with the data.
• Examples: BBN butterfly, Cray T3D
Cray T3D: Global Address Space machine
• 2048 Alphas (150 MHz, 16 or 64 MB each) + fast network
  • 43-bit virtual address space, 32-bit physical
  • 32-bit and 64-bit load/store + byte manipulation on regs.
  • no L2 cache
  • non-blocking stores, load/store re-ordering, memory fence
  • load-lock / store-conditional
• Direct global memory access via external segment regs
  • DTB annex, 32 entries, remote processor number and mode
  • atomic swap between special local reg and memory
  • special fetch&inc register
  • global-OR, global-AND barriers
• Prefetch Queue
• Block Transfer Engine
• User-level Message Queue
T3D Local Read (average latency)
[Plot: average local read latency (ns) vs. stride (8 bytes to 4 MB), one curve per array size from 8 KB to 8 MB. L1 cache size: 8 KB; line size: 32 bytes; no TLB. Cache access time: 6.7 ns (1 cycle); memory access time: 155 ns (23 cycles); a DRAM page miss adds 100 ns (15 cycles)]
T3D Remote Read Uncached
• 3-4x local memory read!
[Plot: uncached remote read latency (ns) vs. stride (8 bytes to 4 MB), one curve per array size from 8 KB to 8 MB; remote reads take about 610 ns (91 cycles), well above the DEC Alpha and local T3D curves; a DRAM-page miss adds 100 ns; network latency adds an additional 13-20 ns (2-3 cycles) per hop]
Cray T3E
[Figure: T3E node — processor P with $, local memory, and a combined memory controller/network interface (NI), connected to external I/O and to the 3D torus via an X/Y/Z switch]
• Scales up to 1024 processors, 480MB/s links
• Memory system similar to T3D
• Memory controller generates request message for non-local references
• No hardware mechanism for coherence
• Somewhat less integrated
What to Take Away?
• Programming shared memory machines
  • May allocate data in a large shared region without too many worries about where
  • Memory hierarchy is critical to performance
    • Even more so than on uniprocessors, due to coherence traffic
  • For performance tuning, watch sharing (both true and false)
• Semantics
  • Need to lock access to shared variables for read-modify-write
  • Sequential consistency is the natural semantics
    • Architects worked hard to make this work
    • Caches are coherent with buses or directories
    • No caching of remote data on shared address space machines
  • But compiler and processor may still get in the way
    • Non-blocking writes, read prefetching, code motion…
Where are things going
• High-end
  • collections of almost complete workstations/SMPs on a high-speed network (Millennium, IBM SP machines)
  • with specialized communication assist integrated with the memory system to provide global access to shared data (??)
• Mid-end
  • almost all servers are bus-based CC SMPs
  • high-end servers are replacing the bus with a network
    • Sun Enterprise 10000, Cray SV1, HP/Convex SPP
    • SGI Origin 2000
  • volume approach is Pentium Pro quad pack + SCI ring
    • Sequent, Data General
• Low-end
  • SMP desktop is here
• Major change ahead
  • SMP on a chip as a building block