CS 267: Applications of Parallel Computers
Lecture 4: Shared Memory Multiprocessors
Kathy Yelick
http://www-inst.eecs.berkeley.edu/~cs267
Basic Shared Memory Architecture
• Processors all connected to a large shared memory
• Local caches for each processor
• Cost: much cheaper to access the cache than main memory
[Figure: processors P1 ... Pn, each with a cache ($), connected through a network to shared memory]
• Simple to program, but hard to scale
• Now take a closer look at structure, costs, limits

Programming Shared Memory (review)
• Program is a collection of threads of control
• Each thread has a set of private variables
• e.g., local variables on the stack
• Collectively they share a set of shared variables
• e.g., static variables, shared common blocks, global heap
• Communication and synchronization happen through shared variables
[Figure: shared portion of the address space (e.g., x = ..., y = ..x..) visible to all processors; each processor also has its own private variables]
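To make the private/shared split concrete, here is a minimal C/pthreads sketch (illustrative only; the array names, sizes, and thread count are mine, not the slides'): data and partial are shared variables that every thread can see and communicate through, while id and local live on each thread's stack and are private.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000

    /* Shared variables: visible to every thread. */
    double data[N];                /* read by all threads */
    double partial[NTHREADS];      /* each thread writes only its own slot */

    void *sum_part(void *arg) {
        int id = *(int *)arg;      /* private: argument copy on this thread's stack */
        double local = 0.0;        /* private accumulator */
        for (int i = id; i < N; i += NTHREADS)
            local += data[i];
        partial[id] = local;       /* communicate the result through shared memory */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < N; i++) data[i] = 1.0;
        for (int i = 0; i < NTHREADS; i++) { ids[i] = i; pthread_create(&t[i], NULL, sum_part, &ids[i]); }
        double total = 0.0;
        for (int i = 0; i < NTHREADS; i++) { pthread_join(t[i], NULL); total += partial[i]; }
        printf("total = %g\n", total);
        return 0;
    }

Because each thread writes only its own slot of partial, no locking is needed here; read-modify-write updates to a single shared variable are a different story, taken up near the end of the lecture.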
Outline
• Historical perspective
• Bus-based machines
• Pentium SMP
• IBM SP node
• Directory-based (CC-NUMA) machines
• Origin 2000
• Global address space machines
• Cray T3D and (sort of) T3E

60s Mainframe Multiprocessors
• Enhance memory capacity or I/O capabilities by adding memory modules or I/O devices
• How do you enhance processing capacity? Add processors
• There is already an interconnect between the slow memory banks and the processor + I/O channels
• cross-bar or multistage interconnection network
[Figure: memory modules (Mem) and I/O controllers (IOC) connected through an interconnect to processors; a cross-bar joining memories (M), processors (P), and I/O (IO)]

70s Breakthrough: Caches
• Memory system scaled by adding memory modules
• Both bandwidth and capacity
• Memory was still a bottleneck
• Enter... caches!
[Figure: slow memory behind an interconnect; a cache sits next to each fast processor or I/O device and holds copies of recently used locations (e.g., A = 17)]
• A cache does two things:
• Reduces average access time (latency)
• Reduces bandwidth requirements to memory

Technology Perspective
• Capacity and speed improvement rates:
              Capacity          Speed
    Logic:    2x in 3 years     2x in 3 years
    DRAM:     4x in 3 years     1.4x in 10 years
    Disk:     2x in 3 years     1.4x in 10 years
• DRAM generations: size grew about 1000:1 while cycle time improved only about 2:1
    Year    Size     Cycle time
    1980    64 Kb    250 ns
    1983    256 Kb   220 ns
    1986    1 Mb     190 ns
    1989    4 Mb     165 ns
    1992    16 Mb    145 ns
    1995    64 Mb    120 ns
[Chart: SpecInt and SpecFP performance growth, 1986-1996]

Approaches to Building Parallel Machines
[Figure: three organizations in order of increasing scale: (1) Shared Cache: P1 ... Pn behind a switch sharing an interleaved first-level cache and interleaved main memory; (2) Centralized Memory ("dance hall", UMA): P1 ... Pn, each with a private cache ($), connected by an interconnection network to shared memory modules; (3) Distributed Memory (NUMA): each processor/cache pair has its own memory, all joined by an interconnection network]

80s Shared Memory: Shared Cache
• Alliant FX-8, early 80s
• eight 68020s with a cross-bar to a 512 KB interleaved cache
• Encore & Sequent
• first 32-bit micros (NS32032)
• two to a board with a shared cache
[Chart: transistors per chip vs. year, 1965-2005, for the MIPS (R3010, R4400, R10000), Intel x86 (i4004, i8086, i80286, i80386, i80486, Pentium), and M68K families]

Shared Cache: Advantages and Disadvantages
Advantages
• Cache placement is identical to a single cache
• only one copy of any cached block
• Fine-grain sharing is possible
• Interference can help: one processor may prefetch data for another
• Can share data within a line without moving the line
Disadvantages
• Bandwidth limitation
• Interference can hurt: one processor may flush another processor's data

Limits of Shared Cache Approach
• Assume a 1 GHz processor without a cache
• => 4 GB/s instruction bandwidth per processor (32-bit instructions: 1 GHz x 4 bytes)
• => 1.2 GB/s data bandwidth at a 30% load-store fraction (0.3 x 1 GHz x 4 bytes)
• Need 5.2 GB/s of bus bandwidth per processor!
• Typical bus bandwidth is closer to 1 GB/s
[Figure: processors, each demanding 5.2 GB/s, behind shared caches and a memory/I/O bus]

Approaches to Building Parallel Machines
[Figure repeated: Shared Cache, Centralized Memory (dance hall, UMA), Distributed Memory (NUMA), in order of increasing scale]

Intuitive Memory Model
• Reading an address should return the last value written to that address
• Easy in uniprocessors, except for I/O
• The cache coherence problem in multiprocessors is more pervasive and more performance-critical
• More formally, this is called sequential consistency: "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]

Cache Coherence: Semantic Problem
• p1 and p2 both have cached copies of x (as 0)
• p1 writes x=1, which may "write through" to memory
• p2 reads x, but gets the "stale" cached copy x=0
[Figure: p1's cache holds x=1 while p2's cache still holds the stale x=0]

Cache Coherence: Semantic Problem (continued)
What does this imply about program behavior?
• No process ever sees "garbage" values, i.e., half of one value and half of another
• Processors always see values written by some processor
• The value seen is constrained by program order on all processors
• Time always moves forward
• Example:
• P1 writes x=1, then writes y=1
• P2 reads y, then reads x
• If P2 sees the new value of y, it must also see the new value of x
[Figure: initially x=0, y=0; P1 executes x=1; y=1 while P2 executes ...=y; ...=x]
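The x/y example maps directly onto the familiar flag-synchronization idiom. Here is a minimal C11/pthreads sketch (mine, not the slides'; the use of sequentially consistent atomics is an assumption about how one would write this today): once P2 observes y == 1 it must also observe x == 1, and declaring the variables atomic asks the compiler and hardware not to reorder the accesses, exactly the "compiler and processor may still get in the way" issue the take-away slide returns to.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    /* Shared variables, initially 0, as in the slide's example. */
    atomic_int x = 0, y = 0;

    void *p1(void *arg) {                /* P1: writes x, then y */
        atomic_store(&x, 1);
        atomic_store(&y, 1);
        return NULL;
    }

    void *p2(void *arg) {                /* P2: reads y, then x */
        while (atomic_load(&y) == 0)     /* wait until the new value of y is visible */
            ;
        /* Under sequential consistency, seeing y == 1 implies x == 1 here. */
        printf("x = %d\n", atomic_load(&x));
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, p1, NULL);
        pthread_create(&b, NULL, p2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }

With plain (non-atomic) int variables, the compiler or a weakly ordered processor would be free to reorder the writes or the reads, and P2 could legitimately print x = 0.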
Snoopy Cache-Coherence Protocols
[Figure: P1 ... Pn with caches ($) on a shared bus to memory and I/O devices; each cache holds state, address, and data per block, and a bus snoop watches every cache-memory transaction]
• The bus is a broadcast medium and caches know what they have
• The cache controller "snoops" all transactions on the shared bus
• A transaction is relevant if it involves a cache block currently contained in this cache
• take action to ensure coherence: invalidate, update, or supply the value
• the action depends on the state of the block and the protocol

Basic Choices in Cache Coherence
• A cache may keep information such as:
• Valid/invalid
• Dirty (inconsistent with memory)
• Shared (present in other caches)
• When a processor writes shared data, the basic design choices are:
• Write-through: do the write in memory as well as in the cache
• Write-back: wait and do the write later, when the block is flushed
• Update: give all other processors the new value
• Invalidate: all other processors remove the block from their caches

Example: Write-thru Invalidate
[Figure: P1, P2, P3 with caches on a bus to memory and I/O devices; u is initially 5 in memory. (1) P1 reads u, (2) P3 reads u, (3) P3 writes u=7, which is written through to memory and invalidates the other cached copies, so (4) P1 and (5) P2 subsequently read the new value]
• Update and write-through both use more memory bandwidth if there are repeated writes to the same address
• Update traffic goes to the other caches
• Write-through traffic goes to memory

Write-Back/Ownership Schemes
• When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth
• reads by others cause the block to return to the "shared" state
• Most bus-based multiprocessors today use such schemes
• There are many variants of ownership-based protocols

Sharing: A Performance Problem
• True sharing
• Frequent writes to a variable can create a bottleneck
• OK for read-only or infrequently written data
• Technique: make copies of the value, one per processor, if the algorithm allows it
• Example problem: the data structure that stores the free list/heap for malloc/free
• False sharing
• A cache block can also introduce artifacts
• Two distinct variables that happen to sit in the same cache block
• Technique: allocate data used by each processor contiguously, or at least avoid interleaving
• Example problem: an array of ints, one element written frequently by each processor
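Here is a minimal C/pthreads sketch of the false-sharing example just described (illustrative; the names, iteration count, and the assumed 64-byte cache line are mine): each thread writes only its own counter, but in the packed layout neighboring counters share a cache line, so the line ping-pongs between caches, while the padded layout gives each counter a line of its own.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS 10000000

    /* Packed layout: neighboring counters share a cache line (false sharing). */
    int packed[NTHREADS];

    /* Padded layout: each counter gets its own (assumed 64-byte) cache line. */
    struct { int value; char pad[60]; } padded[NTHREADS];

    void *bump_packed(void *arg) {
        int id = *(int *)arg;
        for (long i = 0; i < ITERS; i++)
            packed[id]++;                 /* invalidates the line in the other caches */
        return NULL;
    }

    void *bump_padded(void *arg) {
        int id = *(int *)arg;
        for (long i = 0; i < ITERS; i++)
            padded[id].value++;           /* private line: no coherence traffic from neighbors */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) { ids[i] = i; pthread_create(&t[i], NULL, bump_packed, &ids[i]); }
        for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, bump_padded, &ids[i]);
        for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
        printf("done: packed[0]=%d padded[0].value=%d\n", packed[0], padded[0].value);
        return 0;
    }

Timing the two phases (for example with clock_gettime) on a cache-coherent SMP typically shows the padded version running substantially faster, even though both do exactly the same arithmetic.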
Limits of Bus-Based Shared Memory
• Assume a 1 GHz processor without a cache
• => 4 GB/s instruction bandwidth per processor (32-bit)
• => 1.2 GB/s data bandwidth at a 30% load-store fraction
• Suppose a 98% instruction hit rate and a 95% data hit rate
• => 80 MB/s instruction bandwidth per processor
• => 60 MB/s data bandwidth per processor
• => 140 MB/s combined bus bandwidth per processor
• Assuming 1 GB/s of bus bandwidth, about 8 processors will saturate the bus
[Figure: processors with caches sharing a bus to memory and I/O; the caches filter the 5.2 GB/s demand per processor down to 140 MB/s of bus traffic]

Engineering: Intel Pentium Pro Quad
[Figure: four P-Pro modules, each with a CPU, interrupt controller, 256-KB L2 cache, and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller with MIU to 1-, 2-, or 4-way interleaved DRAM; PCI bridges to PCI buses and PCI I/O cards]
SMP for the masses:
• All coherence and multiprocessing glue is in the processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth

Engineering: SUN Enterprise
[Figure: CPU/memory cards (two processors with L2 caches ($2), memory controller, and memory) and I/O cards (bus interface, SBUS slots, 100bT, SCSI, FiberChannel) on the Gigaplane bus (256-bit data, 41-bit address, 83 MHz)]
• Up to 16 cards of either type: processor + memory, or I/O
• All memory is accessed over the bus, so the machine is symmetric
• Higher bandwidth, higher latency bus

Directory-Based Cache-Coherence
90s: scalable, cache-coherent multiprocessors
[Figure: P1 ... Pn, each with a cache and a memory module, connected by an interconnection network; each memory block has a directory entry holding a dirty bit and per-processor presence bits]

SGI Origin 2000
[Figure: two nodes, each with two processors (1-4 MB L2 caches), a Hub, main memory (1-4 GB) with directory state, and an Xbow I/O crossbar, joined by an interconnection network]
• Single 16"-by-11" PCB
• Directory state in the same or separate DRAMs, accessed in parallel
• Up to 512 nodes (2 processors per node)
• With a 195 MHz R10K processor, peak 390 MFLOPS or 780 MIPS per processor
• Peak SysAD bus bandwidth is 780 MB/s, as is Hub-to-memory bandwidth
• Hub to router chip and to Xbow is 1.56 GB/s (both are off-board)

Caches and Scientific Computing
• Caches tend to perform worst on demanding applications that operate on large data sets
• transaction processing
• operating systems
• sparse matrices
• Modern scientific codes use tiling/blocking to become cache friendly
• easier for dense codes than for sparse
• tiling and parallelism are similar transformations

Approaches to Building Parallel Machines
[Figure repeated: Shared Cache, Centralized Memory (dance hall, UMA), Distributed Memory (NUMA), in order of increasing scale]

Scalable Global Address Space
[Figure: structured memory over a scalable network; each node has a processor, cache, MMU, and memory, plus pseudo-processor and pseudo-memory modules that turn a remote load (ld R <- Addr) into read-request and read-response messages carrying source, destination, address, tag, and data]
• The processor performs a load
• The pseudo-memory controller turns it into a message transaction with a remote controller, which performs the memory operation and replies with the data
• Examples: BBN Butterfly, Cray T3D
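A minimal sketch of that load-to-message translation (my own illustration, not the actual Butterfly or T3D hardware; the address split, message formats, and function names are invented): a global address is divided into a node number and a local offset, and a load of a remote address becomes a read-request/read-response message pair handled by the remote node's controller.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical global address layout: high bits = node, low bits = offset. */
    #define NODE_BITS 8
    #define OFFSET_MASK ((UINT64_C(1) << (64 - NODE_BITS)) - 1)

    #define NNODES 4
    #define MEM_WORDS 1024

    static uint64_t memory[NNODES][MEM_WORDS];   /* each node's local memory */
    static int my_node = 0;                      /* the node issuing the load */

    typedef struct { int src, dest; uint64_t addr; } read_request;
    typedef struct { uint64_t data; } read_response;

    /* "Remote controller": performs the memory operation and replies with the data. */
    static read_response serve_read(read_request req) {
        read_response rsp = { memory[req.dest][req.addr & OFFSET_MASK] };
        return rsp;  /* in real hardware the reply travels back over the network */
    }

    /* "Pseudo-memory controller": local loads go to local memory,
       remote loads become message transactions. */
    static uint64_t global_load(uint64_t gaddr) {
        int node = (int)(gaddr >> (64 - NODE_BITS));
        if (node == my_node)
            return memory[my_node][gaddr & OFFSET_MASK];
        read_request req = { my_node, node, gaddr & OFFSET_MASK };
        return serve_read(req).data;
    }

    int main(void) {
        memory[2][5] = 42;                       /* word 5 on node 2 */
        uint64_t gaddr = ((uint64_t)2 << (64 - NODE_BITS)) | 5;
        printf("remote load returned %llu\n", (unsigned long long)global_load(gaddr));
        return 0;
    }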
Cray T3D: Global Address Space Machine
• 2048 Alphas (150 MHz, 16 or 64 MB each) + a fast network
• 43-bit virtual address space, 32-bit physical
• 32-bit and 64-bit load/store + byte manipulation on registers
• no L2 cache
• non-blocking stores, load/store reordering, memory fence
• load-locked / store-conditional
• Direct global memory access via external segment registers
• DTB annex, 32 entries, holding the remote processor number and mode
• atomic swap between a special local register and memory
• special fetch&inc register
• global-OR, global-AND barriers
• Prefetch queue
• Block transfer engine
• User-level message queue

T3D Local Read (average latency)
• No TLB! L1 cache size: 8 KB, line size: 32 bytes
• Cache access time: 6.7 ns (1 cycle)
• Memory access time: 155 ns (23 cycles)
• DRAM page miss: 100 ns (15 cycles)
[Chart: average local read latency (ns) vs. stride, from 8 bytes to 4 MB, for array sizes from 8 KB to 8 MB]

T3D Remote Read
• Uncached, roughly 3-4x the cost of a local memory read
• 610 ns (91 cycles) typical remote read latency
• 100 ns DRAM page miss, as for local reads
• Network latency: an additional 13-20 ns (2-3 cycles) per hop
[Chart: remote read latency (ns) vs. stride for array sizes from 8 KB to 8 MB, with local T3D and DEC Alpha latencies shown for comparison]

Cray T3E
[Figure: processor with cache and memory behind a memory controller and network interface, connected through X, Y, Z switches to a 3D torus, plus external I/O]
• Scales up to 1024 processors, 480 MB/s links
• Memory system similar to the T3D
• The memory controller generates a request message for non-local references
• No hardware mechanism for coherence
• Somewhat less integrated

What to Take Away?
• Programming shared memory machines
• May allocate data in a large shared region without too many worries about where
• The memory hierarchy is critical to performance
• Even more so than on uniprocessors, due to coherence traffic
• For performance tuning, watch sharing (both true and false)
• Semantics
• Need to lock access to shared variables for read-modify-write (a minimal sketch follows at the end of this section)
• Sequential consistency is the natural semantics
• Architects worked hard to make this work
• Caches are coherent with buses or directories
• No caching of remote data on shared address space machines
• But the compiler and processor may still get in the way
• Non-blocking writes, read prefetching, code motion...

Where Are Things Going?
• High-end
• collections of almost complete workstations/SMPs on a high-speed network (Millennium, IBM SP machines)
• with a specialized communication assist integrated with the memory system to provide global access to shared data (??)
• Mid-end
• almost all servers are bus-based cache-coherent SMPs
• high-end servers are replacing the bus with a network
• Sun Enterprise 10000, Cray SV1, HP/Convex SPP, SGI Origin 2000
• the volume approach is Pentium Pro quad-packs + an SCI ring
• Sequent, Data General
• Low-end
• the SMP desktop is here
• Major change ahead
• SMP on a chip as a building block
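Following up on the take-away about read-modify-write, here is a minimal C11 sketch (mine, not the slides'): a plain counter++ on a shared variable is a read-modify-write that can lose updates, so it must be protected by a lock or expressed as an atomic operation, the software analogue of the T3D's fetch&inc register.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS 100000

    int unsafe = 0;                /* plain shared counter: a data race, updates can be lost */
    atomic_int safe = 0;           /* atomic counter: fetch-and-add is indivisible */

    void *work(void *arg) {
        for (int i = 0; i < ITERS; i++) {
            unsafe++;                          /* unsynchronized read-modify-write */
            atomic_fetch_add(&safe, 1);        /* like a fetch&inc register */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, work, NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
        /* "safe" always equals NTHREADS*ITERS; "unsafe" is usually smaller. */
        printf("unsafe = %d, safe = %d (expected %d)\n",
               unsafe, atomic_load(&safe), NTHREADS * ITERS);
        return 0;
    }

A pthread_mutex around the increment would give the same guarantee as the atomic; the atomic form simply maps more directly onto the hardware primitives (swap, fetch&inc, load-locked/store-conditional) listed on the T3D slide.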