Transcript Document
Parallel Programming Platforms • Microprocessor architecture and how to make it parallel • Pipelining, superscalar architecture • VLIW • Microprocessor memory and how to make it parallel • Caches, etc.
Platforms • Architecture • Networks (ad nauseam) • Cache coherence • Routing • Mapping processes to processors (scary)
Problems • Data dependency – load r1,@1000 then add r1,@1004 – resolved in hardware at runtime (complex) – depends on coding style (arghh!) • Resource dependency – two instructions need the floating point unit • Branch dependency – speculative scheduling and rollbacks – dynamic instruction issue (choose instructions from a window)
VLIW processors • Where waste lives • Vertical – no instructions issued in a cycle • Horizontal – parts of an instruction word unused • So use the compiler to • Detect dependencies • Schedule instructions » Unroll loops, branch prediction, speculative execution
Memory problems • Memory performance • Latency • Bandwidth • Cache • Faster memory between processor and DRAM • Cache works if there is repeated reference to the same data item – temporal locality
Spatial locality
More memory tricks • Multithreading • Hides latency • Can increase bandwidth demand because each thread gets only a small cache residency • Prefetching • Advance loads • Compilers aggressively advance loads • Decreases latency, can increase bandwidth demand • Both require more hardware
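The temporal/spatial locality point above can be sketched with a classic access-pattern example. This is illustrative only: in pure Python the lists involved do not give the hardware effect directly, but the two traversal orders below are exactly the row-major vs. strided patterns that make or break cache performance in C-style arrays.

```python
# Sketch: spatial locality. Traversing a row-major 2D array row by row touches
# consecutive memory, so each fetched cache line is fully used; column-by-column
# traversal (stride n) touches one element per line before moving on.

def make_matrix(n):
    """n x n matrix stored as nested lists (row-major)."""
    return [[i * n + j for j in range(n)] for i in range(n)]

def sum_row_major(a):
    """Inner loop walks adjacent elements: good spatial locality."""
    total = 0
    for row in a:
        for x in row:
            total += x
    return total

def sum_col_major(a):
    """Stride-n traversal: each access lands on a different cache line."""
    n = len(a)
    total = 0
    for j in range(n):
        for i in range(n):
            total += a[i][j]
    return total

a = make_matrix(256)
assert sum_row_major(a) == sum_col_major(a)  # same answer, different locality
```

Both functions compute the same sum; on real hardware with contiguous arrays, the row-major version runs markedly faster.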
Parallel platforms • Logical organization – how the programmer sees things • Control – how to express parallel tasks • Communication model – how tasks interact • Physical organization – hardware
Control Structure
SIMD, MIMD, SPMD • SIMD • Simple architecture • Structured computations – arrays • Early machines – Illiac IV, MPP, CM-2 • Became obsolete quickly (processors change) • MIMD • Easy to build from off-the-shelf parts • SPMD – single program, multiple data streams
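The SPMD bullet above can be made concrete with a minimal sketch: one program text, many data streams. The ranks here run sequentially in a loop purely for illustration; under MPI or `multiprocessing` each rank would be a concurrently executing process. Function and variable names are my own, not from the source.

```python
# SPMD sketch: every "rank" executes the same function on its own partition
# of the data, then the partial results are combined (a reduction).

def spmd_sum(rank, nprocs, data):
    """The single program: each rank runs this on its own slice."""
    chunk = len(data) // nprocs
    lo, hi = rank * chunk, (rank + 1) * chunk
    return sum(data[lo:hi])          # local computation on local data

data = list(range(100))
nprocs = 4
partials = [spmd_sum(r, nprocs, data) for r in range(nprocs)]
total = sum(partials)                # the "reduction" step
assert total == sum(data)
```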
Communication Model: Shared Address Space • Multiprocessors – SPMD + shared address space • UMA multiprocessor – same time to access all words • NUMA multiprocessor – some words take more time
Shared memory pictures
Shared memory • Read-only accesses work the same as on serial computers • Read-write accesses require mutual exclusion – threads (POSIX) and OpenMP use locks • Cache coherence necessary – serious hardware
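The mutual-exclusion point above can be sketched with Python's `threading.Lock`, standing in for a POSIX mutex or an OpenMP critical section. Without the lock, the read-modify-write on the shared counter could interleave and lose updates.

```python
# Mutual exclusion sketch: concurrent read-write of shared data needs a lock.

import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:               # only one thread in the critical section
            counter += 1         # read-modify-write is now atomic

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 40000          # no lost updates
```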
Physical organization • PRAM – parallel random access machine • Infinite memory, uniform access time from any processor to any memory word • Common clock, but processors may execute different instructions in the same cycle
PRAM memory access models • EREW – exclusive read, exclusive write – no concurrent reads or writes – weakest model • CREW – concurrent read, exclusive write – writes are serialized • ERCW – exclusive read, concurrent write – reads are serialized • CRCW – concurrent read, concurrent write – most powerful model, anything goes
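For the CRCW case, "anything goes" still requires the model to say what a cell holds after several processors write it in one step. A minimal sketch of the standard resolution policies (the policy names are standard PRAM terminology; the function is my own illustration):

```python
# CRCW write-conflict sketch: several processors write one cell in one step.
#   COMMON    - all writers must write the same value
#   ARBITRARY - any one writer wins
#   PRIORITY  - the lowest-numbered processor wins

def crcw_write(writes, policy):
    """writes: list of (processor_id, value) targeting one cell in one step."""
    if not writes:
        return None
    values = [v for _, v in writes]
    if policy == "COMMON":
        assert len(set(values)) == 1, "COMMON requires identical values"
        return values[0]
    if policy == "ARBITRARY":
        return values[0]                 # any choice is legal
    if policy == "PRIORITY":
        return min(writes)[1]            # smallest processor id wins
    raise ValueError(policy)

assert crcw_write([(3, 7), (1, 9)], "PRIORITY") == 9
```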
Processor networks static and dynamic
Switches and interfaces • Switches map input ports to output ports – crossbars, multi-ported memory, message buffering • Route messages • Network interfaces – Packetize data – Route messages – Error check – Buffer messages – Live on the memory bus in big machines
Bus networks • Cheap • Ideal for broadcast • Shared medium limits size to dozens of nodes • Sun Enterprise servers, Intel Pentium systems • Caching reduces demand for bus bandwidth
To cache or not to cache
Crossbar switch • Performance scales, but cost doesn't – the number of switching elements grows as p²
Multistage networks – a compromise between cost and performance
Omega network • log p stages, p = number of processing nodes • number of inputs = number of outputs = p • The perfect shuffle maps input i to output j: j = 2i for 0 ≤ i ≤ p/2 − 1, and j = 2i + 1 − p for p/2 ≤ i ≤ p − 1 (equivalently, a one-bit left circular rotation of the binary representation of i)
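The two-branch formula and the bit-rotation view of the perfect shuffle are the same mapping; a short sketch can check that (function names are mine):

```python
# Perfect-shuffle sketch: for p = 2^k inputs, input i connects to output
# j = 2i (i < p/2) or j = 2i + 1 - p (i >= p/2); equivalently j is a one-bit
# left rotation of i's k-bit binary representation.

def shuffle(i, p):
    return 2 * i if i < p // 2 else 2 * i + 1 - p

def rotate_left(i, p):
    k = p.bit_length() - 1                        # number of address bits
    return ((i << 1) | (i >> (k - 1))) & (p - 1)  # rotate within k bits

p = 8
assert all(shuffle(i, p) == rotate_left(i, p) for i in range(p))
```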
The perfect shuffle picture
Omega Network • Perfect shuffle feeds p/2 switches at each stage • Each switch has 2 connection modes – pass-through – cross-over • p/2 × log p switching nodes in total
Omega Network
Omega Network – blocking • The path from processor 2 to memory bank 7 blocks paths for other processors
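Blocking can be reproduced with a small destination-tag routing simulator: at each stage, shuffle the current position, then set its low bit to the next destination bit (MSB first). Two messages block when they occupy the same switch position at the same stage. The code and the 6→4 counterexample are my own illustration of the 2→7 blocking mentioned above.

```python
# Omega-network routing sketch (destination-tag routing).

def route(src, dst, p):
    k = p.bit_length() - 1                        # number of stages = log2 p
    pos, path = src, []
    for stage in range(k):
        pos = (2 * pos) % p + (pos >= p // 2)     # perfect shuffle
        bit = (dst >> (k - 1 - stage)) & 1        # next destination bit
        pos = (pos & ~1) | bit                    # pass-through or cross-over
        path.append(pos)
    return path

# 2 -> 7 and 6 -> 4 occupy the same position after stage 0: one must wait.
assert route(2, 7, 8)[-1] == 7                    # route reaches destination
assert route(2, 7, 8)[0] == route(6, 4, 8)[0]     # stage-0 conflict
```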
Completely Connected and Star Networks • Idealized extremes – hardly realistic at scale
Real networks
How to build a hypercube • Connect two (d−1)-dimensional hypercubes to get a d-dimensional hypercube
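That recursive construction can be sketched directly: copy the (d−1)-cube twice, set the new high bit in one copy, and join corresponding nodes. It follows that nodes are d-bit labels and edges join labels differing in exactly one bit.

```python
# Hypercube construction sketch: recursive doubling.

def hypercube_edges(d):
    if d == 0:
        return set()
    half = hypercube_edges(d - 1)
    shift = 1 << (d - 1)
    edges = set(half)                                   # first copy
    edges |= {(u | shift, v | shift) for u, v in half}  # second copy
    edges |= {(u, u | shift) for u in range(shift)}     # join the copies
    return edges

d = 4
edges = hypercube_edges(d)
assert len(edges) == d * 2 ** (d - 1)                   # d * 2^(d-1) links
assert all(bin(u ^ v).count("1") == 1 for u, v in edges)
```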
Tree Networks • Static – processors at the nodes • Dynamic – switches at internal nodes, processors at the leaves – messages go up and down the tree
Fat tree • More switch capacity toward the root, less traffic congestion
Evaluation of static networks • Diameter – max distance between two processing nodes » distance is the shortest path • Connectivity – how many paths between nodes? » arc connectivity is the size of the minimum cut-set • Bisection width – number of links to remove to split the network into two equal halves • Channel width – number of wires in a link • Channel rate – peak bit rate of a single wire • Channel bandwidth – peak rate of a link • Cost – number of links or number of wires
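The diameter metric is easy to compute mechanically: BFS from every node, take the largest shortest-path distance. A sketch comparing a ring (diameter ⌊p/2⌋) with a hypercube (diameter d), using assumed adjacency-dict representations:

```python
# Static-network metric sketch: diameter = max over node pairs of the
# shortest-path distance, computed by BFS from every node.

from collections import deque

def diameter(nodes, adj):
    best = 0
    for s in nodes:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

p = 8
ring = {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}
cube = {i: [i ^ (1 << b) for b in range(3)] for i in range(8)}
assert diameter(range(p), ring) == p // 2   # ring: floor(p/2)
assert diameter(range(8), cube) == 3        # d-cube: d
```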
Static networks
Dynamic Networks • Treat switches as nodes, same definitions for diameter and connectivity • Bisection width-nodes only
Cache coherence in multiprocessor systems • How to keep caches and memory consistent • Update – change the variable in your cache, update memory and remote caches (not cheap) • Invalidate – change the variable, invalidate other copies (cheap if the other processors never reuse the variable)
Invalidate vs. Update
Tradeoffs and false sharing • Communication overhead of updates vs. idle time of invalidates • False sharing – different processors update different parts of the same cache line • With invalidate, the writing processor takes the entire line • Other processors then have to re-fetch their parts from the remote processor
How the invalidate protocol works • Initially x is in global memory (shared) • P0 does a store, marks the other copies invalid, and marks its own copy dirty • If P1 then attempts to read the variable, P0 first writes it back (the line becomes shared again)
Snoopy Caches – Bus-based – Each processor keeps tags (state) for its cached data items – Listens on the bus for reads and writes to its dirty items and takes over the transaction
Snoopy • The good – local modifications generate no bus traffic • The bad – doesn't scale to lots of processors • Summary – bus-based
Directory based • A bitmap per memory block indicates the state of the data and which processors cache it
Directory • A centralized directory can be a bottleneck • A distributed directory scales better • Concurrent updates become possible, but • Invalidate/update messages must still go to the other processors
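A minimal sketch of the bitmap idea: per block, the directory tracks a state and a presence bitmap of caching processors, and a write returns the list of processors that must receive invalidations. This is a deliberately simplified model (no transient states, no real protocol's message traffic); all names are my own.

```python
# Directory-based coherence sketch: state + presence bitmap per block.

class Directory:
    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.state = {}     # block -> "shared" | "dirty"
        self.sharers = {}   # block -> presence bitmap of caching processors

    def read(self, proc, block):
        self.state[block] = "shared"
        self.sharers[block] = self.sharers.get(block, 0) | (1 << proc)

    def write(self, proc, block):
        stale = self.sharers.get(block, 0) & ~(1 << proc)
        self.state[block] = "dirty"
        self.sharers[block] = 1 << proc       # writer becomes sole owner
        return [p for p in range(self.nprocs) if stale >> p & 1]

d = Directory(4)
d.read(0, "x")
d.read(2, "x")
assert d.write(1, "x") == [0, 2]              # invalidations sent to P0, P2
assert d.state["x"] == "dirty"
```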
Communication Costs • Start-up time (t_s) • Per-hop time (t_h) – latency; time for the first bit to arrive at the next node • Per-word transfer time (t_w): t_w = 1/r, where r = bandwidth
Routing • Store-and-forward routing • Packet routing
Cut-through design goals • All packets take the same path • Error information at the message level, not the packet level • Low-cost error detection methods
Cut-through design • Flits – flow-control digits • Very small: 4 bits to 32 bytes • Send a tracer flit to establish the route • All flits follow thereafter • Passed through intermediate nodes, not stored
Deadlock among the flits
Simple cost model for sending messages • Model: t_comm = t_s + l·t_h + t_w·m, where l = number of links traversed and m = message size in words • Should » Communicate in bulk » Minimize the amount of data sent » Minimize the distance of transfer • Can realistically do only the first two
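The "communicate in bulk" advice falls straight out of the model: with any realistic start-up cost, one large message beats many small ones. A sketch with illustrative parameter values (the numbers are assumptions, not measurements):

```python
# Cost-model sketch: t_comm = t_s + l*t_h + t_w*m.

def t_comm(m, l=1, t_s=100.0, t_h=1.0, t_w=0.1):
    """Time to send an m-word message over l links (times in microseconds)."""
    return t_s + l * t_h + t_w * m

bulk = t_comm(1000)          # one 1000-word message
small = 1000 * t_comm(1)     # a thousand 1-word messages: pays t_s 1000 times
assert bulk < small
```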
Cost model • Hard to minimize distance of transfer » Programmer has little control over process mapping » Random (two-step) routing: first send to a randomly selected node, then route to the destination – minimizes congestion » t_h is dominated by t_s and t_w • New model – ignore latency • Don't think about topology when designing algorithms • Unless there is congestion
Why it is hard to model communication costs in shared address machines • The programmer doesn't control memory placement – the system does • Cache thrashing • Invalidate and update overheads are hard to model • Spatial locality is hard to model • Prefetching (done by the compiler) • False sharing • Contention
Solution • Back-of-envelope model – Assume a remote access results in the word being fetched into the local cache – Assume all coherence, network, and memory overheads are included in t_s (the access cost) – The per-word access cost t_w includes the cost of remote (vs. local) access • Cost of sharing m words = t_s + t_w·m – the same model as for message passing
Routing Algorithms • Deterministic algorithms • Adaptive algorithms • Dimension-ordered routing – next hop determined by dimension – Mesh: XY routing – Hypercube: E-cube routing
E-cube Routing • Compute P_s XOR P_d – the number of ones = the minimum distance • Send the message along dimension k = position of the least significant 1 bit in P_s XOR P_d • Each intermediate node P_i computes P_i XOR P_d and does the same
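The steps above can be sketched directly: repeatedly XOR with the destination, flip the lowest differing bit, and stop when they match; the hop count equals the Hamming distance.

```python
# E-cube routing sketch: route by clearing the least significant differing bit.

def ecube_path(src, dst):
    path, node = [src], src
    while node != dst:
        diff = node ^ dst
        k = diff & -diff             # lowest set bit -> dimension to traverse
        node ^= k                    # flip that bit: move along dimension k
        path.append(node)
    return path

path = ecube_path(0b010, 0b111)      # 2 -> 7 in a 3-cube
assert len(path) - 1 == bin(0b010 ^ 0b111).count("1")  # hops = Hamming distance
```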
E Cube Routing
Mapping Techniques for Graphs • Why? Because you might have to port a program from one machine to another • Need to know how the algorithm's communication patterns will be affected – some mappings might produce congestion
Embed a Linear Array into a Hypercube • Linear array of 2^d nodes • d-dimensional hypercube • Map node i of the linear array onto node G(i, d) of the hypercube • G is the binary reflected Gray code
Mapping a linear array into a hypercube • Get the Gray code of dimension d+1 from the Gray code of dimension d by reflecting the sequence and prefixing (0 = original half, 1 = reflected half)
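The reflect-and-prefix construction can be sketched in a few lines; consecutive codes differ in exactly one bit, which is why embedding a linear array via G(i, d) maps array neighbors onto hypercube neighbors.

```python
# Binary reflected Gray code sketch: G(d+1) = 0·G(d) followed by 1·reverse(G(d)).

def gray_code(d):
    codes = [0]
    for bit in range(d):
        codes += [c | (1 << bit) for c in reversed(codes)]  # reflect + prefix
    return codes

g = gray_code(3)
assert g == [0, 1, 3, 2, 6, 7, 5, 4]
# Consecutive entries differ in exactly one bit:
assert all(bin(a ^ b).count("1") == 1 for a, b in zip(g, g[1:]))
```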