Berkeley NOW


Networks: Switch Design

Switch Design

[Figure: switch datapath — input ports feed receivers and input buffers into a crossbar; output buffers and transmitters drive the output ports; control logic handles routing and scheduling. Inputs labeled I0-I3.]

How do you build a crossbar?

[Figure: two crossbar implementations — a multiplexor-based crossbar connecting inputs I0-I3 to outputs O0-O3, and a time-multiplexed RAM-based version (phase-addressed RAM with Din/Dout) serving the same ports.]
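
As a rough illustration (not from the slides), each output of a crossbar can be viewed as a multiplexor over all inputs, steered by a select that the control logic computes; the function and port names below are made up:

```python
# Rough illustration: the multiplexor view of a crossbar. Each output is
# a mux over all inputs, steered by a per-output select computed by the
# control/scheduling logic (here simply given as a mapping).
def crossbar(inputs, selects):
    """selects[o] = index of the input driving output o, or None if idle."""
    return [inputs[s] if s is not None else None for s in selects]

# Four inputs; outputs 0..3 driven by inputs 2, 0, (idle), 1.
print(crossbar(["a", "b", "c", "d"], [2, 0, None, 1]))  # ['c', 'a', None, 'b']
```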

Input buffered switch

[Figure: input-buffered switch — input ports with routing logic R0-R3 feed a crossbar; scheduling logic arbitrates access to the output ports.]

- Independent routing logic per input (FSM)
- Scheduler logic arbitrates each output: priority, FIFO, random
- Head-of-line blocking problem (see the sketch below)
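
To make the head-of-line blocking problem concrete, here is a minimal simulation sketch (the traffic pattern and names are invented for illustration): each input offers only the packet at the head of its FIFO, so a packet for an idle output can sit trapped behind one that loses arbitration.

```python
# Minimal sketch of head-of-line blocking: one FIFO per input, each input
# offers only its head packet, each output grants one requester per cycle
# (lowest input number wins).
from collections import deque

def simulate(inputs, cycles):
    queues = [deque(pkts) for pkts in inputs]    # one FIFO per input port
    for t in range(cycles):
        requests = {}
        for i, q in enumerate(queues):
            if q:
                requests.setdefault(q[0], []).append(i)   # head packet only
        for out, reqs in requests.items():
            winner = min(reqs)                   # fixed-priority arbitration
            queues[winner].popleft()
            print(f"cycle {t}: input {winner} -> output {out}")

# Inputs 0 and 1 both have a packet for output 0 at the head; input 1's
# packet for output 1 is stuck behind it even though output 1 sits idle.
simulate(inputs=[[0, 2], [0, 1]], cycles=3)
```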

Switches to avoid head-of-line blocking

[Figure: switch with per-input banks of output queues (R0-R3), one queue per output port, plus central control.]

- Additional cost: switch cycle time, routing delay
- How would you build a shared pool? (see the sketch below)
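
A hedged sketch of the per-output-queue organization above (often called virtual output queues; everything here is illustrative, mirroring the previous example): with one FIFO per (input, output) pair, a blocked head no longer hides packets bound for idle outputs, and the same traffic drains in two cycles instead of three.

```python
# Sketch of virtual output queues: one FIFO per (input, output) pair.
from collections import deque

def simulate_voq(inputs, n_outputs, cycles):
    voqs = [[deque() for _ in range(n_outputs)] for _ in inputs]
    for i, pkts in enumerate(inputs):
        for out in pkts:
            voqs[i][out].append(out)
    for t in range(cycles):
        busy = set()                    # an input feeds one output per cycle
        for out in range(n_outputs):
            for i, qs in enumerate(voqs):
                if qs[out] and i not in busy:
                    qs[out].popleft()
                    busy.add(i)
                    print(f"cycle {t}: input {i} -> output {out}")
                    break

simulate_voq(inputs=[[0, 2], [0, 1]], n_outputs=4, cycles=2)
```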

Example: IBM SP Vulcan switch

[Figure: Vulcan switch organization — eight 8-bit input ports, each with link-level flow control, an input FIFO, CRC check, and route control; a central queue (64x128 RAM with input/output arbitration) plus an 8x8 crossbar; eight output ports, each with a FIFO, crossbar arbitration, CRC generation, and flow control.]

- Many gigabit Ethernet switches use a similar design, without the cut-through

Output scheduling

[Figure: output scheduling — input buffers R0-R3 hold packets destined for outputs O0-O2 and feed a crossbar to the output ports.]

- n independent arbitration problems?
  - static priority, random, round-robin
- Simplifications due to the routing algorithm?
  - dimension-order routing
- Adaptive routing: the general case is maximum bipartite matching (see the sketch below)
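
As a sketch of the "general case" claim, the scheduler below computes a maximum bipartite matching between inputs and the outputs they request, via simple augmenting paths; the request pattern is invented to show a case where fixed-priority greedy arbitration grants only one packet while matching grants two.

```python
# Sketch: schedule the crossbar as maximum bipartite matching between
# inputs and requested outputs, using augmenting-path search.
def max_matching(requests, n_outputs):
    """requests[i] = list of outputs that input i wants this cycle."""
    match_out = [-1] * n_outputs              # match_out[o] = input granted o

    def try_assign(i, seen):
        for o in requests[i]:
            if o not in seen:
                seen.add(o)
                # Grant o to i if free, else try to re-route its holder.
                if match_out[o] == -1 or try_assign(match_out[o], seen):
                    match_out[o] = i
                    return True
        return False

    for i in range(len(requests)):
        try_assign(i, set())
    return {o: i for o, i in enumerate(match_out) if i != -1}

# Input 0 wants output 0 or 1; input 1 wants only output 0. Greedy gives
# output 0 to input 0 and starves input 1; matching serves both.
print(max_matching([[0, 1], [0]], n_outputs=2))   # {0: 1, 1: 0}
```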

Stacked Dimension Switches

- Dimension order on a 3D cube?

[Figure: stacked dimension switches — a cascade of three 2x2 switches, one per dimension: Host In and Zin feed the first 2x2 (toward Zout), Yin feeds the second (toward Yout), and Xin feeds the third (toward Xout and Host Out).]
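
A small illustrative routine (a mesh with no wraparound is assumed; the slide does not fix the dimension order, and the figure stacks Z, Y, X while X-then-Y-then-Z is used here): dimension-order routing corrects one coordinate at a time, so a packet only ever turns from one dimension's switch into the next, which is why a stack of small per-dimension switches suffices.

```python
# Illustrative dimension-order routing on a 3D mesh (no wraparound).
def dimension_order_route(src, dst):
    """Yield (dimension, step, new_position) hops from src to dst."""
    pos = list(src)
    for dim in range(3):                  # X=0, Y=1, Z=2, in order
        while pos[dim] != dst[dim]:
            step = 1 if dst[dim] > pos[dim] else -1
            pos[dim] += step
            yield ("XYZ"[dim], step, tuple(pos))

for hop in dimension_order_route(src=(0, 0, 0), dst=(2, 1, 1)):
    print(hop)    # two +X hops, then one +Y hop, then one +Z hop
```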

Flow Control

- What do you do when push comes to shove?
  - Ethernet: collision detection and retry after delay
  - FDDI, token ring: arbitration token
  - TCP/WAN: buffer, drop, adjust rate
  - any solution must adjust to the output rate
- Link-level flow control (see the sketch below)

[Figure: link-level flow control — Data flows forward across the link; a Ready signal flows back.]
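
A minimal sketch of the Ready/Data idea (buffer size and rates are made-up illustration values): the receiver deasserts Ready once its buffer fills, the sender stalls instead of dropping, and throughput settles at the receiver's rate.

```python
# Minimal link-level flow control: Ready goes low when the buffer fills.
from collections import deque

class Link:
    def __init__(self, buf_size):
        self.buf = deque()
        self.buf_size = buf_size

    def ready(self):                  # the "Ready" wire back to the sender
        return len(self.buf) < self.buf_size

    def send(self, flit):
        assert self.ready(), "sender must stall while Ready is low"
        self.buf.append(flit)

    def drain(self):                  # receiver consumes at its own rate
        return self.buf.popleft() if self.buf else None

link, src = Link(buf_size=2), deque(range(5))
for cycle in range(8):
    if src and link.ready():
        link.send(src.popleft())      # one flit per cycle, when allowed
    if cycle % 2 == 1:                # receiver drains at half the link rate
        link.drain()
    print(f"cycle {cycle}: buffered={list(link.buf)}")
```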

Examples

- Short links: F/E (full/empty) bit, Ready/Ack, Req handshakes
- Long links: several flits of Data on the wire

Smoothing the flow

[Figure: flow-control buffer — incoming phits fill the buffer, outgoing phits drain it; Stop and Go flow-control symbols are issued at a high mark and a low mark between Full and Empty.]

- How much slack do you need to maximize bandwidth? (see the sketch below)
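
A back-of-the-envelope sketch for the slack question, assuming a link that takes D cycles each way (all numbers illustrative): a Stop symbol takes D cycles to reach the sender, and another D cycles' worth of phits are already in flight, so the high mark needs 2*D of headroom below Full; restarting without a bubble needs the same 2*D above Empty.

```python
# Watermark placement for Stop/Go flow control over a D-cycle link.
def watermarks(buf_size, link_delay):
    round_trip = 2 * link_delay
    assert buf_size >= 2 * round_trip, "buffer too small for full bandwidth"
    high = buf_size - round_trip   # issue Stop here: absorbs in-flight phits
    low = round_trip               # issue Go here: refill before running dry
    return high, low

print(watermarks(buf_size=16, link_delay=3))   # -> (10, 6)
```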

Link vs End-to-End flow control

- Hot spots
  - back pressure: all the buffers in the tree from the hot spot to the sources are full
- Global communication operations
  - simple back pressure: with completely balanced communication patterns, simple end-to-end protocols in the global communication have been shown to mitigate this problem
  - a node may wait after sending a certain amount of data until it has also received this amount, or it may wait for chunks of its data to be acknowledged
- Admission control
  - NI-to-NI credit-based flow control (see the sketch below)
  - keep the packet within the source NI rather than blocking traffic within the network
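
A hedged sketch of the NI-to-NI credit idea (class and field names are hypothetical): a packet waits in the source NI until the destination NI has granted a credit, so a hot spot backs up the source's own queue rather than filling buffers inside the network.

```python
# End-to-end (NI-to-NI) credit-based admission control.
from collections import deque

class SourceNI:
    def __init__(self, credits):
        self.credits = credits        # granted by the destination NI
        self.pending = deque()        # packets held at the source

    def submit(self, pkt):
        self.pending.append(pkt)

    def inject(self):                 # called once per network cycle
        if self.pending and self.credits > 0:
            self.credits -= 1         # one credit consumed per packet
            return self.pending.popleft()
        return None                   # held at source; network stays clear

    def credit_return(self):          # destination NI freed a buffer
        self.credits += 1

ni = SourceNI(credits=2)
for p in "abcd":
    ni.submit(p)
print([ni.inject() for _ in range(3)])   # ['a', 'b', None]: out of credit
ni.credit_return()
print(ni.inject())                        # 'c', once a credit returns
```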

Example: T3D

[Figure: T3D packet formats — every packet begins with a route tag, destination PE, and command. Read Req (no-cache / cache / prefetch / fetch&inc): addr 0, addr 1, src PE. Read Resp: word 0; Read Resp (cached): words 0-3. Write Req (proc / BLT 1 / fetch&inc): addr 0, addr 1, src PE, word 0; Write Req (proc 4 / BLT 4): addr 0, addr 1, src PE, words 0-3. Write Resp: header only. BLT Read Req: addr 0, addr 1, src PE, addr 0, addr 1. A packet-type field encodes request/response and command.]

- 3D bidirectional torus, dimension-order routing (NIC selected), virtual cut-through, packet switched
- 16 bits x 150 MHz; short, wide, synchronous links
- Rotating priority per output
- Logically separate request/response (two VCs each)
- 3 independent, stacked switches
- 8 16-bit flit buffers on each of 4 VCs in each direction

Example: SP

Multi-rack Configuration Switch Board

[Figure: a 16-node rack's switch board, with intra-rack host ports P0-P15 and inter-rack external switch ports E0-E15.]

- 8-port switch, 40 MB/s per link, 8-bit phit, 16-bit flit, single 40 MHz clock
- Packet switched, cut-through, no virtual channels, source-based routing
- Variable packets <= 255 bytes; 31-byte FIFO per input, 7 bytes per output
- 128 8-byte 'chunks' in the central queue, LRU per output
- Run in shadow mode

Summary

- Routing algorithms restrict the set of routes within the topology
  - a simple mechanism selects the turn at each hop: arithmetic, selection, lookup
- Deadlock-free if the channel dependence graph is acyclic
  - limit turns to eliminate dependences
  - add separate channel resources to break dependences
  - combination of topology, algorithm, and switch design
- Deterministic vs. adaptive routing
- Switch design issues: input/output/pooled buffering, routing logic, selection logic
- Flow control
- Real networks are a 'package' of design choices

Cache Coherence in Scalable Machines

Context for Scalable Cache Coherence

[Figure: scalable multiprocessor — nodes of processor (P), cache ($), memory (M), and communication assist (CA), attached to a scalable network of switches.]

- Realizing programming models through network transaction protocols
  - efficient node-to-network interface
  - interprets transactions
- Scalable networks: many simultaneous transactions
- Scalable distributed memory
- Caches naturally replicate data
  - coherence through bus snooping protocols
  - consistency
- Need cache coherence protocols that scale!
  - no broadcast or single point of order

Generic Solution: Directories

[Figure: directory-based machine — each node holds a processor P1, cache, directory + memory, and a communication assist, connected by a scalable interconnection network.]

- Maintain the state vector explicitly
  - associate it with the memory block
  - it records the state of the block in each cache
- On a miss, communicate with the directory
  - determine the location of cached copies
  - determine the action to take
  - conduct the protocol to maintain coherence

A Cache Coherent System Must:

- Provide a set of states, a state transition diagram, and actions
- Manage the coherence protocol
  - (0) determine when to invoke the coherence protocol
  - (a) find info about the state of the block in other caches, to determine the action
    - whether it needs to communicate with other cached copies
  - (b) locate the other copies
  - (c) communicate with those copies (invalidate/update)
- (0) is done the same way on all systems
  - the state of the line is maintained in the cache
  - the protocol is invoked if an "access fault" occurs on the line
- Different approaches are distinguished by (a) to (c)

Bus-based Coherence

- All of (a), (b), (c) done through broadcast on the bus
  - the faulting processor sends out a "search"
  - others respond to the search probe and take the necessary action
- Could do it in a scalable network too: broadcast to all processors, and let them respond
- Conceptually simple, but broadcast doesn't scale with p
  - on a bus, bus bandwidth doesn't scale
  - on a scalable network, every fault leads to at least p network transactions
- Scalable coherence:
  - can have the same cache states and state transition diagram
  - different mechanisms to manage the protocol

One Approach: Hierarchical Snooping

- Extend the snooping approach: hierarchy of broadcast media
  - tree of buses or rings (KSR-1)
  - processors sit in the bus- or ring-based multiprocessors at the leaves
  - parents and children connected by two-way snoopy interfaces: snoop both buses and propagate relevant transactions
  - main memory may be centralized at the root or distributed among the leaves
- Issues (a)-(c) handled similarly to a bus, but not full broadcast
  - the faulting processor sends out a "search" bus transaction on its bus
  - it propagates up and down the hierarchy based on snoop results
- Problems:
  - high latency: multiple levels, and a snoop/lookup at every level
  - bandwidth bottleneck at the root
- Not popular today

Scalable Approach: Directories

- Every memory block has associated directory information
  - keeps track of copies of cached blocks and their states
  - on a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary
  - in scalable networks, communication with the directory and the copies is through network transactions
- Many alternatives for organizing directory information

Basic Operation of Directory

[Figure: basic directory — k processors with caches on an interconnection network; memory keeps, per block, a directory of presence bits and a dirty bit.]

- k processors
- With each cache block in memory: k presence bits, 1 dirty bit
- With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit
- Read from main memory by processor i:
  - if dirty-bit OFF then { read from main memory; turn p[i] ON; }
  - if dirty-bit ON then { recall line from the dirty processor (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i; }
- Write to main memory by processor i (see the sketch below):
  - if dirty-bit OFF then { supply data to i; send invalidations to all caches that have the block; turn dirty-bit ON; turn p[i] ON; ... }
  - ...
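
The read/write rules above transcribe almost directly into a runnable sketch for one directory entry; message sends are stubbed out as prints, and the slide's elided dirty-write case ("...") is filled in with one natural completion, which is an assumption rather than something the slide states.

```python
# One directory entry: k presence bits plus a dirty bit.
class DirEntry:
    def __init__(self, k):
        self.presence = [False] * k
        self.dirty = False

    def read(self, i):
        if not self.dirty:
            print(f"read block from main memory for P{i}")
        else:
            owner = self.presence.index(True)   # dirty => exactly one owner
            print(f"recall line from P{owner} (its state -> shared); update memory")
            self.dirty = False
            print(f"supply recalled data to P{i}")
        self.presence[i] = True                  # turn p[i] ON

    def write(self, i):
        if not self.dirty:
            sharers = [j for j, p in enumerate(self.presence) if p and j != i]
            print(f"supply data to P{i}; send invalidations to {sharers}")
            for j in sharers:
                self.presence[j] = False
        else:
            # Assumed completion of the slide's "...": recall from the
            # owner, invalidate it, and transfer ownership to the writer.
            owner = self.presence.index(True)
            print(f"recall line from P{owner}; invalidate it; forward data to P{i}")
            self.presence[owner] = False
        self.dirty = True                        # turn dirty-bit ON
        self.presence[i] = True                  # turn p[i] ON

d = DirEntry(k=4)
d.read(0); d.read(1)    # two shared copies
d.write(2)              # invalidates P0 and P1; P2 becomes owner
d.read(3)               # recalls from P2; block ends up shared by P2, P3
```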

Basic Directory Transactions

[Figure: basic directory transactions.
(a) Read miss to a block in dirty state: 1. the requestor sends a read request to the directory node for the block; 2. the directory replies with the owner's identity; 3. the requestor sends a read request to the owner; 4a. the owner sends a data reply to the requestor; 4b. the owner sends a revision message to the directory.
(b) Write miss to a block with two sharers: 1. the requestor sends a RdEx request to the directory; 2. the directory replies with the sharers' identity; 3a/3b. the requestor sends invalidation requests to the sharers; 4a/4b. the sharers send invalidation acks.]
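
As a rough way to read the figure, the two flows can be written out as message lists and tallied; treating the a/b-suffixed steps (3a/3b, 4a/4b) as overlapping in time is an assumption suggested by the figure's numbering.

```python
# The figure's two transaction flows as data: count total network
# messages vs. serialized hops on the critical path.
read_miss_dirty = [            # (step, src, dst)
    ("1",  "requestor", "directory"),
    ("2",  "directory", "requestor"),
    ("3",  "requestor", "owner"),
    ("4a", "owner",     "requestor"),   # data reply (critical path)
    ("4b", "owner",     "directory"),   # revision message (off critical path)
]
write_miss_two_sharers = [
    ("1",  "requestor", "directory"),
    ("2",  "directory", "requestor"),
    ("3a", "requestor", "sharer0"),
    ("3b", "requestor", "sharer1"),
    ("4a", "sharer0",   "requestor"),
    ("4b", "sharer1",   "requestor"),
]

def summarize(name, flow):
    # Steps sharing a numeric prefix (e.g. 3a/3b) overlap, so count
    # distinct prefixes as serialized hops.
    serialized = len({step.rstrip("ab") for step, _, _ in flow})
    print(f"{name}: {len(flow)} messages, {serialized} serialized hops")

summarize("read miss (dirty)", read_miss_dirty)              # 5 messages, 4 hops
summarize("write miss (2 sharers)", write_miss_two_sharers)  # 6 messages, 4 hops
```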

A Popular Middle Ground

- Two-level "hierarchy"
- Individual nodes are multiprocessors, connected non-hierarchically
  - e.g., a mesh of SMPs
- Coherence across nodes is directory-based
  - the directory keeps track of nodes, not individual processors
- Coherence within nodes is snooping or directory
  - orthogonal, but needs a good interface of functionality
- Examples:
  - Convex Exemplar: directory-directory
  - Sequent, Data General, HAL: directory-snoopy
- SMP on a chip?

Example Two-level Hierarchies

[Figure: example two-level hierarchies —
(a) snooping-snooping: bus-based nodes (P, C, main memory on bus B1) joined through snooping adapters on bus B2;
(b) snooping-directory: bus-based nodes with directory/assist joined through a network;
(c) directory-directory: directory-based nodes (P, C, A, M/D on Network1) joined through directory adapters on Network2;
(d) directory-snooping: directory-based nodes joined through directory/snoopy adapters on a bus (or ring).]

Advantages of Multiprocessor Nodes

- Potential for cost and performance advantages
  - can use commodity SMPs
  - fewer nodes for the directory to keep track of
  - much communication may be contained within a node (cheaper)
  - nodes prefetch data for each other (fewer "remote" misses)
  - combining of requests (like hierarchical, only two-level)
  - can even share caches (overlapping of working sets)
- Benefits depend on sharing pattern (and mapping)
  - good for widely read-shared data: e.g., tree data in Barnes-Hut
  - good for nearest-neighbor, if properly mapped
  - not so good for all-to-all communication

Disadvantages of Coherent MP Nodes

- Bandwidth is shared among nodes
  - the all-to-all example applies, coherent or not
- Bus increases latency to local memory
- With coherence, typically wait for local snoop results before sending remote requests
- A snoopy bus at the remote node increases delays there too, increasing latency and reducing bandwidth
- May hurt performance if sharing patterns don't comply