Transcript Berkeley NOW
Networks: Switch Design
Switch Design
[Figure: switch organization — Input Ports → Receiver → Input Buffer → Cross-bar → Output Buffer → Transmitter → Output Ports, with Control logic for Routing and Scheduling]
How do you build a crossbar
[Figure: two crossbar implementations connecting inputs I0–I3 to outputs O0–O3 — one select-based, one RAM-based with phase and address driving the RAM's Din/Dout]
Input buffered switch
[Figure: input-buffered switch — routing logic R0–R3 per input port feeding a cross-bar, with scheduling logic, to the output ports]
- Independent routing logic per input (FSM)
- Scheduler logic arbitrates each output: priority, FIFO, random
- Head-of-line blocking problem
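Head-of-line blocking can be made concrete with a small simulation (not from the slides; a hypothetical sketch): each input keeps one FIFO, and a packet stuck waiting for a contested output blocks packets behind it even when their outputs are idle.

```python
# Minimal sketch of head-of-line blocking in an input-buffered switch
# with one FIFO per input port. Queues hold output-port numbers.
from collections import deque

def simulate(queues):
    """Each cycle, every output accepts at most one head-of-queue packet.
    Returns the number of cycles needed to drain all input FIFOs."""
    queues = [deque(q) for q in queues]
    cycles = 0
    while any(queues):
        granted = set()  # outputs already claimed this cycle
        for q in queues:
            if q and q[0] not in granted:
                granted.add(q.popleft())  # head packet goes to its output
            # else: the head blocks everything behind it (HOL blocking)
        cycles += 1
    return cycles

# Input 0 holds packets for outputs [1, 2]; input 1 holds [1, 3].
# The packet for output 3 waits behind the one contending for output 1,
# even though output 3 is idle the whole first cycle.
print(simulate([[1, 2], [1, 3]]))  # 3 cycles instead of the ideal 2
```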
Switches to avoid head-of-line blocking
[Figure: output-buffered switch — each input's routing logic R0–R3 fans out to per-output buffers at every output port, under common control]
- Additional cost: switch cycle time, routing delay
- How would you build a shared pool?
Example: IBM SP Vulcan switch
[Figure: Vulcan switch organization — eight input ports (8-bit links with flow control, input FIFO, CRC check, route control) feed both an 8x8 crossbar and a central queue (64x128 RAM with 64-bit paths, input and output arbiters); eight output ports (output FIFO, crossbar arbitration, CRC generation, flow control)]
Many gigabit Ethernet switches use a similar design without the cut-through.
Output scheduling
[Figure: input buffers R0–R3 connected through a cross-bar to output ports O0–O2]
- n independent arbitration problems?
  - static priority, random, round-robin
- Simplifications due to routing algorithm?
  - dimension order routing
  - adaptive routing: general case is max bipartite matching
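Treating output scheduling as n independent arbitration problems, a per-output rotating-priority (round-robin) arbiter might look like the following sketch (illustrative only; the function name and interface are made up):

```python
def round_robin(request_bits, last_grant):
    """One output port's arbiter: request_bits has one boolean per input
    port; grant the first requester after last_grant, wrapping around,
    so the most recently served input gets lowest priority."""
    n = len(request_bits)
    for offset in range(1, n + 1):
        i = (last_grant + offset) % n
        if request_bits[i]:
            return i
    return None  # no input wants this output this cycle

# Inputs 0 and 2 both request the same output; rotating priority alternates.
print(round_robin([True, False, True, False], last_grant=0))  # grants 2
print(round_robin([True, False, True, False], last_grant=2))  # grants 0
```

Running one such arbiter per output each cycle gives a simple (if not conflict-optimal) approximation to the general bipartite-matching problem.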
Stacked Dimension Switches
Dimension order on 3D cube?
[Figure: stacked dimension switches — three 2x2 switches chained between Host In and Host Out, one per dimension, with Zin/Zout, Yin/Yout, and Xin/Xout ports]
Flow Control
What do you do when push comes to shove?
- Ethernet: collision detection and retry after delay
- FDDI, token ring: arbitration token
- TCP/WAN: buffer, drop, adjust rate
- Any solution must adjust to the output rate
- Link-level flow control
[Figure: Ready/Data handshake between sender and receiver]
Examples
- Short links: F/E (full/empty) signaling with Req and Ready/Ack handshake
- Long links: several flits on the wire
Smoothing the flow
[Figure: flow-control buffer — incoming phits fill a queue with Full, Stop (high mark), Go (low mark), and Empty levels; flow-control symbols travel back against the outgoing phits]
- How much slack do you need to maximize bandwidth?
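As a rough back-of-the-envelope answer to the slack question (an assumption-laden sketch, not from the slides): after Stop is asserted at the high mark, the sender keeps transmitting for one link round trip, so the buffer needs at least that much headroom above the mark.

```python
# Hypothetical calculation of minimum buffer slack for stop/go flow
# control. Assumption: the Stop symbol takes link_delay_cycles to reach
# the sender, and phits already launched take another link_delay_cycles
# to arrive, so one full round trip of traffic must be absorbed.
def min_slack_phits(link_delay_cycles, phits_per_cycle=1):
    round_trip = 2 * link_delay_cycles
    return round_trip * phits_per_cycle

# A 5-cycle link needs at least 10 phits of headroom above the high mark
# (and symmetrically below the low mark, so Go arrives before the buffer
# runs empty and bandwidth is lost to bubbles).
print(min_slack_phits(5))  # 10
```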
Link vs End-to-End flow control
- Hot spots: back pressure
  - all the buffers in the tree from the hot spot to the sources are full
- Global communication operations
  - simple back pressure works with completely balanced communication patterns
  - simple end-to-end protocols in the global communication have been shown to mitigate this problem: a node may wait after sending a certain amount of data until it has also received this amount, or it may wait for chunks of its data to be acknowledged
- Admission control
  - NI-to-NI credit-based flow control: keep the packet within the source NI rather than blocking traffic within the network
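The NI-to-NI credit scheme in the admission-control bullet can be sketched as follows (hypothetical class and names; only the idea of holding packets at the source NI until credits arrive is from the slide):

```python
# Sketch of NI-to-NI credit-based admission control: the source NI holds
# packets locally until the destination NI has buffer space, so blocked
# traffic waits at the source instead of clogging switches in the network.
from collections import deque

class SourceNI:
    def __init__(self, credits):
        self.credits = credits      # free buffer slots at destination NI
        self.pending = deque()      # packets kept within the source NI
        self.injected = []          # packets actually put on the network

    def send(self, pkt):
        self.pending.append(pkt)
        self._drain()

    def credit_return(self):        # destination NI freed one buffer slot
        self.credits += 1
        self._drain()

    def _drain(self):
        while self.pending and self.credits > 0:
            self.credits -= 1
            self.injected.append(self.pending.popleft())

ni = SourceNI(credits=2)
for p in ["a", "b", "c"]:
    ni.send(p)
print(ni.injected, list(ni.pending))  # ['a', 'b'] ['c']
ni.credit_return()
print(ni.injected)                    # ['a', 'b', 'c']
```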
Example: T3D
[Figure: T3D packet formats — Read Req (no cache / cache / prefetch / fetch&inc): route tag, dest PE, command, addr 0, addr 1, src PE. Read Resp: route tag, dest PE, command, word 0; Read Resp (cached): words 0–3. Write Req (proc / BLT 1 / fetch&inc): route tag, dest PE, command, addr 0, addr 1, src PE, word 0; Write Req (proc 4 / BLT 4): words 0–3. Write Resp: route tag, dest PE, command. BLT Read Req: route tag, dest PE, command, addr 0, addr 1, src PE, addr 0, addr 1. Packet type field encodes req/resp and command]
- 3D bidirectional torus, dimension order routing (NIC selected), virtual cut-through, packet switched
- 16-bit x 150 MHz links; short, wide, synchronous
- Rotating priority per output
- Logically separate request/response (two VCs each)
- 3 independent, stacked switches
- 8 16-bit flit buffers on each of 4 VCs in each direction
Example: SP
[Figure: multi-rack configuration — a 16-node rack's switch board with intra-rack host ports P0–P15 and inter-rack external switch ports E0–E15]
- 8-port switch, 40 MB/s per link, 8-bit phit, 16-bit flit, single 40 MHz clock
- Packet switched, cut-through, no virtual channels, source-based routing
- Variable packets <= 255 bytes; 31-byte FIFO per input, 7 bytes per output
- 128 8-byte "chunks" in central queue, LRU per output
- Run in shadow mode
Summary
- Routing algorithms restrict the set of routes within the topology
  - simple mechanism selects turn at each hop: arithmetic, selection, lookup
- Deadlock-free if channel dependence graph is acyclic
  - limit turns to eliminate dependences
  - add separate channel resources to break dependences
  - combination of topology, algorithm, and switch design
- Deterministic vs adaptive routing
- Switch design issues: input/output/pooled buffering, routing logic, selection logic
- Flow control
- Real networks are a "package" of design choices
Cache Coherence in Scalable Machines
Context for Scalable Cache Coherence
- Realizing programming models through net transaction protocols
  - efficient node-to-net interface
  - interprets transactions
- Scalable networks: many simultaneous transactions
- Scalable distributed memory
- Caches naturally replicate data
  - coherence through bus snooping protocols
  - consistency
- Need cache coherence protocols that scale!
  - no broadcast or single point of order
Generic Solution: Directories
[Figure: two nodes (processor P1, cache, memory with directory, comm. assist) on a scalable interconnection network]
- Maintain state vector explicitly
  - associated with each memory block
  - records state of block in each cache
- On a miss, communicate with directory
  - determine location of cached copies
  - determine action to take
  - conduct protocol to maintain coherence
A Cache Coherent System Must:
- Provide a set of states, a state transition diagram, and actions
- Manage coherence protocol
  - (0) determine when to invoke coherence protocol
  - (a) find info about state of block in other caches to determine action: whether need to communicate with other cached copies
  - (b) locate the other copies
  - (c) communicate with those copies (invalidate/update)
- (0) is done the same way on all systems
  - state of the line is maintained in the cache
  - protocol is invoked if an "access fault" occurs on the line
- Different approaches distinguished by (a) to (c)
Bus-based Coherence
- All of (a), (b), (c) done through broadcast on bus
  - faulting processor sends out a "search"
  - others respond to the search probe and take necessary action
- Could do it in a scalable network too
  - broadcast to all processors, and let them respond
- Conceptually simple, but broadcast doesn't scale with p
  - on a bus, bus bandwidth doesn't scale
  - on a scalable network, every fault leads to at least p network transactions
- Scalable coherence: can have the same cache states and state transition diagram, with different mechanisms to manage the protocol
One Approach: Hierarchical Snooping
- Extend snooping approach: hierarchy of broadcast media
  - tree of buses or rings (KSR-1)
  - processors are in the bus- or ring-based multiprocessors at the leaves
  - parents and children connected by two-way snoopy interfaces: snoop both buses and propagate relevant transactions
  - main memory may be centralized at root or distributed among leaves
- Issues (a)–(c) handled similarly to bus, but not full broadcast
  - faulting processor sends out "search" bus transaction on its bus
  - propagates up and down hierarchy based on snoop results
- Problems:
  - high latency: multiple levels, and snoop/lookup at every level
  - bandwidth bottleneck at root
- Not popular today
Scalable Approach: Directories
- Every memory block has associated directory information
  - keeps track of copies of cached blocks and their states
  - on a miss, find directory entry, look it up, and communicate only with the nodes that have copies, if necessary
  - in scalable networks, communication with directory and copies is through network transactions
- Many alternatives for organizing directory information
Basic Operation of Directory
[Figure: k processors with caches on an interconnection network; memory holds a directory entry per block with presence bits and a dirty bit]
- k processors
- With each cache block in memory: k presence bits, 1 dirty bit
- With each cache block in cache: 1 valid bit, 1 dirty (owner) bit
- Read from main memory by processor i:
  - if dirty-bit OFF then { read from main memory; turn p[i] ON; }
  - if dirty-bit ON then { recall line from dirty proc (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i; }
- Write to main memory by processor i:
  - if dirty-bit OFF then { supply data to i; send invalidations to all caches that have the block; turn dirty-bit ON; turn p[i] ON; ... }
  - ...
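The read/write actions above can be expressed as a small state machine per directory entry (a simplified sketch with made-up names: one block, no transient states, and the dirty-write case omitted just as the slide's "..." omits it):

```python
# Hypothetical directory entry: k presence bits plus one dirty bit,
# following the read/write actions listed on the slide.
class DirEntry:
    def __init__(self, k):
        self.presence = [False] * k
        self.dirty = False

    def read(self, i):
        """Read by processor i; returns the protocol actions taken."""
        actions = []
        if self.dirty:
            owner = self.presence.index(True)
            actions += [f"recall from P{owner} (owner -> shared)",
                        "update memory"]
            self.dirty = False          # turn dirty-bit OFF
        actions.append(f"supply data to P{i}")
        self.presence[i] = True         # turn p[i] ON
        return actions

    def write(self, i):
        """Write by processor i (dirty-bit OFF case only, as on slide)."""
        actions = [f"invalidate P{j}"
                   for j, p in enumerate(self.presence) if p and j != i]
        self.presence = [False] * len(self.presence)
        self.presence[i] = True         # writer becomes sole owner
        self.dirty = True               # turn dirty-bit ON
        actions.append(f"supply data to P{i}")
        return actions

d = DirEntry(k=4)
d.read(0); d.read(2)
print(d.write(1))  # ['invalidate P0', 'invalidate P2', 'supply data to P1']
```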
Basic Directory Transactions
[Figure (a): read miss to a block in dirty state — 1. read request from requestor to directory node; 2. reply with owner identity; 3. read request to owner; 4a. data reply from owner to requestor; 4b. revision message from owner to directory]
[Figure (b): write miss to a block with two sharers — 1. RdEx request from requestor to directory node; 2. reply with sharers' identity; 3a/3b. invalidation requests to each sharer; 4a/4b. invalidation acks]
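The three-party exchange in figure (a) can be written out as a list of (source, destination, message) hops (purely illustrative; node names are made up):

```python
# Message sequence for a read miss to a dirty block, per figure (a):
# requestor asks the home (directory) node, learns the owner, fetches
# the data from the owner, and the owner revises the directory.
def read_miss_dirty(requestor, directory, owner):
    return [
        (requestor, directory, "1. read request to directory"),
        (directory, requestor, "2. reply with owner identity"),
        (requestor, owner,     "3. read request to owner"),
        (owner, requestor,     "4a. data reply"),
        (owner, directory,     "4b. revision message to directory"),
    ]

for src, dst, msg in read_miss_dirty("P0", "Home", "P7"):
    print(f"{src} -> {dst}: {msg}")
```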
A Popular Middle Ground
- Two-level "hierarchy"
- Individual nodes are multiprocessors, connected non-hierarchically
  - e.g. mesh of SMPs
- Coherence across nodes is directory-based
  - directory keeps track of nodes, not individual processors
- Coherence within nodes is snooping or directory
  - orthogonal, but needs a good interface of functionality
- Examples:
  - Convex Exemplar: directory-directory
  - Sequent, Data General, HAL: directory-snoopy
- SMP on a chip?
Example Two-level Hierarchies
[Figure: four two-level organizations — (a) snooping-snooping: bus-based (B1) nodes joined by snooping adapters on a second bus B2; (b) snooping-directory: bus-based nodes joined by dir/snoopy adapters on a network; (c) directory-directory: network-based nodes joined by directory adapters on a second network; (d) directory-snooping: network-based nodes joined by dir/snoopy adapters on a bus or ring]
Advantages of Multiprocessor Nodes
- Potential for cost and performance advantages
  - can use commodity SMPs
  - fewer nodes for directory to keep track of
  - much communication may be contained within a node (cheaper)
  - nodes prefetch data for each other (fewer "remote" misses)
  - combining of requests (like hierarchical, only two-level)
  - can even share caches (overlapping of working sets)
- Benefits depend on sharing pattern (and mapping)
  - good for widely read-shared data: e.g. tree data in Barnes-Hut
  - good for nearest-neighbor, if properly mapped
  - not so good for all-to-all communication
Disadvantages of Coherent MP Nodes
- Bandwidth is shared among nodes
  - all-to-all example
  - applies whether coherent or not
- Bus increases latency to local memory
- With coherence, typically wait for local snoop results before sending remote requests
- Snoopy bus at remote node increases delays there too, increasing latency and reducing bandwidth
- May hurt performance if sharing patterns don't comply