
Cache Coherence and Interconnection Network Design
Adapted from UC Berkeley Notes
Cache-based (Pointer-based) Schemes
• How they work (a sketch follows the figure below):
– the home node holds only a pointer to the rest of the directory info
– a distributed linked list of copies weaves through the caches
» each cache tag has a pointer to the next cache with a copy
– on a read, add yourself to the head of the list (communication needed)
– on a write, propagate a chain of invalidations down the list
• What if a link fails? => Scalable Coherent Interface (SCI) IEEE standard
– doubly linked list
– What to do on replacement?
[Figure: a distributed linked list of sharers. Main Memory (Home) points to Node 0; Nodes 0, 1, and 2 each contain a processor (P) and a Cache, with cache tags chaining to the next copy.]
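A toy sketch (Python; the class and method names are hypothetical, and the distributed messaging is collapsed into direct pointer updates) of the linked-list directory operations above: a read prepends the requester to the sharing list, and a write walks the list to invalidate every copy.

```python
# Sketch of a pointer-based (SCI-style) directory: the home node keeps only
# a head pointer; each cache entry holds the pointer to the next sharer.

class CacheEntry:
    def __init__(self, node_id):
        self.node_id = node_id
        self.valid = True
        self.next = None      # pointer to the next cache with a copy

class HomeDirectory:
    def __init__(self):
        self.head = None      # the only state kept at the home node

    def read(self, node_id):
        """Reader adds itself to the head of the sharing list."""
        entry = CacheEntry(node_id)
        entry.next = self.head
        self.head = entry
        return entry

    def write(self, writer_id):
        """Writer walks the list, invalidating every copy (one hop per sharer)."""
        entry = self.head
        while entry is not None:
            if entry.node_id != writer_id:
                entry.valid = False
            entry = entry.next
        self.head = None      # writer becomes the sole holder

# Example: three readers join the list, then a write invalidates the chain.
d = HomeDirectory()
for n in (0, 1, 2):
    d.read(n)
d.write(2)
```

Note that the write cost is one pointer-chase per sharer, which is exactly the linear write latency discussed on the next slide.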
Scaling Properties (Cache-based)
• Traffic on write: proportional to the number of sharers
• Latency on write: proportional to the number of sharers!
– don't know the identity of the next sharer until reaching the current one
– also assist processing at each node along the way
– (even reads involve more than one other assist: the home and the first sharer on the list)
• Storage overhead: quite good scaling along both axes
– only one head pointer per memory block
» the rest is all proportional to cache size
• Very complex!!!
Hierarchical Directories
[Figure: a logical tree with processing nodes at the leaves. Each level-1 directory tracks which of its children (processing nodes) have a copy of the memory block, and which local memory blocks are cached outside its subtree; inclusion is maintained between the processor caches and the directory. Each level-2 directory tracks the same for its child level-1 directories, with inclusion maintained between the level-1 directories and the level-2 directory.]
• The directory is a hierarchical data structure
– leaves are processing nodes; internal nodes hold only directory information
– a logical hierarchy, not necessarily physical
» (can be embedded in a general network)
Caching in the Internet (Server-side caching, Client-side caching, Network caching)
Network Caching – Proxy Caching
Ref: Jia Wang, "A Survey of Web Caching Schemes for the Internet," ACM Computing Surveys.
Comparison with P2P
• The directory is similar to the tracker in BitTorrent (BT) or the root in a P2P network – Plaxton's root holds one pointer while the BT tracker holds all pointers
• The real object may be somewhere else (call it the home node), as pointed to by the directory
• The shared caches are extra locations of the object, equivalent to peers
• Duplicate directories are possible – see the hierarchical directory design (MIND) later
• There are many possible directory designs – can we do a similar design in P2P?
Plaxton's Scheme
• Similar to the Distributed Shared Memory (DSM) cache protocol
• The root is the home node in DSM, where the directory is kept but not the real object; the root points to the nearest peer where a shared copy can be found
• Similar to the hierarchical directory scheme, where intermediate nodes point to the nearest object nodes (equivalent to shared copies); in the hierarchical directory scheme, however, intermediate pointers point to all shared copies
P2P Research
• How many pointers do we need at the tracker? Cost model? Does it depend on popularity? How many peers to supply to a client? Communication model? How to update the directory?
• A cache-based (linked-list) approach like SCI?
• Hierarchical directory – Plaxton's paper – introduce such a method for the BT network?
• Combine directory + tree DHTs?
• Is a dirty copy possible in P2P? If so, develop a protocol similar to cache coherence
• Sub-page coherence like DSM?
Interconnection Network Design
Adapted from UC Berkeley Notes

Scalable, High-Performance Interconnection Network
• At the core of parallel computer architecture
• Requirements and trade-offs at many levels
– elegant mathematical structure
– deep relationships to algorithm structure
– managing many traffic flows
– electrical / optical link properties
• Little consensus
– interactions across levels
– Performance metrics? Cost metrics? Workload?
=> need a holistic understanding
[Figure: a scalable interconnection network linking nodes through network interfaces; each node contains a communication assist (CA), memory (M), and processor (P).]
Goals
The job of a multiprocessor network is to transfer information from a source node to a destination node in support of the network transactions that realize the programming model.
• latency as small as possible
• as many concurrent transfers as possible
– operation bandwidth
– data bandwidth
• cost as low as possible
Formalism
• A network is a graph V = {switches and nodes} connected by communication channels C ⊆ V × V
• A channel has width w and signaling rate f = 1/τ
– channel bandwidth b = wf (see the sketch after this list)
– phit (physical unit): the data transferred per cycle
– flit: the basic unit of flow control
• The number of input (output) channels is the switch degree
• The sequence of switches and links followed by a message is its route
• Think streets and intersections
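A minimal sketch (Python, with made-up vertices and numbers) of the formalism: channels are directed pairs in V × V, and bandwidth follows b = wf.

```python
# Network-as-graph formalism: V = switches and nodes, C ⊆ V × V.
# Channel bandwidth b = w * f, where w is the width and f = 1/tau.

V = {"n0", "n1", "s0", "s1"}                     # nodes and switches
C = {("n0", "s0"), ("s0", "s1"), ("s1", "n1")}   # directed channels

w = 16            # channel width in bits (one phit per cycle)
tau = 2e-9        # cycle time in seconds
f = 1 / tau       # signaling rate
b = w * f         # channel bandwidth in bits/sec

route = [("n0", "s0"), ("s0", "s1"), ("s1", "n1")]  # a route: channel sequence
assert all(c in C for c in route)
print(f"bandwidth = {b:.2e} bits/s, route length = {len(route)} links")
```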
What characterizes a network?
• Topology (what)
– the physical interconnection structure of the network graph
– direct: a node connected to every switch
– indirect: nodes connected to a specific subset of switches
• Routing Algorithm (which)
– restricts the set of paths that messages may follow
– many algorithms with different properties
» gridlock avoidance?
• Switching Strategy (how)
– how the data in a message traverses its route
– circuit switching vs. packet switching
• Flow Control Mechanism (when)
– when a message (or portions of it) traverses its route
– what happens when traffic is encountered?
Topological Properties
• Routing distance: the number of links on a route
• Diameter: the maximum routing distance between any two nodes in the network
• Average distance: the sum of the routing distances over all node pairs divided by the number of pairs (computed in the sketch after this list)
• Degree of a node: the number of links connected to it => cost is high if the degree is high
• A network is partitioned by a set of links if their removal disconnects the graph
• Fault tolerance: the number of alternate paths between two nodes in the network
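A small sketch (Python) computing the diameter and average distance of a 4-node ring by BFS, directly from the definitions above.

```python
# Diameter and average distance for a small topology (4-node ring) via BFS.
from collections import deque

def distances(adj, src):
    """BFS shortest-path distances (in links) from src."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}  # 4-node ring, degree 2
all_d = [d for s in ring for n, d in distances(ring, s).items() if n != s]
print("diameter =", max(all_d))                      # 2 for a 4-node ring
print("average distance =", sum(all_d) / len(all_d)) # 16/12 ≈ 1.33
```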
Typical Packet Format
[Figure: a packet is a sequence of digital symbols transmitted over a channel: a header (routing and control information), a data payload, and a trailer (error code).]
• Two basic mechanisms for abstraction (a sketch follows this list)
– encapsulation
– fragmentation
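As a sketch of the format above (Python; the field names, payload size, and checksum are illustrative assumptions, not from the slides): encapsulation wraps the payload in a header and trailer, and fragmentation splits a message into such packets.

```python
# Sketch of the packet format above: header (routing/control), payload,
# and trailer (error code). Fragmentation splits a message into packets.
from dataclasses import dataclass

@dataclass
class Packet:
    dest: int            # routing info (header)
    seq: int             # control info (header)
    payload: bytes       # data
    checksum: int        # error code (trailer)

def fragment(dest, message, max_payload=4):
    """Encapsulate a message as a sequence of packets (fragmentation)."""
    return [Packet(dest, i, chunk, sum(chunk) & 0xFF)
            for i, chunk in enumerate(message[j:j + max_payload]
                                      for j in range(0, len(message), max_payload))]

pkts = fragment(3, b"hello world")
print(len(pkts), "packets;", pkts[0])
```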
Review: Performance Metrics
[Figure: timeline of a message transfer. The sender is busy for the sender overhead, then for the transmission time (size ÷ bandwidth); after the time of flight the bits arrive over another transmission time, and the receiver is busy for the receiver overhead. Transport latency spans the transmission and time of flight; total latency covers the whole interval.]
Total Latency = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead
Does the BW calculation include the header/trailer?
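The latency formula translates directly; a sketch (Python) with illustrative numbers (the overheads and link speed below are assumptions, not from the slides):

```python
# Total latency per the formula above (times in seconds).
def total_latency(sender_ovh, time_of_flight, msg_bytes, bw_bytes_per_s,
                  receiver_ovh):
    return sender_ovh + time_of_flight + msg_bytes / bw_bytes_per_s + receiver_ovh

# e.g., 1 us overheads, 0.5 us time of flight, 1 KB message over a 1 GB/s link
t = total_latency(1e-6, 0.5e-6, 1024, 1e9, 1e-6)
print(f"total latency = {t * 1e6:.2f} us")  # ~3.52 us
```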
Switching Techniques
• Circuit switching: a control message is sent from source to destination and a path is reserved. Communication starts; the path is released when communication is complete.
• Store-and-forward policy (packet switching): each switch waits for the full packet to arrive before sending it to the next switch (good for WANs)
• Cut-through or wormhole routing: the switch examines the header, decides where to send the message, and starts forwarding it immediately
– In wormhole routing, when the head of the message is blocked, the message stays strung out over the network, potentially blocking other messages (only the piece of the packet in flight between switches needs buffering). The CM-5 uses it, with each switch buffering 4 bits per port.
– Cut-through routing lets the tail continue when the head is blocked, storing the whole message in an intermediate switch (requires a buffer large enough to hold the largest packet).
Store and Forward vs. Cut-Through
• Advantage
– latency reduces from a function of:
(number of intermediate switches × packet size)
to:
(time for the 1st part of the packet to negotiate the switches + packet size ÷ interconnect BW)
Store & Forward vs. Cut-Through Routing
[Figure: timelines for a 4-flit packet (flits 3 2 1 0) crossing three switches from source to destination. Under store & forward, the entire packet is buffered at each switch before moving on; under cut-through, the flits pipeline through the switches and arrive much earlier.]
• Latency: store & forward h(n/b + Δ) vs. cut-through n/b + hΔ
(h = number of hops, n = packet size, b = channel bandwidth, Δ = routing delay per switch)
• What if the message is fragmented?
• Wormhole vs. virtual cut-through
Example
Q. Compare the efficiency of store-and-forward (packet switching) vs. wormhole routing for transmission of a 20-byte packet between a source and destination that are d nodes apart. Each node takes 0.25 microseconds and the link transfer rate is 20 MB/sec.
Answer: Time to transfer 20 bytes over a link = 20 bytes ÷ 20 MB/sec = 1 microsecond.
Packet switching: #nodes × (node delay + transfer time) = d × (0.25 + 1) = 1.25d microseconds
Wormhole: (#nodes × node delay) + transfer time = 0.25d + 1 microseconds
Book: for d = 7, packet switching takes 8.75 microseconds vs. 2.75 microseconds for wormhole routing.
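A quick check (Python) of both formulas with the example's numbers:

```python
# The two latency models from the example above (times in microseconds).
def store_and_forward(d, node_delay=0.25, transfer=1.0):
    return d * (node_delay + transfer)          # h(n/b + Δ)

def wormhole(d, node_delay=0.25, transfer=1.0):
    return d * node_delay + transfer            # n/b + hΔ

d = 7
print(store_and_forward(d))  # 8.75 us, matching the book's number
print(wormhole(d))           # 2.75 us
```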
Contention
• Two packets trying to use the same link at the same time
– limited buffering
– drop?
• Most parallel machine networks block in place
– link-level flow control
– tree saturation
• Closed system: offered load depends on delivered load
Delay with Queuing
Suppose there are L links per node and each link sends packets to another link at λ packets/sec. The service rate (link + switch) is μ packets/sec. What is the delay over a distance D?
Answer: There is a queue at each output link to hold extra packets. Model each output link as an M/M/1 queue with input rate Lλ and service rate μ.
Delay through each link = queuing time + service time = S = 1/(μ − Lλ) for an M/M/1 queue
Delay over a distance D = S × D
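A sketch (Python) of this M/M/1 model; the numerical inputs are illustrative assumptions:

```python
# M/M/1 delay model sketched above: each output link sees arrival rate
# L * lam and service rate mu; sojourn time S = 1 / (mu - L * lam).
def hop_delay(L, lam, mu):
    assert mu > L * lam, "queue is unstable when arrivals exceed service"
    return 1.0 / (mu - L * lam)        # queuing time + service time

def path_delay(L, lam, mu, D):
    return hop_delay(L, lam, mu) * D   # delay over D hops

# e.g., 4 links/node, 100 pkts/s per link, 1000 pkts/s service, 5 hops
print(path_delay(L=4, lam=100, mu=1000, D=5))  # 5/600 s under this model
```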
Congestion Control
• Packet-switched networks do not reserve bandwidth; this leads to contention (connection-based networks limit input instead)
• Solution: prevent packets from entering until contention is reduced (e.g., freeway on-ramp metering lights)
• Options:
– Packet discarding: if a packet arrives at a switch and there is no room in the buffer, the packet is discarded (e.g., UDP)
– Flow control: between pairs of senders and receivers; use feedback to tell the sender when it is allowed to send the next packet
» Back-pressure: separate wires tell the sender to stop
» Window: give the original sender the right to send N packets before getting permission to send more; overlaps the latency of the interconnect with the overhead of sending and receiving packets (e.g., TCP, with an adjustable window); a sketch follows this list
– Choke packets: aka "rate-based"; each packet received by a busy switch in the warning state is sent back to the source as a choke packet, and the source reduces traffic to that destination by a fixed percentage (e.g., ATM)
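A toy sketch (Python, hypothetical class) of the window option above: the sender may have at most N unacknowledged packets in flight, and each acknowledgment opens the window by one.

```python
# Sliding-window flow control: at most `window` unacked packets outstanding.
class WindowSender:
    def __init__(self, window=4):
        self.window = window
        self.next_seq = 0
        self.unacked = set()

    def can_send(self):
        return len(self.unacked) < self.window

    def send(self):
        assert self.can_send()
        seq = self.next_seq
        self.unacked.add(seq)
        self.next_seq += 1
        return seq                      # packet goes onto the network

    def on_ack(self, seq):
        self.unacked.discard(seq)       # permission to send one more

s = WindowSender(window=2)
a, b = s.send(), s.send()
assert not s.can_send()                 # window full, sender must wait
s.on_ack(a)
assert s.can_send()                     # ack opens the window again
```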
Routing
• Recall: the routing algorithm determines
– which of the possible paths are used as routes
– how the route is determined
– R: N × N -> C, which at each switch maps the destination node n_d to the next channel on the route
• Issues:
– Routing mechanism
» arithmetic
» source-based port select
» table-driven
» general computation
– Properties of the routes
– Deadlock freedom
Routing Mechanism
• Need to select the output port for each input packet
– in a few cycles
• Simple arithmetic in regular topologies
– e.g., Δx, Δy routing in a grid (a sketch follows this list):
» west (−x): Δx < 0
» east (+x): Δx > 0
» south (−y): Δx = 0, Δy < 0
» north (+y): Δx = 0, Δy > 0
» processor: Δx = 0, Δy = 0
• Reduce the relative address of each dimension in order
– dimension-order routing in k-ary d-cubes
– e-cube routing in the n-cube
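A sketch (Python) of the Δx, Δy port-select table above, stepping a packet hop by hop:

```python
# Dimension-order (x-then-y) routing decision per the table above.
def route(dx, dy):
    """Return the output port for relative address (dx, dy)."""
    if dx < 0:
        return "west"    # reduce x first
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"   # then reduce y
    if dy > 0:
        return "north"
    return "processor"   # arrived: deliver to the local node

# Example hop-by-hop: relative address (2, -1) -> east, east, south, deliver
dx, dy = 2, -1
steps = []
while True:
    port = route(dx, dy)
    steps.append(port)
    if port == "east": dx -= 1
    elif port == "west": dx += 1
    elif port == "north": dy -= 1
    elif port == "south": dy += 1
    else: break
print(steps)  # ['east', 'east', 'south', 'processor']
```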
Routing Mechanism (cont)
[Figure: a source-routed message header carrying a series of port selects P3 P2 P1 P0.]
• Source-based
– the message header carries a series of port selects
– used and stripped en route
– CRC? Packet format?
– CS-2, Myrinet, MIT Arctic
• Table-driven (a sketch follows this list)
– the message header carries an index for the next port at the next switch
» o = R[i]
– the table also gives the index for the following hop
» o, i′ = R[i]
– ATM, HIPPI
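A toy sketch (Python; the switch names, indices, and ports are made up) of table-driven routing, where each lookup returns both the output port and the index used at the next hop:

```python
# Table-driven routing: at each switch, header index i selects both the
# output port o and the index i' presented to the next switch: o, i' = R[i].
tables = {
    "s0": {5: ("port2", 7)},   # at s0, index 5 -> output port2, next index 7
    "s1": {7: ("port0", 0)},   # at s1, index 7 -> output port0, deliver
}

def forward(switch, i):
    o, i_next = tables[switch][i]   # o, i' = R[i]
    return o, i_next

i = 5
o, i = forward("s0", i)
print("s0 ->", o)                   # port2
o, i = forward("s1", i)
print("s1 ->", o)                   # port0, message delivered
```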
Properties of Routing Algorithms
• Deterministic
– route determined by (source, destination), not by intermediate state (i.e., traffic)
• Adaptive
– route influenced by traffic along the way
• Minimal
– only selects shortest paths
• Deadlock-free
– no traffic pattern can lead to a situation where no packets move forward
Deadlock Freedom
• How can it arise?
– necessary conditions:
» shared resource
» incrementally allocated
» non-preemptible
– think of a channel as a shared resource that is acquired incrementally
» source buffer, then destination buffer
» channels along a route
• How do you avoid it?
– constrain how channel resources are allocated
– e.g., dimension order
• How do you prove that a routing algorithm is deadlock-free?
Proof Technique
• Resources are logically associated with channels
• Messages introduce dependences between resources as they move forward
• Need to articulate the possible dependences that can arise between channels
• Show that there are no cycles in the channel dependence graph (CDG)
– find a numbering of channel resources such that every legal route follows a monotonic sequence
• => no traffic pattern can lead to deadlock
• The network itself need not be acyclic, only the channel dependence graph
Example: k-ary 2D array
• Theorem: x,y routing is deadlock-free
• Numbering (checked in the sketch below):
– the +x channel (i,y) -> (i+1,y) gets number i
– similarly for −x, with 0 on the most positive edge
– the +y channel (x,j) -> (x,j+1) gets number N+j
– similarly for the −y channels
• Any routing sequence (x direction, turn, y direction) follows increasing channel numbers
[Figure: a 4×4 array (nodes 00–33) with channels labeled by this numbering, e.g., x channels numbered 0–3 and y channels numbered 16–20.]
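A toy check (Python) of the monotonic-numbering argument on a 4×4 array; the exact mirrored numbers assigned to the −x and −y channels are an assumption consistent with "0 on the most positive edge":

```python
# Channel numbering for x,y routing on a k x k array (here k = 4, N = 16):
# the +x channel (i, y) -> (i+1, y) gets number i; the +y channel
# (x, j) -> (x, j+1) gets N + j; -x and -y are numbered mirror-image.
k, N = 4, 16

def channel_number(src, dst):
    (x1, y1), (x2, y2) = src, dst
    if y1 == y2:                                          # x channel
        return x1 if x2 == x1 + 1 else (k - 1 - x1)       # +x, or mirrored -x
    return N + y1 if y2 == y1 + 1 else N + (k - 1 - y1)   # +y, or mirrored -y

def xy_route(src, dst):
    """x first, then y; returns the list of channels taken."""
    (x, y), (xd, yd) = src, dst
    hops = []
    while x != xd:
        nx = x + (1 if xd > x else -1)
        hops.append(((x, y), (nx, y))); x = nx
    while y != yd:
        ny = y + (1 if yd > y else -1)
        hops.append(((x, y), (x, ny))); y = ny
    return hops

nums = [channel_number(s, d) for s, d in xy_route((0, 3), (3, 0))]
assert nums == sorted(nums)     # monotonic => no cycle in the CDG
print(nums)                     # [0, 1, 2, 16, 17, 18]
```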
Deadlock-free wormhole networks?
• Basic dimension-order routing techniques don't work for k-ary d-cubes
– only for k-ary d-arrays (bi-directional)
• Idea: add channels!
– provide multiple "virtual channels" to break the dependence cycle
– good for BW too!
– no need to add links or crossbar capacity, only buffer resources
[Figure: a switch with input ports and output ports connected by a crossbar; virtual channels share the same physical ports.]
• This adds nodes to the CDG; does it remove edges?
Breaking deadlock with virtual channels
[Figure: a cyclic traffic pattern in which each packet switches from the "lo" to the "hi" virtual channel, breaking the cycle in the channel dependence graph.]
Adaptive Routing
• R: C × N × S -> C
• Essential for fault tolerance
– at least multipath
• Can improve utilization of the network
• Simple deterministic algorithms easily run into bad permutations
• Fully/partially adaptive, minimal/non-minimal
• Can introduce complexity or anomalies
• A little adaptation goes a long way!