Building an Elastic Buffer

Elastic-Buffer Flow-Control
for On-Chip Networks
George Michelogiannakis,
James Balfour, William J. Dally
Computer Systems Laboratory
Stanford University
Edited by: Abhay Bhopat
Background
 Buffer
 Elastic Buffer
 Elastic Buffer design
2
Introduction
 Elastic-buffer (EB) flow-control uses the channels
as distributed FIFOs
• Input buffers at routers are not needed
 Can provide 12% more throughput per unit power
 Reduces router cycle time by 18%
• Compared to VC routers
3
Outline
 Building elastic-buffered channels
• By using what is already there
 Router microarchitecture
 Deadlock avoidance
 Load-sensing for adaptive routing
 Evaluation
4
The Idea
 Use the network channels as distributed FIFOs
 Use that storage instead of input buffers at
routers
• To remove input buffer area and power costs
[Figure: a pipelined channel, and the same channel used as a FIFO]
5
Building an Elastic Buffer
 To build an EB in a pipelined channel with
master-slave flip-flops (FFs):
 Use latches for storage by driving their enables
independently
[Figure: a master-slave FF, and the elastic buffer built from it]
6
Expanded view of EB control logic
7
How Elastic Buffer Channels Work
 Ready/valid handshake between elastic buffers
• Ready: At least one free storage slot
• Valid: Non-empty (driving valid data)
[Figure: ready/valid handshake example, shown cycle by cycle (cycles 1 to 6)]
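The handshake can be sketched as a small cycle-level model. This is only an illustrative sketch, not the authors' circuit: it assumes each EB holds up to two flits (one per latch), and the names `ElasticBuffer` and `step` are hypothetical.

```python
from collections import deque


class ElasticBuffer:
    """One EB, modeled as a two-slot FIFO (assumption: one flit per latch)."""

    CAPACITY = 2

    def __init__(self):
        self.slots = deque()

    @property
    def ready(self):
        # Ready: at least one free storage slot.
        return len(self.slots) < self.CAPACITY

    @property
    def valid(self):
        # Valid: non-empty, i.e. driving valid data.
        return len(self.slots) > 0


def step(channel, incoming=None, sink_ready=True):
    """Advance a chain of EBs by one cycle.

    Ready/valid are sampled from the state at the start of the cycle,
    so a flit moves at most one stage per cycle.
    Returns (flit delivered to the sink or None, whether `incoming` was accepted).
    """
    n = len(channel)
    # moves[i] is True when stage i hands a flit to stage i + 1 (or to the sink).
    moves = [
        channel[i].valid and (channel[i + 1].ready if i + 1 < n else sink_ready)
        for i in range(n)
    ]
    accepted = incoming is not None and channel[0].ready
    delivered = None
    # Apply transfers from the sink end backwards so each pop precedes the push.
    for i in reversed(range(n)):
        if moves[i]:
            flit = channel[i].slots.popleft()
            if i + 1 < n:
                channel[i + 1].slots.append(flit)
            else:
                delivered = flit
    if accepted:
        channel[0].slots.append(incoming)
    return delivered, accepted


if __name__ == "__main__":
    channel = [ElasticBuffer() for _ in range(3)]
    for flit in range(8):  # sink blocked: the channel fills up
        _, accepted = step(channel, incoming=flit, sink_ready=False)
        print(f"inject {flit}: accepted={accepted}")
    while any(eb.valid for eb in channel):  # sink ready again: drain in order
        delivered, _ = step(channel)
        print("delivered", delivered)
```

With the sink blocked, this three-EB channel accepts six flits (two per EB) before ready deasserts at the source. Once the sink is ready again, the flits drain in FIFO order.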
8
Control Logic Area Overhead
 Control logic is implemented as a four-state FSM
with 10 gates and 2 FFs
• Cost is amortized over channel width
 Example: control logic increases
area of a 64-bit channel by 5%
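A rough way to see the amortization, as a sketch only: assume the control logic has a fixed area and the datapath area grows linearly with channel width (both are assumptions, not figures from the slides). The 5% at 64 bits is the only data point taken from the slide; `control_overhead` is a hypothetical helper.

```python
def control_overhead(width_bits, ref_width=64, ref_overhead=0.05):
    """Fractional area overhead of fixed-size control logic.

    Assumes datapath area is proportional to width and control area is
    constant, anchored at the slide's example (5% on a 64-bit channel).
    """
    return ref_overhead * ref_width / width_bits


if __name__ == "__main__":
    for width in (32, 64, 128):
        print(f"{width}-bit channel: {control_overhead(width):.1%} overhead")
```

Under these assumptions the overhead halves each time the channel width doubles.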
9
Outline
 Building elastic-buffered channels
 Router microarchitecture
• Use EB flow-control through the router
 Deadlock avoidance
 Load-sensing for adaptive routing
 Evaluation
10
Use EB Flow-Control Through the Router
[Figure: VC input-buffered router compared with EB router]
• Input buffer replaced by input EB.
• VC & SW allocators removed. Per-output arbiters instead.
• Three-slot output EB to cover for arbitration done one cycle in advance.
• LA routing also applicable to EB networks.
11
Topology
2D 4x4 FBFly
12
Separate routers for networks
13
Outline
 Building elastic-buffered channels
 Router microarchitecture
 Deadlock avoidance
• How to provide isolation without VCs
 Load-sensing for adaptive routing
 Evaluation
14
Deadlock Avoidance: Duplicate Channels
 No input buffers, so no virtual channels
 Three types of possible deadlocks:
1. Protocol deadlock
2. Cyclic flit dependency in network
 Solution: Duplicate physical channels
15
Deadlock Avoidance: No Interleaving
3. Interleaving deadlock
• New head flits require destination registers
• Occupied destination registers depend on tail flits
• Tail flits cannot bypass the new head flit
 Solution: Disallow packet interleaving
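One way to disallow interleaving is for a per-output arbiter to hold its grant until the tail flit passes. The sketch below assumes a round-robin choice of new packets and represents each flit as a hypothetical `(is_head, is_tail)` tuple; it is not the authors' arbiter.

```python
class NonInterleavingArbiter:
    """Per-output arbiter that locks the output to one packet at a time."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.owner = None    # input whose packet currently holds the output
        self.next_start = 0  # round-robin pointer for picking a new packet

    def grant(self, requests):
        """requests[i] is None or an (is_head, is_tail) tuple for input i.

        Returns the index of the granted input, or None if the output idles.
        """
        if self.owner is None:
            # Output is free: only a head flit may claim it.
            for k in range(self.num_inputs):
                i = (self.next_start + k) % self.num_inputs
                if requests[i] is not None and requests[i][0]:
                    self.owner = i
                    self.next_start = (i + 1) % self.num_inputs
                    break
        winner = self.owner
        if winner is None or requests[winner] is None:
            # The owner has no flit this cycle; other packets still wait.
            return None
        if requests[winner][1]:
            self.owner = None  # tail flit releases the output
        return winner


if __name__ == "__main__":
    HEAD, BODY, TAIL = (True, False), (False, False), (False, True)
    arb = NonInterleavingArbiter(2)
    for reqs in ([HEAD, HEAD], [BODY, HEAD], [None, HEAD], [TAIL, HEAD], [None, HEAD]):
        print(arb.grant(reqs))
```

In the third cycle of the demo the output idles even though input 1 has a head flit waiting, because input 0's packet has not yet sent its tail.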
16
Duplicating Channels Between Routers
 Duplicate channels with neckdown
• Small improvement (still one switch port), large cost
 Duplicate channels with duplicate switch ports
• Excessive cost (switch quadratic cost)
17
Dividing Into Sub-Networks More Efficient
 Divide into sub-networks
• Double bandwidth, double the cost
• However, when narrowing the datapath to normalize
for throughput or power, sub-networks are more beneficial
• Again, due to switch quadratic cost
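The comparison of the duplication options can be worked through with a simple cost model. As an assumption for illustration only, crossbar cost is taken to be proportional to (ports × width)²; the 5-port, 64-bit baseline is an arbitrary example and `crossbar_cost` is a hypothetical helper.

```python
def crossbar_cost(ports, width_bits):
    """Relative crossbar cost, assuming cost ~ (ports * width)^2."""
    return (ports * width_bits) ** 2


if __name__ == "__main__":
    ports, width = 5, 64
    base = crossbar_cost(ports, width)
    options = {
        "duplicate switch ports": crossbar_cost(2 * ports, width),
        "two sub-networks, full width": 2 * crossbar_cost(ports, width),
        "two sub-networks, half width": 2 * crossbar_cost(ports, width // 2),
    }
    for name, cost in options.items():
        print(f"{name}: {cost / base:.2f}x baseline switch cost")
```

Under this model, doubling the switch ports quadruples the switch cost, two full-width sub-networks double it, and two half-width sub-networks (same aggregate channel width as the baseline) halve it.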
18
Outline
 Building elastic-buffered channels
 Router microarchitecture
 Deadlock avoidance
 Load-sensing for adaptive routing
• Propose a load metric for EB networks
 Evaluation
19
Congestion metrics
 Blocked Cycles
 Blocked Ratio
 Output Occupancy
 Channel Occupancy
 Channel Delay
20
Output Channel Occupancy Load Metric
 Flit-buffered networks use credit count
 EB networks measure output channel occupancy
• At a certain segment of the output channel (shown in red)
• Occupancy decremented when flits leave that segment
• Incremented by a packet’s length when routing decision is
made. Packets see other decisions in same cycle
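The occupancy counters described above can be sketched as follows. The class and method names are hypothetical, and picking the output with the lowest count is an assumed selection policy for illustration.

```python
class OutputOccupancy:
    """Per-output occupancy counters used as a load estimate."""

    def __init__(self, num_outputs):
        self.count = [0] * num_outputs

    def on_routing_decision(self, output, packet_len_flits):
        # Incremented by the packet's length as soon as it is routed, so
        # later decisions in the same cycle already see this packet.
        self.count[output] += packet_len_flits

    def on_flit_leaves_segment(self, output):
        # Decremented when a flit leaves the monitored channel segment.
        self.count[output] -= 1

    def least_loaded(self, candidates):
        # Assumed policy: choose the candidate output with the lowest count.
        return min(candidates, key=lambda out: self.count[out])


if __name__ == "__main__":
    occ = OutputOccupancy(3)
    occ.on_routing_decision(0, packet_len_flits=4)
    print(occ.least_loaded([0, 1]))  # output 1 now looks less loaded
```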
21
Outline
 Building elastic-buffered channels
 Router microarchitecture
 Deadlock avoidance
 Load-sensing for adaptive routing
 Evaluation
• Compare throughput, power, area, latency, cycle time
22
Evaluation Methodology
 Used a modified version
 Area/power estimations from a 65nm library
• Input buffers modeled as SRAM cells
• Throughput/power optimal # of VCs and buffer depth
• Two sub-networks: request and reply
 Averaged over a set of 6 traffic patterns
 Constant packet size (512 bits)
 Swept channel width from 28 to 192 bits
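With a constant 512-bit packet, the channel width sets how many flits each packet needs. A small sketch, assuming one flit per channel width and no extra header bits (both assumptions); `flits_per_packet` is a hypothetical helper.

```python
import math


def flits_per_packet(channel_width_bits, packet_bits=512):
    """Flits needed to carry one packet, assuming flit size equals channel width."""
    return math.ceil(packet_bits / channel_width_bits)


if __name__ == "__main__":
    for width in (28, 64, 192):
        print(f"{width}-bit channel: {flits_per_packet(width)} flits per packet")
```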
23
Throughput-Power Gains in 2D Mesh
Throughput gain
EB network improvement:
• Same power: 10% increased throughput
• Same throughput: 12% reduced power
24
Throughput-Area Gains in 2D Mesh
2% improvement
for EB networks
25
Latency-Throughput in 2D Mesh
Zero-load latency equal
26
Power Breakdown: No Input Buffer Power
[Figure: mesh low-swing power breakdown (2% packet injection rate), EBN vs. VC-Buff, in W (axis 0 to 0.8). Components: channel traversal, channel clock, channel FF, input buffer read, input buffer write, crossbar power, crossbar control, output FF, output clock]
27
Area Breakdown: No Input Buffer Area
[Figure: low-swing mesh area breakdown, VC-Buff vs. EBN, in mm2 (axis 0.0 to 1.2). Components: channel, switch, input, output]
28
Router RTL Implementation
 No buffers, VCs, allocators, credits
• VC router had look-ahead routing
 Buffers: FF arrays. 2 VCs, 8 slots each
45nm, LP-CMOS, worst-case
Mesh 5x5 routers. DOR. 64-bit datapath
Aspect        VC router   EB router   Savings
Area (μm2)    63,515      14,730      77%
Clock (ns)    3.3         2.7         18%
Power (mW)    2.59        0.12        95%
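The Savings column follows from the VC and EB figures as (VC − EB) / VC. A quick check, using only the numbers in the table (`savings` is a hypothetical helper):

```python
def savings(vc_value, eb_value):
    """Fractional reduction of the EB router relative to the VC router."""
    return (vc_value - eb_value) / vc_value


if __name__ == "__main__":
    rows = {
        "Area (um2)": (63515, 14730),
        "Clock (ns)": (3.3, 2.7),
        "Power (mW)": (2.59, 0.12),
    }
    for aspect, (vc, eb) in rows.items():
        print(f"{aspect}: {savings(vc, eb):.0%} savings")
```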
29
Conclusions
 EB flow-control uses channels as distributed FIFOs
• Removes input buffers from routers
• Uses duplicate physical channels instead of VCs
 Increases throughput per unit power by up to 12%
for low-swing channels
• Depends on what fraction of the overall cost input buffers
constitute
 Reduces router cycle time by 18%
 Flow-control choice depends on design parameters
and priorities
30
Thanks for your
attention
Questions?