A Novel 3D Layer-Multiplexed On

Design of a High-Throughput
Distributed Shared-Buffer NoC Router
Rohit Sunkam Ramanujam*, Vassos Soteriou†,
Bill Lin*, Li-Shiuan Peh‡
*Dept. of Electrical Engineering, UCSD, USA
†Dept. of Electrical Engineering, CUT, Cyprus
‡Dept. of Electrical Eng. and Computer Science, MIT, USA
1
Chip Multiprocessors are a reality …
[Figure: a uniprocessor compared with chip multiprocessors. Sources: Intel Inc. and Tilera Inc.]
2
The need for a Network on Chip (NoC)
[Figure labels: Compute Unit, Router]
• Scalable communication
• Modular design
• Efficient use of wires
• A new way to organize and build VLSI systems
3
The Problem – Delivering high throughput in NoCs
• Why Care?
– NoCs in CMPs connect general-purpose processors.
– Future applications unknown → traffic unknown.
– Exploiting parallelism needs fine-grained interaction
between cores.
– Can expect high traffic volume for current and
future applications running on many-core
processors.
– E.g., cache coherence among a large number of
distributed shared L2 caches.
4
An important design choice that
affects throughput
• Router microarchitecture
– How well does a router multiplex packets onto its
output links?
5
NoC routers – Current design
Input Buffered Routers (IBRs) – Flits buffered at the input ports
[Figure: input-buffered router with Input 1, Input 2, a crossbar, Output 1 and Output 2, animated over cycles 1 to 3. Maximal matchings shown: Input 1 → Output 1 and Input 2 → Output 1.]
Output 2 is unutilized in cycle 3 although there is a flit destined for output 2.
Bottleneck: Maximal matching used for arbitration is not good enough.
(70-80% efficiency)
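The bottleneck above can be illustrated with a small sketch. It is not the arbiter from the talk: the function name, the request table and the fixed input-order priority are all assumptions chosen for illustration. A greedy allocator produces a matching that is maximal (no further grant can be added) but not maximum, so an output stays idle even though a flit is waiting for it.

```python
def greedy_maximal_matching(requests):
    """Grant at most one output per input, visiting inputs in ascending order.

    requests maps an input port to the outputs it has flits for, in the
    order the (hypothetical) arbiter considers them.
    """
    granted = {}
    used_outputs = set()
    for inp in sorted(requests):
        for out in requests[inp]:
            if out not in used_outputs:
                granted[inp] = out
                used_outputs.add(out)
                break
    return granted


# Input 1 holds flits for Outputs 1 and 2; Input 2 holds a flit for Output 1.
requests = {1: [1, 2], 2: [1]}
matching = greedy_maximal_matching(requests)
idle_outputs = {1, 2} - set(matching.values())

print(matching)      # Input 1 is granted Output 1; Input 2 gets nothing
print(idle_outputs)  # Output 2 is idle although Input 1 has a flit for it
```

A maximum matching for the same requests would be Input 1 → Output 2 and Input 2 → Output 1, which keeps both links busy.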
6
Output queueing to the rescue …
Output buffered router (OBR) – Flits buffered at the output ports
[Figure: output-buffered router with Input 1, Input 2, a crossbar, Output 1 and Output 2, animated over cycles 1 to 3.]
Output links are always utilized when there are flits available.
Better multiplexing of flits onto output links ⇒ higher throughput.
7
How much difference does it make?
Uniform Traffic
A throughput gap of 18%!
8
How much difference does it make?
Complement Traffic
A throughput gap of 12%!
9
How much difference does it make?
Tornado Traffic
A throughput gap of 22%!
10
Output Buffering is great …
• OBRs offer much higher throughput than IBRs.
• OBRs have predictable delay.
– Queuing delay modeled using M/D/1 queues.
• Packet delays not predictable for IBRs.
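As a rough illustration of the M/D/1 model mentioned above, the mean time a flit waits in an output queue can be computed from the standard M/D/1 result W_q = ρ·s / (2(1 − ρ)), where s is the fixed service time and ρ = λ·s is the link utilization. The function name and the one-cycle default service time are assumptions for this sketch, not values from the talk.

```python
def md1_mean_wait(arrival_rate, service_time=1.0):
    """Mean waiting time in queue for an M/D/1 queue.

    arrival_rate: Poisson arrival rate (flits per cycle).
    service_time: deterministic service time (cycles per flit); assumed 1.
    """
    rho = arrival_rate * service_time  # utilization of the output link
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return rho * service_time / (2.0 * (1.0 - rho))


for load in (0.25, 0.5, 0.75):
    print(load, md1_mean_wait(load))
```

The delay grows sharply as the load approaches 1, which is why the model is useful for predicting latency near saturation.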
12
So why aren’t OBRs used in NoCs ?
[Figure: output-buffered router with Input 1 … Input P-1, a crossbar, and Output 1 … Output P-1.]
• Implementing Output Buffering requires either:
– Crossbar speedup of P, where P is the number of ports.
Not practical for aggressively clocked designs.
– Output buffers with P write ports and a P × P² crossbar.
Has huge area and power penalties.
13
Our approach: Emulate Output Queueing without any speedup
Step 1: Timestamp the flits. Assign a future time at which a flit would depart the router assuming output buffering.
Step 2: Find a conflict-free middle memory.
Step 3: Move flits from input buffers to middle memories.
Step 4: When current time == timestamp, read the flit from the middle memory to the output port.
[Figure: Inputs 1 to 3 → Crossbar 1 → Middle Memories → Crossbar 2 → Outputs 1 to 3; current time = 4, flits labeled with their timestamps.]
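Step 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' hardware logic: the class name is hypothetical, and it assumes each output link sends one flit per cycle and that the earliest possible departure is the cycle after arrival. Under first-come-first-served output queueing, a flit then departs one cycle after the previous flit bound for the same output, or in the next cycle if that output's queue is empty.

```python
class OutputQueueTimestamper:
    """Tracks, per output port, the departure time given to the last flit."""

    def __init__(self, num_outputs):
        # Assumption: time starts at cycle 0 and no flit has departed yet.
        self.last_departure = [0] * num_outputs

    def assign(self, output, now):
        """Return the cycle in which an ideal output-buffered router would
        send this flit on the given output link."""
        timestamp = max(now + 1, self.last_departure[output] + 1)
        self.last_departure[output] = timestamp
        return timestamp


ts = OutputQueueTimestamper(num_outputs=3)
now = 4
# Three flits for output 0 arrive in cycle 4, plus one flit for output 1.
stamps = [ts.assign(0, now), ts.assign(0, now), ts.assign(0, now), ts.assign(1, now)]
print(stamps)
```

Flits for the same output receive consecutive timestamps, so no two flits for one output ever share a departure cycle.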
14
Arrival and Departure Conflicts
• Arrival Conflicts – With P input ports, a flit can
have an arrival conflict with P-1 other flits.
• Departure Conflicts – With P output ports, a flit
can have a departure conflict with P-1 other
flits.
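• By the pigeonhole principle, 2P-1 middle memories are
needed to avoid all arrival and departure conflicts:
at most P-1 memories are blocked by arrival conflicts
and at most P-1 by departure conflicts, so at least one
of the 2P-1 is always free.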
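The conflict check can be sketched as below. The function name and data layout are assumptions for illustration: each middle memory is taken to have one write port and one read port, so a flit cannot use a memory that another flit is writing in the same cycle (arrival conflict) or one that already holds a flit with the same timestamp (departure conflict).

```python
def find_conflict_free_memory(num_memories, written_this_cycle, departures, timestamp):
    """Return the index of a usable middle memory, or None if all conflict.

    written_this_cycle: set of memory indices taken by other flits arriving now.
    departures: departures[m] is the set of timestamps stored in memory m.
    timestamp: departure time already assigned to the flit being placed.
    """
    for m in range(num_memories):
        if m in written_this_cycle:
            continue  # arrival conflict: write port already in use
        if timestamp in departures[m]:
            continue  # departure conflict: read port needed in that cycle
        return m
    return None


P = 3
# Worst case for one flit: P-1 other arrivals and P-1 stored flits that
# share its timestamp, all in different memories.
written = {0, 1}
with_enough = [set(), set(), {7}, {7}, set()]   # 2P-1 = 5 memories
too_few = [set(), set(), {7}, {7}]              # 2P-2 = 4 memories

print(find_conflict_free_memory(2 * P - 1, written, with_enough, 7))
print(find_conflict_free_memory(2 * P - 2, written, too_few, 7))
```

With 2P-1 memories the worst case still leaves one memory free; with one fewer, the same flit has nowhere to go in that cycle.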
15
The Distributed Shared-Buffer Router
(DSB)
• Aims at emulating the packet servicing scheme of an
OBR with limited buffers and no speedup.
– First-Come-First-Served servicing of flits.
Objectives:
– Close the performance gap between OBRs with infinite buffers
and IBRs (high throughput).
– Make a feasible design → low power and area overhead.
– Make packet delays more predictable for delay sensitive NoC
applications.
16
DSB Router
Innovations
– Router pipeline with new stages for:
• Timestamping flits
• Finding a conflict free middle memory
– Complexity and delay-balanced pipeline stages for a
high-clocked, high-performance implementation.
– New flow control to prevent packet dropping when
resources are unavailable.
– Evaluate power-performance tradeoff of DSB
architectures with fewer than 2P-1 middle memories.
17
Evaluation
• Cycle accurate flit level simulator.
• Mesh topology – Each router has 5 ports,
NSEW + Injection/Ejection.
• Dimension Ordered Routing (DOR) – decouple
effects of routing algorithm on network
performance.
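For reference, dimension-ordered routing on a mesh can be sketched as follows. The port names follow the NSEW + Injection/Ejection ports listed above; the X-before-Y order, the coordinate convention (north increases y) and the function names are assumptions of this sketch.

```python
# Coordinate change for each mesh direction (assumed convention).
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def dor_next_port(cur, dst):
    """Output port chosen at router `cur` for a flit headed to `dst`."""
    (cx, cy), (dx, dy) = cur, dst
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"
    if dy < cy:
        return "S"
    return "EJECT"  # arrived: leave through the ejection port


def dor_route(src, dst):
    """List of output ports taken hop by hop, ending with ejection."""
    ports, cur = [], src
    while True:
        port = dor_next_port(cur, dst)
        ports.append(port)
        if port == "EJECT":
            return ports
        cur = (cur[0] + STEP[port][0], cur[1] + STEP[port][1])


print(dor_route((0, 0), (2, 1)))
```

Because the route depends only on source and destination, every packet between a given pair takes the same path, which keeps routing effects out of the router comparison.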
18
Evaluation – Traffic traces
• 3 Synthetic traffic traces:
– Uniform
– Bit Complement (Complement)
– Tornado
• Real traffic/memory traces from running
multiple threads (49 threads ⇒ 7x7 Mesh) of
eight SPLASH-2 benchmarks:
– Complex 1D FFT, LU decomposition, Water-nsquared,
Water-spatial, Ray tracer, Barnes-Hut,
Integer Radix sort, Ocean simulation.
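The three synthetic patterns can be sketched as destination functions on a k × k mesh. The slides do not define them, so the definitions below are commonly used conventions taken as assumptions: complement mirrors each coordinate (which equals bitwise complement of the coordinate when k is a power of two), and tornado shifts the x coordinate by ⌈k/2⌉ − 1. All function names are hypothetical.

```python
import math
import random


def uniform_dest(src, k, rng=random):
    """Destination chosen uniformly at random among the other nodes."""
    while True:
        dst = (rng.randrange(k), rng.randrange(k))
        if dst != src:
            return dst


def complement_dest(src, k):
    """Mirror both coordinates across the centre of the mesh."""
    x, y = src
    return (k - 1 - x, k - 1 - y)


def tornado_dest(src, k):
    """Shift the x coordinate by ceil(k/2) - 1, wrapping around."""
    x, y = src
    return ((x + math.ceil(k / 2) - 1) % k, y)


k = 7  # matches the 7x7 mesh used for the 49-thread traces
print(complement_dest((0, 0), k))
print(tornado_dest((0, 0), k))
print(uniform_dest((0, 0), k, random.Random(1)))
```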
19
Performance on Uniform traffic
A throughput gap of just 9%
20
Performance on Complement traffic
A throughput gap of just 4%
21
Performance on Tornado traffic
A throughput gap of just 8%
22
Performance of DSB on SPLASH-2 benchmarks
• Performance of DSB is very close to an OBR with the same number of pipeline stages.
• Huge performance improvements over IBR in traces exhibiting high contention and demanding high bandwidth.
• Small difference in packet latency between OBR and DSB is mainly due to the limited buffering in the DSB router.
• Raytrace, Barnes and Ocean have very little contention. For these traces, IBR has lower latency because of a shorter pipeline.
[Figure: bar chart comparing IBR200, DSB200 and OBR-5stage on the SPLASH-2 traces (y-axis 0 to 1.6); annotations: 72%, 64%, 97%.]
23
Input Buffered Router (IBR) pipeline
Stages: RC → VA → SA → ST → LT
[Figure: Input 1, Input 2 → Crossbar → Output 1, Output 2]
• RC (Route Computation): Determine the output port of the flit based on the destination coordinates.
• VA (Virtual Channel Allocation): Reserve an output Virtual Channel (buffering) at the next hop router.
• SA (Switch Arbitration): Acquire access to the output port through the crossbar.
• ST (Switch Traversal): Traverse the crossbar to reach the output link.
• LT (Link Traversal): Traverse the link to reach the input buffer of the next hop router.
24
Distributed Shared-Buffer Router pipeline
Stages: RC → TS → CR → VA → XB1 + MM_WR → MM_RD + XB2 → LT (the diagram marks a feedback path labeled "If CR or VA fails")
[Figure: Input 1, Input 2 → Crossbar 1 → Middle Memory → Crossbar 2 → Output 1, Output 2]
• RC (Route Computation): Determine the output port of the flit based on the destination coordinates.
• TS (Timestamp): Assign a timestamp to the flit for the output port requested. Timestamp is the future time (cycle) at which the flit can depart the middle memory buffer.
• CR (Conflict Resolution): Find a conflict-free middle memory.
• VA (Virtual Channel Allocation): Reserve a virtual channel at the input of the next hop router.
• XB1 + MM_WR (Crossbar 1 + Middle Memory Write): Flit traverses the first crossbar and gets written into the assigned middle memory.
• MM_RD + XB2 (Middle Memory Read + Crossbar 2): When current time equals the timestamp, the flit is read from the middle memory and traverses the second crossbar.
• LT (Link Traversal): Flit traverses the output link to reach the input buffer of the next-hop router.
25
Higher throughput – At what cost?
Extra power !!
Stages: RC → TS → CR → VA → XB1 + MM_WR → MM_RD + XB2 → LT
[Figure: Input 1, Input 2 → Crossbar 1 → Middle Memory → Crossbar 2 → Output 1, Output 2]
• Two crossbars instead of one: With N middle memories, need one PxN and one NxP crossbar.
• Middle memory buffers – Can have fewer input buffers to compensate for extra middle memory buffers.
• TS stage instead of Switch Arbitration in IBRs.
• Extra stage for Conflict Resolution.
26
Power-Performance tradeoff
• Theoretically, 2P-1 middle memories needed
to resolve all conflicts.
• For a 5-port mesh router, need 2P-1 = 9 middle
memories, a 5x9 and a 9x5 crossbar – large
power overhead.
• What is the impact of using fewer than 2P-1
middle memories?
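The arithmetic behind this tradeoff can be sketched as follows. Counting crosspoints is used here only as a rough proxy for crossbar cost; that proxy, and the function names, are assumptions of this sketch rather than the power model used in the talk.

```python
def middle_memories_needed(ports):
    """Middle memories needed to avoid every arrival and departure conflict."""
    return 2 * ports - 1


def dsb_crosspoints(ports, num_memories):
    """Crosspoints in the two DSB crossbars: one PxN plus one NxP."""
    return ports * num_memories + num_memories * ports


def ibr_crosspoints(ports):
    """Crosspoints in the single PxP crossbar of an input-buffered router."""
    return ports * ports


P = 5
full = middle_memories_needed(P)
print(full)                    # middle memories for a 5-port router
print(ibr_crosspoints(P))      # single 5x5 crossbar
print(dsb_crosspoints(P, full))  # 5x9 plus 9x5
print(dsb_crosspoints(P, 5))   # a reduced design with 5 middle memories
```

By this crude measure, going from 9 to 5 middle memories cuts the crossbar cost from 3.6 times to 2 times that of the input-buffered router, at the price of occasional unresolved conflicts.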
27
Power and Area Comparison
[Figure: Normalized Tile Power for IBR and DSB, with NoC power at 10%, 15% and 20% of tile power (x-axis: NoC power as a percentage of Tile power).]
[Figure: Normalized Power and Normalized Area for IBR and DSB.]
• Router power overhead of 50% for DSB-5 router.
• If NoC consumes 10% of tile power, tile power overhead of only 3.5% for DSB-5 router.
• If NoC consumes 20% of tile power, tile power overhead of only 7% for DSB-5 router.
28
Thank you
• Questions?
29