Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support
Natalie Enright Jerger, Li-Shiuan Peh, and Mikko Lipasti
University of Wisconsin – Madison and Princeton University
Enright Jerger - ISCA 2008, 6/24/2008

Executive Summary
Demonstrate necessity of multicasting on-chip
  State-of-the-art router insufficient
  Significant number of proposals could leverage multicasting
Provide efficient multicasting solution using Virtual Circuit Trees
  Overlay logical routing trees on mesh network
  Reduces interconnect latency by up to 90%
  Reduces switching activity by up to 53%

Packet-Switched Unicast Router
3-stage packet-switched router
  Based on most aggressive recent proposals
[Figure: router pipeline across two routers; stage labels: Buffer Write, Virtual Channel/Switch Allocation, Switch Traversal, Link Traversal]
Aggressive baseline not well matched to all types of communication
  Multicast is performed using multiple unicasts

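To make the cost of "multiple unicasts" concrete, the following minimal sketch counts link traversals for a multicast sent as separate unicasts versus one copy per link on a shared tree. It assumes a row-major node numbering on a mesh with X-Y routing; the function names and the example destination set are illustrative and do not come from the slides.

```python
def xy_route(src, dst, cols):
    """Return the directed links (node, next_node) on the X-Y path from src to dst."""
    links = []
    cur = src
    # Route along X (columns) first.
    while cur % cols != dst % cols:
        nxt = cur + 1 if cur % cols < dst % cols else cur - 1
        links.append((cur, nxt))
        cur = nxt
    # Then route along Y (rows).
    while cur // cols != dst // cols:
        nxt = cur + cols if cur // cols < dst // cols else cur - cols
        links.append((cur, nxt))
        cur = nxt
    return links


def unicast_link_traversals(src, dests, cols):
    """Every destination gets its own packet, so shared links are crossed repeatedly."""
    return sum(len(xy_route(src, d, cols)) for d in dests)


def tree_link_traversals(src, dests, cols):
    """One packet copy per link: count each link of the merged paths once."""
    shared = set()
    for d in dests:
        shared.update(xy_route(src, d, cols))
    return len(shared)


# Hypothetical example on a 4x4 mesh (nodes 0..15, row-major).
unicast_cost = unicast_link_traversals(0, [3, 7, 15], cols=4)
tree_cost = tree_link_traversals(0, [3, 7, 15], cols=4)
```

In this example the three unicasts cross 13 links in total, while the merged paths cover only 6 distinct links. The source also injects three packets instead of one.
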
State-of-the-Art Router
[Figure: Interconnect Latency (y-axis, 6 to 20) vs. Network Load (% of Link Capacity, 0% to 50%); series: No MC, MC (1%), MC (5%), MC (10%)]
Current router architecture poorly equipped to handle even a low amount of multicast (MC) traffic

Outline
Motivation
  Baseline router problems
VCTM Implementation
  Example
  Architecture
Multicasting Scenarios
  Description
  Characterization
Evaluation
Conclusion

Baseline Router Example
[Figure: routers A, B, C, D, each with VCs; unicast copies 1A to 1D and 2A to 2D of two multicasts contend for busy VCs (X marks a blocked path)]
More resources to solve this problem?
  More buffers, virtual channels, links?

Key Router Problems
Injection Bandwidth: burst of messages at network interface
Redundant (wasteful) use of resources: same payload occupying extra buffers, links
Alternative routing: improves throughput, but wastes power
Speculation Problems: predicated on low loads (burst of messages)
[Figure: routers A, B, C, D with VCs; packets 1A to 1D and 2A to 2D; busy VCs (X marks a blocked path)]

Virtual Circuit Tree Multicasting Overview
Builds on existing state-of-the-art router
  Unicast performance is not impacted
Build multicast trees incrementally
Tree reuse is necessary for effectiveness
Significant temporal destination set reuse across all scenarios
Fewer packets improve speculation
Example: Multicast from 0 to <2,4,5>
  Build Tree Incrementally (Tree M)
  3 Unicast Setup Packets (1 per destination)
  3 Packets Injected into Network
  Injection problem solved
  Link Redundancy Removed
[Figure: mesh nodes 0 to 5 with setup packets A, B, C and multicast M; tree table entries shown at routers: M: <East>, M: <East, South>, M: <Eject, South>, M: <Eject>]

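The incremental build of Tree M for the multicast from 0 to <2,4,5> can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes nodes 0 to 5 are numbered row-major on a 3-wide mesh with node 0 in the top-left corner, that South means the next row, and that X-Y routing is used. The function name `build_tree` is hypothetical.

```python
from collections import defaultdict


def build_tree(src, dests, cols):
    """Build per-router output-port sets for one tree, one destination at a time."""
    table = defaultdict(set)
    for dst in dests:  # one unicast setup packet per destination
        cur = src
        while cur != dst:
            if cur % cols != dst % cols:
                # X dimension first (assumed X-Y routing).
                east = cur % cols < dst % cols
                port, nxt = ("East", cur + 1) if east else ("West", cur - 1)
            else:
                # Then Y dimension; South is assumed to be the next row.
                south = cur // cols < dst // cols
                port, nxt = ("South", cur + cols) if south else ("North", cur - cols)
            table[cur].add(port)  # the setup packet adds its output to the entry
            cur = nxt
        table[dst].add("Eject")
    return dict(table)


# Tree M: multicast from node 0 to <2,4,5> on the assumed 3-wide mesh.
tree_m = build_tree(0, [2, 4, 5], cols=3)
```

Under these assumptions the resulting entries are <East> at node 0, <East, South> at node 1, <Eject, South> at node 2, and <Eject> at nodes 4 and 5, which agrees with the entries listed on the slide.
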
VCTM Router Architecture
[Figure: router with a Virtual Circuit Tree Table (field labels: Src, VCTnum, Id, Ej, N, S, E, W, Fork), Virtual Channel Allocator, Switch Allocator, and Input Ports with VC 0 to VC x and MVC 0]

Implementation Details (1)
Destination Set Content Addressable Memory
[Figure: destination set CAM with one destination bit vector per entry; lookup of Destination Set <5,4,2> yields Tree ID 2]
Encode Tree ID 2 into multicast header
If not present -> replace oldest tree -> perform setup

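The destination set lookup can be sketched in software as follows. This is a rough model under stated assumptions, not the hardware design: the class name `DestinationSetCAM` is hypothetical, "oldest" is taken to mean the entry installed earliest, and an evicted entry's tree ID is reused for the new tree.

```python
from collections import OrderedDict


class DestinationSetCAM:
    """Software model of a destination set CAM that maps sets to tree IDs."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = OrderedDict()  # frozenset(dests) -> tree ID, oldest first

    def lookup(self, dests):
        """Return (tree_id, needs_setup) for a destination set."""
        key = frozenset(dests)
        if key in self.entries:
            return self.entries[key], False  # hit: tree ID goes into the header
        if len(self.entries) < self.num_entries:
            tree_id = len(self.entries)  # free entry available
        else:
            # Miss with a full CAM: replace the oldest tree and reuse its ID.
            _, tree_id = self.entries.popitem(last=False)
        self.entries[key] = tree_id
        return tree_id, True  # caller must perform tree setup


cam = DestinationSetCAM(num_entries=2)
first = cam.lookup([5, 4, 2])   # miss, tree must be set up
second = cam.lookup([2, 4, 5])  # same set in another order, hit
```

Using a frozenset as the key mirrors the bit-vector match in the CAM: the order in which destinations are listed does not matter.
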
Implementation Details (2)
VCTs provide routing not resources
  VCTs do not pre-allocate resources
Multicast arbitration same as unicast
Multiple arbitration steps at tree branch
  If one desired output is blocked, other tree branch outputs can still proceed
  Longer buffer occupancy

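The branch behavior described above (free outputs proceed while a blocked one waits, at the cost of longer buffer occupancy) can be illustrated with a small cycle-by-cycle model. This is a simplified sketch: the function name and the per-cycle "busy ports" schedule are assumptions for illustration, and real arbitration involves virtual channel and switch allocation that is not modeled here.

```python
def forward_at_branch(requested, busy_schedule):
    """Serve a multicast flit's output ports cycle by cycle.

    requested: set of output ports named by the tree table entry.
    busy_schedule: list of sets, the ports that are blocked in each cycle.
    Returns the ports granted in each cycle until every port is served.
    """
    pending = set(requested)
    grants = []
    for busy in busy_schedule:
        granted = pending - busy  # unblocked branches proceed independently
        grants.append(sorted(granted))
        pending -= granted
        if not pending:  # only now can the input buffer be released
            break
    return grants


# South is blocked for two cycles; East still proceeds in the first cycle.
grants = forward_at_branch({"East", "South"}, [{"South"}, {"South"}, set()])
```

Here the flit occupies its buffer for three cycles even though the East branch left in the first one, which is the longer buffer occupancy noted on the slide.
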
VCTM Overhead
Virtual Circuit Tree Routing Tables
  Number of Entries   Area (mm^2)   Energy (nJ)
  512                 0.024         0.002
  1024                0.041         0.002
  2048                0.078         0.003
Destination Set CAMs
  Number of Entries   Area (mm^2)   Energy (nJ)
  32                  0.018         0.007
  64                  0.021         0.010
  128                 0.029         0.017
Access Time < 1 cycle

Multicasting Scenarios (1)
Token Coherence [Martin, 2003]
  TokenB: Broadcast for tokens
    1 Token to read
    All Tokens to write
SGI Origin Directory Protocol [Laudon, 1997]
  Multicast invalidate requests
Opteron Protocol [Conway, 2007]
  Coherence requests sent to ordering point and broadcast to all cores
    Some filtering of destinations

Multicasting Scenarios (2)
Region Multicasting
  Two level protocol
    1st level: Multicast to sharers of address region
    2nd level: Fall back on directory when no region information available
TRIPS [Sankaralingam, 2003]
  Operand network
  Multicast results of instructions to tiles containing dependent instructions
    35% of dynamic instructions have 2 or more future uses

Multicasting Scenarios (3)
Uncorq [Strauss, 2007]
  Unordered broadcast, ordered response network
Virtual Hierarchies [Marty, 2007]
  1st level directory
  2nd level global broadcast
Dynamic NUCA caches [Kim, 2002]
  Multicast for cache hit

Characterizing Multicasts
Up to 13% of traffic is multicast
Unique Destination Sets: combination of destinations in multicast
[Figure: Number of Unique Destination Sets for Token, Opteron, Directory, TRIPS, Region]
  Token: 1 destination set for each node
  Region and Directory: much larger variety of destination sets
[Figure: Number of Destinations per multicast (bins 1, 2, 3-6, 7-10, 11-14, 15-16), shown as 0% to 100% of multicasts for Token, Opteron, Directory, TRIPS, Region; label: Multicast Coverage]
  TRIPS and Directory: small destination sets
  TokenB and Opteron: large destination sets
  Region Multicast: wide variety of sizes of destination sets
VCTM is an inexpensive solution to support multicasting

Simulation Methodology
Network traffic from 5 different scenarios
Detailed network simulator
  Cycle-accurate modeling of router stages
Flexible, lightweight VCTM mechanism provides improvement for diverse scenarios
  Many more results in paper

Network Configuration
  Topology                4-ary 2-mesh; 5-ary 2-mesh (TRIPS)
  Routing                 Dimension Order: X-Y Routing
  Channel Width           16 Bytes
  Packet Size             1 flit (Coherence request = Address + Command); 5 flits (Data); 3 flits (TRIPS)
  Virtual Channels        4
  Buffers per port        24
  Router ports            5
  Virtual Circuit Trees   Varied from 16 to 4K (1 to 256 VCTs/core)

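As a quick consistency check on the configuration, the sketch below derives the node count, the data packet size in bytes, and the total tree count range. It assumes that one flit is as wide as the channel (16 bytes) and that the 16 to 4K range refers to the 4-ary 2-mesh; the function and parameter names are illustrative.

```python
def derived_parameters(k=4, n=2, channel_bytes=16, data_flits=5,
                       trees_per_core=(1, 256)):
    """Derive a few quantities from the k-ary n-mesh configuration."""
    nodes = k ** n  # a k-ary n-mesh has k^n nodes
    return {
        "nodes": nodes,
        # Assumes flit width equals channel width.
        "data_packet_bytes": data_flits * channel_bytes,
        "total_trees": tuple(t * nodes for t in trees_per_core),
    }


params = derived_parameters()
```

With the defaults this gives 16 nodes, 80-byte data packets, and 16 to 4096 trees in total, matching the "16 to 4K (1 to 256 VCTs/core)" row.
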
Power Savings
On-chip networks consume up to ~36% of chip power [Wang, 2002]
  Links, buffers and crossbars consume nearly 100% of network power
Power saved through activity reduction
[Figure: % Usage Reduction (0 to 60) of Link, Buffer, and Crossbar activity for Directory, TokenB, Region, TRIPS, and Opteron; legend entries: 16 and 4096]

Performance Results Summary
[Figure: Normalized Interconnect Latency (0 to 1.2) vs. Number of Virtual Circuit Trees (0, 16, 32, 64, 128, 512, 2048, 4096); series: Directory, TRIPS, Opteron, Region Multicast, TokenB; callouts: SPECweb: 12%, Art: 55%, TPC-H: 68%]
Small number of trees required for majority of benefit
Performance improvement depends on network pressure


VCTM vs. Aggressive Network
[Figure: Interconnect Latency (0 to 20) for Opteron (FMM), TRIPS (bzip2), Region (TPC-H), Token (Barnes), Dir (SPECweb); bars: Wide Injection Port + Infinite VCs + Adaptive Routing vs. VCTM w/ 512 Trees]
VCTM outperforms aggressive (unrealistic) network

VCTM Summary (1)
Improves performance across a variety of scenarios
  Reduces interconnect latency by up to 90%
  Reduces switching activity by up to 53%
Small number of trees necessary
  8 trees/core achieves substantial benefit
  Dynamic table partitioning could further reduce total tree storage

VCTM Summary (2)
Outperforms aggressive router
No impact on unicast performance
Integrates with existing state-of-the-art router architecture
Easily extendable to more scalable topologies and routing algorithms
Opens door for new optimizations

Thank you
Questions