On-chip RF Interconnect for NoC Applications


RF-Interconnect for Communications On-Chip
Frank Chang¹, Jason Cong², Adam Kaplan², Mishali Naik², Glenn Reinman², Eran Socher¹, Rocco Tam¹
¹ Department of Electrical Engineering
² Department of Computer Science
Current Trend in CMP - NoC
ISSCC 2007: An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS (Sriram Vangal et al., Intel)
• 65nm CMOS 80-tile NoC
• 10x8 2D mesh network-on-chip running @ 4GHz
• Bisection bandwidth 256GB/s
• 1 TFLOPS @ 1V, about 98W
What is The Challenge?
• Cores would keep shrinking in size but
maintain the same operation frequency
(2~4GHz) due to thermal constraints
• More cores would be integrated on the
same chip to achieve performance boost
through parallelism
• Performance would be limited by the
communication efficiency between cores
and memories on- and off-chip
The Scaling Trend
• Scaling reduces delay of logic gates but not wires
[Figure: Transistor and Wire Delay Trend in CMOS. Delay [ps] (0 to 100) versus technology node (180nm, 130nm, 90nm, 65nm, 45nm, 32nm) for FO4 gate delay, a 1mm RC global wire, and a repeated 1mm RC global wire.]
Traditional Interconnect
• Units communicate through a parallel bus using
voltage signaling (charging and discharging the wire
capacitance)
• Latency is RC limited (~L², where L is the wire length)
• Using CMOS repeaters reduces latency (~L) but does
not benefit from scaling
• Supply no longer scales due to leakage
• Baseband-only signaling requires extensive
equalization
• Wastes the broad bandwidth available from modern
CMOS devices (fT > 150GHz, fmax > 250GHz)
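The ~L² versus ~L latency scaling can be illustrated with a first-order distributed-RC model. This is a minimal sketch: the per-mm resistance and capacitance and the repeater delay are illustrative assumptions, not values from the slides.

```python
# Per-unit-length wire parameters: illustrative assumptions, not slide data.
R_PER_MM = 200.0          # ohm/mm
C_PER_MM = 0.2e-12        # F/mm
REPEATER_DELAY = 20e-12   # s, assumed delay of one repeater stage

def unrepeated_delay(length_mm: float) -> float:
    """50% delay of a distributed RC line, ~0.38 * R * C * L^2 (quadratic in L)."""
    return 0.38 * R_PER_MM * C_PER_MM * length_mm ** 2

def repeated_delay(length_mm: float, segment_mm: float = 1.0) -> float:
    """Wire split into equal segments with one repeater each (linear in L)."""
    n = max(1, round(length_mm / segment_mm))
    seg = length_mm / n
    return n * (unrepeated_delay(seg) + REPEATER_DELAY)

for length in (5, 10, 20):
    print(length, "mm:", unrepeated_delay(length), "s unrepeated,",
          repeated_delay(length), "s repeated")
```

Doubling the length quadruples the unrepeated delay but only doubles the repeated one, which is the trade-off the bullets describe.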
[Figure: signal power versus frequency, contrasting the baseband signal with the available bandwidth up to fT.]
Major Interconnect Issues
• Latency is large across chip
• Bandwidth is RC limited (~1Gbps/wire)
• Communication pattern is fixed
• Energy consumption is high and not scalable (~10pJ/bit)
• Future microprocessors may encounter
communication congestion and most of the
energy will be spent on “talking” instead of
computing
How Can RF Help?
• EM waves travel at the (effective) speed
of light (~10ps/mm)
• Carrier frequencies can be modulated by
modern CMOS with high data rates
• Transmission lines on- or off-chip can
guide the waves (RF modulated data)
from the transmitter to receiver with
recoverable attenuation
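As a rough check on the time-of-flight figure, the delay per mm follows from the effective permittivity of the line. This is a sketch only: the permittivity value 3.9 (roughly that of SiO2) is an assumption, and the slide's ~10ps/mm should be read as an order-of-magnitude number.

```python
import math

C0_MM_PER_PS = 0.299792458  # speed of light in vacuum, in mm per ps

def tline_delay_ps(length_mm: float, eps_eff: float) -> float:
    """Time of flight on a line with effective relative permittivity eps_eff."""
    return length_mm * math.sqrt(eps_eff) / C0_MM_PER_PS

# Assumed eps_eff of 3.9; a real on-chip line may differ.
print(tline_delay_ps(1.0, 3.9), "ps for 1 mm")
print(tline_delay_ps(20.0, 3.9), "ps for a 2 cm link")
```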
RF-Interconnect Concept
[Figure: RF-I link block diagram: datain, mixer at carrier f0, transmission line, mixer, LPF, output buffer, dataout. Spectrum insets plot signal power versus frequency for the input data, the transmitted signal around f0, and the recovered data.]
• Data travels through the transmission line at the speed of light, with
less dispersion across the band and less baseband interference
• Data rate is limited only by the CMOS mixer modulation speed
RF-I using Multi-band FDMA
• More bands are used with the same modulation speed at each band
• Higher aggregate data rates can be achieved on the same transmission line
[Figure: four data streams (Data1 to Data4) are up-converted by mixers to carriers f1 to f4, share one transmission line, and are recovered by per-band mixers, LPFs, and output buffers. Spectrum insets plot signal power versus frequency.]
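The multi-band idea can be sketched numerically: several bit streams are mixed onto different carriers, summed on one line, and separated again by per-band mixers and low-pass filtering. This is a minimal sketch under stated assumptions: an ideal lossless line, perfectly synchronized carriers, bipolar (+1/-1) mixing, carriers at integer multiples of the bit rate, and an integrate-over-one-bit filter standing in for the LPF. The function names are hypothetical.

```python
import math

SAMPLES_PER_BIT = 64  # assumed oversampling of one bit period

def fdma_transmit(streams):
    """streams: {carrier cycles per bit: [bits]} -> summed line signal."""
    n_bits = len(next(iter(streams.values())))
    signal = [0.0] * (n_bits * SAMPLES_PER_BIT)
    for cycles, bits in streams.items():
        for i in range(len(signal)):
            level = 1.0 if bits[i // SAMPLES_PER_BIT] else -1.0  # NRZ baseband
            phase = 2 * math.pi * cycles * i / SAMPLES_PER_BIT
            signal[i] += level * math.cos(phase)  # up-conversion mixer
    return signal

def fdma_receive(signal, cycles):
    """Mix with one band's carrier, then integrate over each bit (ideal LPF)."""
    bits = []
    for start in range(0, len(signal), SAMPLES_PER_BIT):
        acc = 0.0
        for i in range(start, start + SAMPLES_PER_BIT):
            phase = 2 * math.pi * cycles * i / SAMPLES_PER_BIT
            acc += signal[i] * math.cos(phase)  # down-conversion mixer
        bits.append(1 if acc > 0 else 0)
    return bits

# Three bands sharing one line; keys are carrier cycles per bit period.
data = {4: [1, 0, 1, 1, 0], 8: [0, 0, 1, 0, 1], 12: [1, 1, 0, 0, 1]}
line = fdma_transmit(data)
recovered = {band: fdma_receive(line, band) for band in data}
print(recovered == data)
```

Each band keeps the same bit rate, so the aggregate rate on the line grows with the number of bands.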
3.6Gbps Multi-drop Multiband Bi-directional RF-I *
* World’s 1st Multiband RF-I, Ko & Chang, 2005 ISSCC
Can We Implement RF-I in
CMOS?
• Today’s RF-CMOS circuits are in the
wireless communication “sweet spots” of
500MHz-5GHz
– Insufficient bandwidth for RF-I to be
effective!
• Millimeter-wave CMOS circuits have
been developed for 60GHz and recently
for 324 GHz bands
CMOS 324GHz Generator

• -76dBm before calibration
• -46dBm after calibration
*Huang, Larocca and Chang, “324GHz CMOS Frequency Generator using
Linear Superposition Technique,” pp. 476-477, 2008 ISSCC
Frequency Generation in Multiband RF-Interconnect
[Figure: a multi-band synthesizer supplies carriers f1 = 10GHz, f2 = 20GHz, f3 = 30GHz, f4 = 40GHz, f5 = 50GHz, and f6 = 60GHz to six TX and six RX mixers (Data1 to Data6) sharing one transmission line. The signal spectrum shows bands from 10GHz to 60GHz.]
Simultaneous Sub-harmonic Injection Locked
mm-Wave Frequency Generation
• Sub-harmonic injection-locked VCOs lock simultaneously to a single
reference frequency
• Advantages:
– Eliminates PLLs
– Low power consumption
– Small area
[Figure: block diagram with a master VCO, a non-linear harmonic generator, and slave VCOs.]
Sub-harmonic Injection Locked
VCO*
[Figure: VCO schematic: 1V supply (VCC), transistors M1 to M4, bias currents Ibias,VCO and Ibias,inj, differential injection input Vinj at frequency f, and buffered outputs at 3f.]
• LC-based VCO core
• Differential pair for odd harmonic generation
• Single-ended even harmonic generation
• Injection locking to high harmonic within locking range of the VCO
This Work*
Process: 90nm CMOS
Free-running frequency: 29.3 GHz
Max locking range: 5.6 GHz
Locking harmonics: 2nd, 4th, 6th, 8th; 3rd, 5th, 7th
Power: 4 mW
*Sai-Wang Tam, M.-C. Frank Chang, et al., "Simultaneous Sub-harmonic Injection-Locked mm-Wave Frequency
Generators for Multi-band Communications in CMOS", IEEE RFIC Symp., 2008
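The locking condition can be sketched with the numbers reported for this VCO: a reference locks on its n-th harmonic when n times the reference frequency falls inside the VCO locking range. This is a minimal sketch under two assumptions: the 5.6 GHz range is centered on the 29.3 GHz free-running frequency, and the same range applies to every harmonic. In practice the range depends on harmonic number and injection strength.

```python
FREE_RUNNING_GHZ = 29.3            # reported free-running frequency
LOCK_RANGE_GHZ = 5.6               # reported maximum locking range
HARMONICS = [2, 3, 4, 5, 6, 7, 8]  # reported locking harmonics

def reference_window(n: int):
    """Reference frequencies (GHz) whose n-th harmonic lies in the lock range.

    Assumes the lock range is centered on the free-running frequency.
    """
    lo = FREE_RUNNING_GHZ - LOCK_RANGE_GHZ / 2
    hi = FREE_RUNNING_GHZ + LOCK_RANGE_GHZ / 2
    return lo / n, hi / n

def locks(reference_ghz: float, n: int) -> bool:
    """True if the n-th harmonic of the reference can lock this VCO."""
    lo, hi = reference_window(n)
    return lo <= reference_ghz <= hi

for n in HARMONICS:
    print(n, reference_window(n))
```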
RF-I using Amplitude-Shift Keying (ASK) Modulation
• TX: A transformer couples the VCO output to a simple ASK modulator,
which generates the ASK RF signal.
• RX: A self-mixer performs envelope detection. A simple buffer and
Schmitt trigger then restore the signal to rail-to-rail swing.
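The ASK transmit and receive chain can be sketched in discrete time. This is a sketch under stated assumptions: on-off keying, an ideal channel, a moving average standing in for the low-pass behaviour after the self-mixer, and arbitrary Schmitt-trigger thresholds. The 12 carrier cycles per bit correspond to the 60GHz carrier and 5Gbit/s data rate reported in this deck. All function names are hypothetical.

```python
import math

SAMPLES_PER_BIT = 96   # assumed oversampling
CYCLES_PER_BIT = 12    # 60 GHz carrier / 5 Gbit/s data

def ask_modulate(bits):
    """On-off keying: the carrier is passed for a 1 and blocked for a 0."""
    return [
        (1.0 if bits[i // SAMPLES_PER_BIT] else 0.0)
        * math.cos(2 * math.pi * CYCLES_PER_BIT * i / SAMPLES_PER_BIT)
        for i in range(len(bits) * SAMPLES_PER_BIT)
    ]

def envelope_detect(signal, window=8):
    """Self-mixing (squaring) followed by a moving-average low-pass filter."""
    squared = [s * s for s in signal]
    out, acc = [], 0.0
    for i, v in enumerate(squared):
        acc += v
        if i >= window:
            acc -= squared[i - window]
        out.append(acc / window)
    return out

def schmitt_trigger(samples, low=0.15, high=0.35):
    """Hysteresis comparator that restores a full-swing 0/1 output."""
    state, out = 0, []
    for v in samples:
        if state == 0 and v > high:
            state = 1
        elif state == 1 and v < low:
            state = 0
        out.append(state)
    return out

def ask_link(bits):
    levels = schmitt_trigger(envelope_detect(ask_modulate(bits)))
    # Sample each bit at the end of its period, after the filter settles.
    return [levels[(k + 1) * SAMPLES_PER_BIT - 1] for k in range(len(bits))]

print(ask_link([1, 0, 1, 1, 0, 0, 1]))
```

The receiver needs no carrier phase or clock recovery here because squaring removes the carrier phase, which matches the point that ASK RF-I avoids synchronization circuits.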
Differential Transmission Line (TML)
• Loss of 0.6-1.6 dB/mm
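The per-mm loss translates into total attenuation and received power fraction as follows. This is a small sketch; the link lengths used in the example are illustrative assumptions.

```python
def line_loss_db(length_mm: float, loss_db_per_mm: float) -> float:
    """Total attenuation in dB for a line of the given length."""
    return length_mm * loss_db_per_mm

def received_power_fraction(length_mm: float, loss_db_per_mm: float) -> float:
    """Fraction of transmitted power that reaches the receiver."""
    return 10 ** (-line_loss_db(length_mm, loss_db_per_mm) / 10)

# Best and worst case of the quoted 0.6-1.6 dB/mm range, assumed 5 mm link.
for loss in (0.6, 1.6):
    print(loss, "dB/mm:", line_loss_db(5.0, loss), "dB,",
          received_power_fraction(5.0, loss))
```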
RF-I using Amplitude-Shift Keying (ASK) Modulation
[Figure: waveforms of the 5Gbit/s data input, the 60GHz VCO output, the ASK-modulated signal, and the mixer output.]
3DIC ASK RF-I Tested at 11Gbps*
[Figure: die photo showing TX in layer 2, RX in layer 1, and the coupling capacitor; output eye diagram; output versus input waveforms (scale labels: 10ps/div, 50mV/div, 500ps/div).]
• Energy per bit: 0.33pJ/bit
*Gu and Chang, pp. 448-449, 2007 ISSCC
Single Channel ASK RF-I
Performance Summary
• Simple Architecture:
One TX VCO, One
Mixer, One RX Buffer
• No synchronization
circuits such as PLL or
clock data recovery
needed in ASK RF-I
• Can expand the same
architecture to multiband RF-I
Process: IBM 90nm CMOS digital process
RF carrier frequency: 60GHz
Data rate: 5Gbit/s
Power: TX 2mW, RX 3mW
Energy per bit: 1pJ/bit
Active area: 1300 µm²
Future Trends in Multi-band
ASK RF-I
[Figure: two plots versus technology node (90nm to 22nm): scaling in energy per bit (pJ/bit) and TX/RX area per Gbit (µm²). Legend entries: RF-I, Bus.]
Technology  Carriers     Rate/carrier  Total rate/wire  Power  Energy/bit  Area TX+RX  Area/Gbit
                         (Gb/s)        (Gb/s)           (mW)   (pJ)        (mm²)       (µm²/Gbit)
90nm        3 RF + 1 BB  5             20               20     1.00        0.022       1100
65nm        4 RF + 1 BB  6             30               25     0.83        0.0238      800
45nm        5 RF + 1 BB  7             42               30     0.71        0.0228      540
32nm        6 RF + 1 BB  8             56               35     0.63        0.0211      380
22nm        7 RF + 1 BB  9             72               40     0.56        0.0193      260
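The derived columns of the scaling table can be recomputed from the primary ones. This is a sketch assuming the baseband (BB) channel runs at the same rate as each RF carrier. That assumption reproduces the table's total data rate and energy per bit, and comes close to its Area/Gbit column.

```python
# (node, RF carriers, rate per carrier in Gb/s, power in mW, TX+RX area in mm^2)
ROWS = [
    ("90nm", 3, 5, 20, 0.0220),
    ("65nm", 4, 6, 25, 0.0238),
    ("45nm", 5, 7, 30, 0.0228),
    ("32nm", 6, 8, 35, 0.0211),
    ("22nm", 7, 9, 40, 0.0193),
]

def derived(rf_carriers, rate_gbps, power_mw, area_mm2):
    """Return (total Gb/s, energy per bit in pJ, area per Gbit in um^2)."""
    total_gbps = (rf_carriers + 1) * rate_gbps        # RF bands plus one baseband
    energy_pj = power_mw / total_gbps                 # mW / (Gb/s) = pJ/bit
    area_um2_per_gbit = area_mm2 * 1e6 / total_gbps   # mm^2 -> um^2
    return total_gbps, energy_pj, area_um2_per_gbit

for node, *params in ROWS:
    print(node, derived(*params))
```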
Interconnect Topology Comparison
[Figure: three plots versus technology node (90nm, 65nm, 45nm, 32nm, 22nm) for Bus, RF-I, and Optical-I: 2cm interconnect latency [ps], 2cm interconnect data rate density [Gbps/um], and 2cm interconnect energy [pJ/bit].]
• Comparison across process technology of…
– Traditional RC parallel bus
– RF-Interconnect
– Optical Interconnect
• As process technology scales toward 22nm…
– RF-I has lowest latency
– RF-I consumes least energy
– RF-I has highest data rate density
• RF-I is fully compatible with modern CMOS technology
Advantages of RF-Interconnects
• Latency
• Bandwidth
• Energy
• Reconfigurability
Example: RF-I for CMP NoC Design
[Figure: 10x10 mesh floorplan. Legend: R (square) = router, C (circle) = processor core, $ (diamond) = L2 cache bank, + (plus) = main memory interface.]
• 10x10 mesh of 5-cycle
pipelined routers
– NoC runs at 2GHz
– XY/YX routing
• 64 4GHz 3-wide processor
cores containing
– 8KB L1 Data Cache
– 8KB L1 Instruction Cache
• 32 L2 Cache Banks
– 256KB each
– Organized as shared
NUCA cache
• 4 Main Memory Interfaces
– Labeled with + in the figure
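Dimension-ordered routing on the mesh can be sketched as below. This is a minimal sketch of XY routing only: the (x, y) coordinate convention is an assumption, and the latency estimate counts router pipeline cycles while ignoring link traversal and contention.

```python
ROUTER_PIPELINE_CYCLES = 5  # from the slide: 5-cycle pipelined routers

def xy_route(src, dst):
    """Dimension-ordered routing: travel along X first, then along Y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def router_latency_cycles(src, dst):
    """Cycles spent in router pipelines along the XY path (links ignored)."""
    return len(xy_route(src, dst)) * ROUTER_PIPELINE_CYCLES

# Corner to corner on the 10x10 mesh visits 19 routers.
print(router_latency_cycles((0, 0), (9, 9)))
```

YX routing is the same walk with the two loops swapped.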
MORFIC: Mesh Overlaid with RF-InterConnect
• Shared Z-shaped RF waveguide
• Organized as 8 bidirectional
shortcut links
• Each direction of each shortcut
can transmit simultaneously over
shared medium
• Router A can send a flit to the other
router A, B to B, … H to H, in a
single cycle
• Router labeled X cannot directly
send to any router not labeled X
– E.g. Router B in upper left cannot
send to router E in upper right
directly
– However, B in upper left can send
to B in upper right, and then north to
E using normal mesh link
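The effect of shortcut links on hop count can be sketched with a breadth-first search over the mesh plus shortcut edges. This is a sketch under stated assumptions: the two shortcut endpoint pairs below are hypothetical placements (the real A to H positions are in the figure), and a shortcut is counted as one hop.

```python
from collections import deque

N = 10  # 10x10 mesh

# Hypothetical shortcut endpoints; the actual placement is in the figure.
SHORTCUTS = {"A": ((0, 0), (9, 9)), "B": ((0, 9), (9, 0))}

def neighbors(node, use_shortcuts):
    x, y = node
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx < N and 0 <= ny < N:
            yield (nx, ny)      # ordinary mesh link
    if use_shortcuts:
        for a, b in SHORTCUTS.values():
            if node == a:
                yield b         # RF-I shortcut
            elif node == b:
                yield a         # shortcuts are bidirectional

def min_hops(src, dst, use_shortcuts=True):
    """Breadth-first search for the minimum hop count."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in neighbors(node, use_shortcuts):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("destination unreachable")

print(min_hops((1, 0), (9, 8), use_shortcuts=False), min_hops((1, 0), (9, 8)))
```

The second result combines a mesh hop, a shortcut, and another mesh hop, like the B-to-E example in the bullets.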
[Figure: physical and logical organization of the shortcut links, with endpoint routers labeled A through H.]
MORFIC Results For 256B Total RF-I [HPCA’2008]
[Figure: two bar charts for 256B RF-I over the Splash-2 benchmarks fft, barnes, ocean, lu, water-n^2, water-sp, and radix: normalized run cycles and normalized average packet latency.]
• 256B RF-I consumes 0.18% silicon overhead on a 400mm² die
– RF-I components: 0.13%, router overhead: 0.05%
• Normalized Splash-2 execution time and average packet
latency results
– Normalized to the baseline mesh (baseline run cycles and latency = 1)
– Average 13% (max 18%) performance improvement
– Average 22% (max 24%) packet latency improvement
The Bad News …
Most Interconnect Optimization Techniques
May Not be Relevant …
• Performance-driven interconnect design based on distributed RC delay model
Jason Cong, Kwok-Shing Leung, and Dian Zhou, Design Automation Conference, 1993. Cited by 141
• Interconnect design for deep submicron ICs
J Cong, L He, KY Khoo, CK Koh, Z Pan, Proc. Int. Conf. on Computer Aided Design, 1997. Cited by 139
• Efficient algorithms for the minimum shortest path Steiner arborescence problem with applications to …
Jason Cong, Andrew B. Kahng, and Kwok-Shing Leung, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 1, January 1998. Cited by 127
• Buffer block planning for interconnect-driven floorplanning
J Cong, T Kong, DZ Pan, Proc. Int. Conf. Computer-Aided Design, 1999. Cited by 130
… (from Google Scholar)
Good News -- Plenty of New
Problems for Future PhD Students
• How many/which routers should be RF-enabled?
– How many RF-I ports should each router have?
• Dedicated or multiplexed with other ports?
• How much RF-I bandwidth to allocate?
– Total? Per communicating pair?
– Impacts active layer area consumed by RF-I components
• Which routing strategy to employ in presence of RF-I express
channels?
• Dynamic or static allocation of frequency bands to
sources/destinations
– Dynamic: requires arbitration overhead for channel
assignment
– Static: may miss opportunity to match changing
communication demand
• Support of multi-cast
Example:
Deadlock: To Avoid or Confront?
• South-Last Strategy [Ogras and Marculescu, 2006]
– Routes which can lead to circular buffer
dependence are forbidden, which avoids deadlock
• Deadlock Detection & Recovery (DDR)
– Based on Duato and Pinkston’s theory [Duato and
Pinkston 2001]
• If deadlock occurs, route all packets in the network
on a spare virtual channel
• Use deadlock-free XY-routing
• Packets entering network after this point may be
routed normally
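The detection step can be pictured as a search for a circular wait among channels. The sketch below is a plain depth-first cycle search over a wait-for graph, not the mechanism from the cited work. The graph representation and the function name are hypothetical, and a hardware implementation would more likely rely on timeouts or similar heuristics.

```python
def find_deadlock(waits_for):
    """Return one cycle of channels as a list, or None if there is no cycle.

    waits_for maps a channel to the channels its head packet is waiting on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    stack = []

    def visit(ch):
        color[ch] = GRAY
        stack.append(ch)
        for nxt in waits_for.get(ch, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:                     # back edge: circular wait
                return stack[stack.index(nxt):]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        stack.pop()
        color[ch] = BLACK
        return None

    for ch in list(waits_for):
        if color.get(ch, WHITE) == WHITE:
            cycle = visit(ch)
            if cycle:
                return cycle
    return None

print(find_deadlock({"a": ["b"], "b": ["c"], "c": ["a"]}))
```

Once a cycle is reported, recovery as described above would move packets onto the spare virtual channel with XY routing.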
Deadlock Results
– South-Last strategy too restrictive
• Halves the average realizable performance
– Deadlock is best detected and recovered from when it occurs
• Detection happens reasonably quickly
• Performance during recovery no worse than baseline
Example: RF-I Topology and
Bandwidth Optimization
• For each channel
– Source and destination may be reconfigured via frequency-band reassignment
• Can assign variable # of channels to each (source, destination) pair (s,d)
– Critical channels given more bandwidth
• A flexible means to reconfigure topology
[Figure: physical and logical views of the RF-I channels, with endpoints labeled A and B.]
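One way to picture giving critical pairs more bandwidth is a greedy allocator that hands each frequency band to the pair with the highest traffic per allocated channel. This heuristic is a hypothetical illustration, not the allocation policy from the talk. The traffic numbers and names are made up for the example.

```python
import heapq

def allocate_channels(traffic, total_channels):
    """Greedy allocation of channels to (source, destination) pairs.

    traffic maps (source, destination) to a message count. Each channel goes
    to the pair with the largest load divided by (channels held + 1).
    """
    alloc = {pair: 0 for pair in traffic}
    heap = [(-load, pair) for pair, load in traffic.items()]  # max-heap via negation
    heapq.heapify(heap)
    for _ in range(total_channels):
        if not heap:
            break
        _, pair = heapq.heappop(heap)
        alloc[pair] += 1
        heapq.heappush(heap, (-traffic[pair] / (alloc[pair] + 1), pair))
    return alloc

# Made-up traffic profile for three communicating pairs and four bands.
demo = {("A", "B"): 1000, ("C", "D"): 300, ("E", "F"): 100}
print(allocate_channels(demo, 4))
```

Re-running the allocator on a new traffic profile corresponds to reconfiguring the logical topology by frequency-band reassignment.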
Variance In Communication Patterns
[Figure: message count (# msgs) by Manhattan distance (1 to 14) for waterspatial and mpeg2enc traffic.]
[Figure: time-varying behavior of WaterSpatial and Mpeg2Enc: event count (log scale) per interval (250k cycles) for L2 ACCESS, NW INJECT, BW STALL, and FLITS SENT.]
Conclusions
• RF-I on CMOS is real
• RF-I is a very promising solution to the global
interconnect bottleneck
– Latency
– Bandwidth
– Energy
– Reconfigurability
• RF-I introduces many interesting physical and
architecture design problems in NoC designs