Design of a Scalable Nanophotonic
Interconnect for Future Multicores
Avinash K. Kodi and Randy W. Morris, Jr.
Department of Electrical Engineering and Computer Science
Ohio University, Athens, OH 45701
E-mail: [email protected], [email protected]
ACM/IEEE Symposium on Architectures for Networking and Communications
Systems, Princeton, New Jersey
October 19-20, 2009
Talk Outline
• Section I: Motivation & Background
• Section II: PROPEL Architecture
• Section III: E-PROPEL Architecture
• Section IV: Performance Analysis
• Section V: Conclusion
Chip Multi-Processor
Multicores have arrived
- Future processors will comprise 100s to 1000s of cores
- Intel Tera-FLOPS, 80 cores, 65 nm, 2007 [1]
- SPARC processor, 16 cores, 65 nm, 2008 [2]
- IBM Cell processor, 8 cores, 90 nm, 2004 [3]
[1] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, "A 5-GHz mesh interconnect for a teraFLOPS processor," IEEE Micro, pp. 51-61, September/October 2007.
[2] G. Konstadinidis et al., "Architecture and physical implementation of a third generation 65 nm, 16 cores, 32 thread chip-multithreading SPARC processor," IEEE Journal of Solid-State Circuits, no. 1, p. 717, January 2009.
[3] The Cell project at IBM Research, http://www.research.ibm.com/cell/home.html
Network-on-Chip (NoC)
[Figure: 4 × 4 mesh of cores, Core (0,0) through Core (3,3), each attached to a router. Router blocks: Route Computation (RC), Virtual Channel (VC), Switch Allocator (SA), and a Crossbar Switch with +X, -X, +Y, -Y input and output ports, link credits in/out, and a port to the processing core.]
- Overcomes the problems of scalability and wire delay
Power Dissipation
Recent NSF-sponsored workshop on On-Chip Interconnection Networks [1]:
• Power consumption of NoCs implemented with current techniques exceeds expected needs by a factor of 10.
Tile Power: Intel Tera-FLOPS (65 nm) [2]
- Dual FPMACs 36%
- Router & Links 28%
- IMEM + DMEM 21%
- Clock Distribution 11%
- 10-port RF 4%
Potential Solutions
- Nanophotonics
- Wireless/RF
- 3D stacking
[1] J. D. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler and L. S. Peh, "Research Challenges for On-Chip Interconnection Networks," IEEE Micro, vol. 27, no. 5, pp. 96-108, September-October 2007.
[2] Y. Hoskote, "A 5-GHz Mesh Interconnect for a Teraflops Processor," IEEE Computer Society, 2007, pp. 51-61.
Why use Nanophotonics?
• CMOS compatible
• Low Power (0.1 mW)
• Small Footprint (~10 µm)
• High Bandwidth (~10 Gbps)
• Low Latency (10.45 ps/mm)
1. Lipson, M., Compact Electro-Optic Modulators on a Silicon Chip, IEEE J. Sel. Top. Quant., Vol. 12, No. 6, Nov.-Dec. 2006, p. 1520-6.
2. M. Lipson, Guiding, Modulating and Emitting Light on Silicon - Challenges and Opportunities, IEEE Journal of Lightwave Technologies, Vol. 23,
No. 12, 12 December 2005 (invited).
Optical Interconnect
[Figure: optical link with an off-chip laser and on-chip components. Optical layer: modulator, transmission medium, photodetector. Electronics layer: buffer chain, TIA, limiting amplifier, driver for electronics.]
On-chip Modulator
- Mach-Zehnder modulator or Micro-Ring Resonator
Transmission Medium
- Free space or Waveguide (Polymer or Silicon)
Photodetectors
- GaAs, III-V materials, Ge-on-SOI (Silicon-on-Insulator)
Micro-ring Resonators
[Figure: micro-ring resonator with n+/p+/n+ doped regions between Input Port 0, Output Port 0 and Output Port 1, shown for ring bias VR = VOFF and VR = VON.]
Resonant wavelength (λ0):
$\lambda_0 \, m = n_{\mathrm{eff}} \cdot 2\pi R$
m: an integer
n_eff: effective refractive index
R: radius of the ring resonator
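The resonance condition on the micro-ring slide can be evaluated numerically. The sketch below only rearranges that condition for λ0; the function name and the example values (n_eff = 2.5, R = 5 µm) are illustrative assumptions and do not come from the slides.

```python
import math


def resonant_wavelength_nm(n_eff: float, radius_um: float, m: int) -> float:
    """Solve m * lambda_0 = n_eff * 2 * pi * R for lambda_0, in nanometres."""
    if m <= 0:
        raise ValueError("m must be a positive integer")
    circumference_nm = 2.0 * math.pi * radius_um * 1000.0  # um -> nm
    return n_eff * circumference_nm / m


if __name__ == "__main__":
    # Hypothetical ring: n_eff and radius are placeholders, not measured values.
    for order in (50, 51, 52):
        print(order, round(resonant_wavelength_nm(2.5, 5.0, order), 1))
```

Neighbouring values of m give neighbouring resonances of the same ring, which is why a single ring geometry can be matched to one wavelength of a multi-wavelength link.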
Electrical Interconnect
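[Figure: RC link. A wire with resistance R and capacitance C per unit length is driven in sections of length Lopt by inverters of size Sopt, each with resistance Rs, output capacitance Cp and input capacitance C0.]
R = wire resistance per unit length
C = wire capacitance per unit length
Cp = inverter output capacitance
C0 = inverter input capacitance
Rs = inverter resistance
Sopt = inverter size
Lopt = wire distance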
ITRS 2007 Transistor & Link Parameters
Electrical link device parameters for various VLSI technologies
Device                  90 nm   65 nm   45 nm   32 nm   22 nm
Vdd                     1.2     1.1     1       0.9     0.8
fclk                    3.088   4.7     5.875   7.344   9.18
R                       122     220     312     382     455
C                       170     165     160     155     150
Cp                      1       0.9     0.8     0.712   0.544
Co                      0.5     0.45    0.4     0.356   0.272
Rs                      1890    2200    3500    4700    6900
Sopt                    72.5    60.5    66.9    73.1    91.4
Lopt                    0.45    0.35    0.25    0.18    0.13
Ioffn (nA/micron)       50      70      100     150     220
Ishortckt (nA/micron)   65      100     100     100     100
• Wire delay increases with the RC time constant
• The Ioffn and Ishortckt current parameters increase
Waveguide & Receiver
WAVEGUIDE      Pitch (um)   Propagation Time (ps)   Optical Loss (dB/cm)
Si [1]         5.5          10.45                   1.3
Polymer [1]    20           4.93                    1.0
RECEIVER                   Power (mW/Gbps)   Area (mm2)
Si-CMOS-Amplifier [2]      1.1               0.02625
80 nm CMOS [3]             2.5               0.0625
SiGe BiCMOS [4]            24.5              1.07
[1] N. Kirman et al., "Leveraging Optical Technology in Future Bus-based Chip Multiprocessors," 39th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2006, pp. 492-50
[2] S. Koester et al., "Ge-on-SOI-Detector/Si-CMOS-Amplifier Receivers for High-Performance Optical-Communication Applications," Journal of Lightwave Technology, Vol. 25, No. 1, January 2007
[3] C. Kromer et al., "A 100-mW 4×10 Gb/s Transceiver in 80-nm CMOS for High-Density Optical Interconnects," IEEE Journal of Solid-State Circuits, Vol. 40, No. 12, December 2005
[4] D. Kuchta et al., "120-Gb/s VCSEL-based parallel-optical interconnect and custom 120-Gb/s testing station," Journal of Lightwave Technology, Vol. 22, No. 9, pp. 2200-2212, Sept. 2004
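To make the waveguide figures concrete, the sketch below scales them to a given link length. It assumes the propagation time in the waveguide table is per millimetre, as the "10.45 ps/mm" latency quoted on the "Why use Nanophotonics?" slide suggests; the dictionary and function names are illustrative only.

```python
# Waveguide parameters from the Waveguide & Receiver table.
# Assumption: propagation time is per mm (matches the 10.45 ps/mm quoted earlier).
WAVEGUIDES = {
    "Si": {"delay_ps_per_mm": 10.45, "loss_db_per_cm": 1.3},
    "Polymer": {"delay_ps_per_mm": 4.93, "loss_db_per_cm": 1.0},
}


def link_budget(kind: str, length_mm: float) -> tuple[float, float]:
    """Return (propagation delay in ps, optical loss in dB) for one waveguide."""
    params = WAVEGUIDES[kind]
    delay_ps = params["delay_ps_per_mm"] * length_mm
    loss_db = params["loss_db_per_cm"] * length_mm / 10.0  # mm -> cm
    return delay_ps, loss_db


if __name__ == "__main__":
    for kind in WAVEGUIDES:
        delay, loss = link_budget(kind, 5.0)  # 5 mm link, as in the comparison slide
        print(f"{kind}: {delay:.2f} ps, {loss:.2f} dB")
```

This covers waveguide propagation only; modulator, coupling and receiver contributions are not modelled here.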
Electrical/Optical Comparison
[Figure: power-delay product at various technology nodes for a 5 mm link.]
Optics becomes more advantageous at 52 nm for global and at 45 nm for semi-global interconnects
Critical Length
Critical length is the distance at which the optical link becomes more advantageous than the electrical link
[Figure: critical length compared with the core-to-core distance.]
Why PROPEL?
• Related Work
– Corona (ISCA 2008), Circuit-switch (IEEE Transactions 2008), Shared-bus (Micro 2006)
• Reduce hardware complexity
– Currently proposed nanophotonic networks use a large number of optical components
• Nanophotonics for communication (links) and electronics for switching
– No optical arbitration required
– Balance between cheaper electronics and more costly optics
• Scalable network design
Proposed Architecture: PROPEL
[Figure: 4 × 4 grid of tiles, Tile (0,0) through Tile (3,3), fed by an off-chip laser source. Each tile has a concentration of 4 (for example Tile (3,0)) with L2 caches. Tiles are linked by x-direction and y-direction waveguides carrying wavelengths λ0, λ1, λ2, …]
PROPEL's Routing & Wavelength Assignment (x-direction)
Notation: λa(b,c), where a = wavelength, b = destination tile, c = x-direction
[Figure: Tiles (3,0) to (3,3) along the x-direction with Home Channels 0 to 3.]
Tile (3,0): Cores 0-3
Tile (3,1): Cores 4-7
Tile (3,2): Cores 8-11
Tile (3,3): Cores 12-15
Home Channel 0 carries λ1(0,0) + λ2(0,0) + λ3(0,0)
Home Channel 1 carries λ0(1,0) + λ2(1,0) + λ3(1,0)
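The home-channel labels on the wavelength-assignment slide follow a simple pattern: the channel of destination tile b carries one wavelength per other tile in the row, and the wavelength index equals the sender's index. The sketch below reproduces those labels under that reading; it is an interpretation of the figure, and the function name is hypothetical.

```python
def home_channel_wavelengths(dst: int, tiles_per_row: int = 4) -> list[str]:
    """List the lambda_a(b,c) labels carried on the home channel of tile `dst`.

    Assumed reading of the slide: a = index of the sending tile,
    b = destination tile, c = 0 for the x-direction.
    """
    if not 0 <= dst < tiles_per_row:
        raise ValueError("destination tile out of range")
    return [f"λ{src}({dst},0)" for src in range(tiles_per_row) if src != dst]


if __name__ == "__main__":
    for channel in range(4):
        print(f"Home Channel {channel}:", " + ".join(home_channel_wavelengths(channel)))
```

Because every sender uses its own wavelength on a given home channel, senders in the same row do not contend optically, which fits the earlier point that no optical arbitration is required.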
Communication Example
- Tile (3,3) communicates with Tile (0,0)
[Figure, x-direction: Tile (3,3) (Cores 12-15), Tile (3,0) (Cores 0-3, Router (3,0)) and Tile (0,0) (Cores 48-51). Each tile has four cores, an L2 cache and a crossbar switch (Xbar) with ports X0, X1, X2, Y0, Y1, Y2; the laser feeds the waveguides. The x-direction link between Tile (3,3) and Tile (3,0) is highlighted.]
[Figure, y-direction: the same tiles with the y-direction link between Tile (3,0) and Tile (0,0) highlighted.]
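The communication example can be sketched as a two-step route: an x-direction hop to the tile in the destination column, then a y-direction hop to the destination tile. The core numbering below is inferred from the figures (Cores 0-3 in Tile (3,0), Cores 12-15 in Tile (3,3), Cores 48-51 in Tile (0,0)); function names and the 4 × 4 grid constants are assumptions for illustration, not the authors' implementation.

```python
CONCENTRATION = 4  # cores per tile, as on the PROPEL architecture slide
GRID = 4           # 4 x 4 tiles (64 cores), assumed from the figures

Tile = tuple[int, int]


def core_to_tile(core: int) -> Tile:
    """Map a core id to its tile, with Tile (3,0) holding cores 0-3."""
    tile_index = core // CONCENTRATION
    row = (GRID - 1) - tile_index // GRID
    col = tile_index % GRID
    return (row, col)


def route(src: Tile, dst: Tile) -> list[Tile]:
    """Return the tiles visited: x-direction hop first, then y-direction hop."""
    path = [src]
    if src[1] != dst[1]:
        path.append((src[0], dst[1]))  # x-direction: same row, destination column
    if path[-1][0] != dst[0]:
        path.append(dst)               # y-direction: same column, destination row
    return path


if __name__ == "__main__":
    print(route(core_to_tile(12), core_to_tile(48)))
```

With this ordering, any pair of tiles is at most two optical hops apart, with electronic switching only at the source, the intermediate tile and the destination.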
Need for E-PROPEL
• Related work
- Corona (ISCA 2008), Processor-DRAM (Hot Interconnects 2008), Firefly (ISCA 2009)
• Issues with the 256-core version of PROPEL
- Crossbar size (15×15), area (waveguides), power dissipation
• Advantages of E-PROPEL
- Non-blocking crossbar, multiple roots (fat tree), fewer components than PROPEL
E-PROPEL Design
Combine 4 PROPELs with nanophotonic crossbars
[Figure: four PROPEL clusters (Cluster 0 to Cluster 3) interconnected through eight non-blocking optical crossbars.]
RE-PROPEL: only the top and bottom tiles are connected
Crossbar Functionality
4-Input, 64-Wavelength AWG Crossbar
Input 0: λ(0)(0-15), λ(0)(16-31), λ(0)(32-47), λ(0)(48-63)
Input 1: λ(1)(0-15), λ(1)(16-31), λ(1)(32-47), λ(1)(48-63)
Input 2: λ(2)(0-15), λ(2)(16-31), λ(2)(32-47), λ(2)(48-63)
Input 3: λ(3)(0-15), λ(3)(16-31), λ(3)(32-47), λ(3)(48-63)
Output 0: λ(0)(0-15), λ(1)(16-31), λ(2)(32-47), λ(3)(48-63)
Output 1: λ(1)(0-15), λ(2)(16-31), λ(3)(32-47), λ(0)(48-63)
Output 2: λ(2)(0-15), λ(3)(16-31), λ(0)(32-47), λ(1)(48-63)
Output 3: λ(3)(0-15), λ(0)(16-31), λ(1)(32-47), λ(2)(48-63)
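The input/output listing on the crossbar slide is consistent with a cyclic rule: reading λ(i)(a-b) as wavelengths a to b arriving on Input i, a wavelength in group g = wavelength // 16 leaves on Output (i - g) mod 4. The sketch below encodes that rule as derived from the listing; the function name and parameter defaults are illustrative assumptions.

```python
def awg_output(input_port: int, wavelength: int, ports: int = 4, group: int = 16) -> int:
    """Output port for a wavelength entering the 4-input, 64-wavelength crossbar.

    Rule inferred from the slide: wavelengths are split into groups of 16,
    and group g on input i exits on output (i - g) mod ports.
    """
    if not 0 <= input_port < ports:
        raise ValueError("input port out of range")
    if not 0 <= wavelength < ports * group:
        raise ValueError("wavelength index out of range")
    return (input_port - wavelength // group) % ports


if __name__ == "__main__":
    for inp in range(4):
        outs = [awg_output(inp, g * 16) for g in range(4)]
        print(f"Input {inp}: groups 0-3 exit on outputs {outs}")
```

For any fixed wavelength, the four inputs map to four different outputs, so no two inputs collide on the same wavelength at the same output; this is the non-blocking property the E-PROPEL slides rely on.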
Nanophotonic Crossbar (single ring)
[Figure: single-ring crossbar connecting Input 0-3 (clusters 0-3) to Output 0-3 (clusters 0-3); the rings are labeled with the wavelength groups λ(0-15), λ(16-31), λ(32-47) and λ(48-63).]
Nanophotonic Crossbar (double ring)
[Figure: double-ring crossbar connecting Input 0-3 to Output 0-3; the rings are labeled with the wavelength groups λ(0-15), λ(16-31), λ(32-47) and λ(48-63).]
Performance Evaluation
• Optical & Electrical Component Comparison
• Synthetic Traffic
– Simulated with OPTISIM
– Uniform, Bit-reversal, Butterfly, Complement, Matrix transpose,
Perfect Shuffle
• SPLASH-2
– Traces collected on Simics with GEMS
– FFT, LU, Radiosity, Ocean, Raytrace, Radix, Water, FMM and
Barnes
• Network topologies
– Electrical: Mesh, Cmesh and Flattened-butterfly
– Optical: Circuit-switch, Shared-bus and Corona
Component Comparison: PROPEL
                        Shared-Bus   Circuit-Switch   Corona   PROPEL
Wavelengths             4            24               64       64
Waveguides              168          64               99       32
Micro-rings             2,688        16,576           72,192   3,072
Photodetectors          1,536        2,016            7,424    1,536
Power Loss (dB)         37           39.2             49.2     32.1
Optical Area (mm2)      16           49               64.6     17
Electrical Area (mm2)   60           55               195      50
PROPEL is the most cost-effective NoC
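The power-loss row of the PROPEL component comparison is in dB, so differences between networks translate into multiplicative factors in optical power. The sketch below converts the tabulated losses into a relative optical power requirement, under the assumption (not stated on the slides) that all networks need the same optical power at the detector; names are illustrative.

```python
# Power loss (dB) from the Component Comparison: PROPEL table.
POWER_LOSS_DB = {
    "Shared-Bus": 37.0,
    "Circuit-Switch": 39.2,
    "Corona": 49.2,
    "PROPEL": 32.1,
}


def db_to_ratio(loss_db: float) -> float:
    """Convert a loss in dB to the linear input/output power ratio."""
    return 10.0 ** (loss_db / 10.0)


def relative_source_power(network: str, baseline: str = "PROPEL") -> float:
    """Optical source power needed relative to the baseline network.

    Assumes identical required power at the photodetector for both networks.
    """
    return db_to_ratio(POWER_LOSS_DB[network] - POWER_LOSS_DB[baseline])


if __name__ == "__main__":
    for name in POWER_LOSS_DB:
        print(f"{name}: {relative_source_power(name):.1f}x PROPEL")
```

Under that assumption, a few dB of extra loss already costs a multiple of the source power, which is why the loss row matters alongside the component counts.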
Component Comparison: E-PROPEL
                        Corona      PROPEL   E-PROPEL   RE-PROPEL
Wavelengths             64          64       64         64
Waveguides              387         256      192        160
Micro-rings             1,081,344   28,672   19,968     16,128
Photodetectors          32,768      14,336   9,216      7,680
Power Loss (dB)         49          44       42         41
Optical Area (mm2)      337         181      96         85
Electrical Area (mm2)   860         395      280        240
Power Dissipation Evaluation
[Figure: router and optical link annotated with per-component power.]
- Modulator: 0.1 mW/Gb [4]
- Buffers: 8.06 mW [1]
- Xbar: 8.66 mW [2]
- Electrical Links: 44 mW [3]
- TIA/Amplifier: 1.1 mW/Gb [5]
[1, 2] B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu, "Express cube topologies for on-chip interconnects," in Proceedings of the 15th International Symposium on High Performance Computer Architecture, February 2009, pp. 163-174.
[3] Y. Pan, P. Kumar, J. Kim, G. Memik, Y. Zhang, and A. Choudhary, "Firefly: Illuminating future network-on-chip with nanophotonics," in Proceedings of the 36th Annual International Symposium on Computer Architecture, 2009.
[4] Q. Xu, S. Manipatruni, B. Schmidt, J. Shakya, and M. Lipson, "12.5 Gbit/s carrier-injection-based silicon micro-ring silicon modulators," Optics Express, vol. 15, no. 2, January 2007.
[5] S. J. Koester, C. L. Schow, L. Schares, and G. Dehlinger, "Ge-on-SOI-detector/Si-CMOS-amplifier receivers for high-performance optical-communication applications," Journal of Lightwave Technology, vol. 25, no. 1, pp. 46-57, January 2007.
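The per-component figures on the power dissipation slide can be combined into a rough per-link estimate. The sketch below adds the modulator and TIA/amplifier energy costs and scales them by data rate, reading "mW/Gb" as mW per Gbps; laser power and any ring tuning are left out because the slide gives no numbers for them. The function name is hypothetical.

```python
MODULATOR_MW_PER_GBPS = 0.1  # modulator, from the slide
RECEIVER_MW_PER_GBPS = 1.1   # TIA/amplifier, from the slide


def optical_link_power_mw(rate_gbps: float) -> float:
    """Modulator plus receiver power for one optical channel at the given rate.

    Excludes laser power and ring tuning, which the slide does not quantify.
    """
    if rate_gbps < 0:
        raise ValueError("data rate must be non-negative")
    return (MODULATOR_MW_PER_GBPS + RECEIVER_MW_PER_GBPS) * rate_gbps


if __name__ == "__main__":
    # 10 Gbps per wavelength, the bandwidth quoted on the nanophotonics slide.
    print(f"{optical_link_power_mw(10.0):.1f} mW per wavelength channel")
```

The result can be compared against the 44 mW quoted for the electrical links, keeping in mind that the two figures may not refer to the same link width or data rate.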
Uniform Traffic
[Figure: throughput (GBps) versus network load (0.1 to 0.9) for Mesh, Cmesh, Flattened-Butterfly, Circuit-switch, Shared-bus, Corona and PROPEL.]
[Figure: latency (ns) versus network load (0.1 to 0.9) for the same networks.]
Throughput
- 25% higher than Mesh
- More than 2× higher than Circuit-switch, Cmesh and Shared-bus
Throughput: Synthetic Traffic Traces
[Figure: throughput of Mesh, Cmesh, Flattened-Butterfly, Circuit-Switch, Shared-Bus, Corona and PROPEL for Uniform, Bit-Reversal, Butterfly, Complement, Matrix Transpose, Perfect Shuffle and Neighbor traffic.]
- 50% increase over Mesh for bit-reversal, matrix transpose, and perfect shuffle
Power Dissipation: Synthetic Traffic
[Figure: power of Mesh, Cmesh, Flattened-Butterfly, Circuit-Switch, Shared-Bus, Corona and PROPEL for Uniform, Bit-Reversal, Butterfly, Complement, Matrix Transpose, Perfect Shuffle and Neighbor traffic.]
- PROPEL decreases power consumption by a factor of 5
Splash-2 Speed-up
[Figure: speed-up of Mesh, Flattened-butterfly and PROPEL on FFT, LU, Radiosity, Ocean, Raytrace, Radix, Water, FMM and Barnes.]
- PROPEL speeds up LU, Ocean, Radix, Water, FMM and Barnes by a factor of 2
- FFT, Radiosity and Raytrace have a speed-up of about 1.5×
Splash-2 Power Dissipation
[Figure: power of Mesh, Flattened-butterfly and PROPEL on FFT, LU, Radiosity, Ocean, Raytrace, Radix, Water, FMM and Barnes.]
- PROPEL decreases power consumption by a factor of 10
E-PROPEL Throughput
[Figure: throughput of Mesh, PROPEL, E-PROPEL and RE-PROPEL for Uniform, Bit-Reversal, Butterfly, Complement, Matrix Transpose, Perfect Shuffle and Neighbor traffic.]
- E-PROPEL throughput is similar to PROPEL except for Uniform, Matrix Transpose, and Perfect Shuffle
- RE-PROPEL performs only slightly below E-PROPEL
- E-PROPEL improves performance by 2× over Mesh
E-PROPEL Power
[Figure: power of Mesh, PROPEL, E-PROPEL and RE-PROPEL for Uniform, Bit-Reversal, Butterfly, Complement, Matrix Transpose, Perfect Shuffle and Neighbor traffic.]
- E-PROPEL and RE-PROPEL reduce power dissipation by a factor of 3
Conclusion
• PROPEL and E-PROPEL are both low-power, high-bandwidth NoCs for future many-core processors
• PROPEL and E-PROPEL use electronics for packet switching and optics for inter-router communication, allowing for a reduction in electrical and optical components
• PROPEL and E-PROPEL outperform well-known network topologies while dissipating less power
• In future work, incorporate an adaptive routing technique to balance the load across the entire network
SPLASH-2 Setup
Application   Benchmark
FFT           16 K particles
LU            512 × 512 particles
Radiosity     Largeroom
Ocean         258 × 258
Radix         1 M integers
Water         512 molecules
FMM           16 K particles
Barnes        16 K particles
Simulation Parameters (Electrical)
Parameter                        Mesh    Cmesh   Flattened-Butterfly
Bisection Bandwidth (Tbps)       4.096   4.096   8.192
Router Size (xbar)               5×5     8×8     10×10
VCs (per Input)                  4       4       4
Electrical Channel Rate (Gbps)   256     256     256
Simulation Parameters (Optical)
Parameter                        Shared-Bus   Circuit-switch   Corona   PROPEL
Bisection Bandwidth (Tbps)       15.4         0.51             40.96    5.12
Router Size (xbar)               4×4          5×5              -        8×8
VCs (per Input)                  4            4                4        4
Electrical Channel Rate (Gbps)   64           256              -        256
Optical Channel Rate (Gbps)      240          128              2560     160