Transcript Document

ATAC: Multicore Processor with
On-Chip Optical Network
George Kurian, Jason E. Miller, James Psota, Jonathan Eastep, Jifeng Liu,
Jurgen Michel, Lionel C. Kimerling, Anant Agarwal
Massachusetts Institute of Technology
Cambridge, MA 02139
Presenter: Yeluo Chen
Background
• The number of transistors on a chip doubles roughly every two years (Moore's Law)
• Multicore processors will have 1000 cores or more within the next decade
• To keep improving performance, the challenges of scaling must be overcome
• ATAC addresses these challenges
“Moore’s Gap”
[Figure: performance (GOPS) vs. time, 1992–2010, log scale from 0.01 to 1000 GOPS. Successive techniques (pipelining, superscalar, out-of-order execution, SMT/FGMT/CGMT, multicore, tiled multicore) each close part of "the GOPS gap."]
• Diminishing returns from single-CPU mechanisms (pipelining, caching, etc.)
• Wire delays
• Power envelopes
Multicore Scaling Trends
Today:
• A few large cores on each chip
• Diminishing returns prevent cores from getting more complex
• Still some shared global structures: bus, L2 caches
Tomorrow:
• Only option for future scaling is to add more cores
• 100's to 1000's of simpler cores [S. Borkar, Intel, 2007]
• Simple cores are more power and area efficient
• Global structures do not scale; all resources must be distributed
[Figure: a bus-based multicore with a few large cores sharing a bus and L2 cache (today) vs. a tiled multicore of many simple processor+cache+switch tiles connected by a mesh (tomorrow).]
The Future of Multicore
• Number of cores doubles every 18 months
• Parallelism replaces clock-frequency scaling and core complexity
• Resulting challenges: scalability, programming, power
[Chip examples: MIT Raw, Sun UltraSPARC T2, IBM XCell 8i, Tilera TILE64]
Multicore Challenges
• Scalability
– How do we turn additional cores into additional performance?
– Must accelerate single apps, not just run more apps in parallel
– Efficient core-to-core communication is crucial
– Architectures must grow easily with each new technology generation
• Programming
– Traditional parallel programming techniques are hard
– Parallel machines were rare and used only by rocket scientists
– Multicores are ubiquitous and must be programmable by anyone
• Power
– Already a first-order design constraint
– More cores and more communication → more power
– Previous tricks (e.g., lowering Vdd) are running out of steam
Multicore Communication Today
Bus-based Interconnect
• Single shared resource
• Uniform communication cost
• Communication through memory
• Doesn't scale to many cores due to contention and long wires
• Scalable up to about 8 cores
[Figure: processors with private caches sharing a bus, an L2 cache, and DRAM.]
Multicore Communication Tomorrow
Point-to-Point Mesh Network
[Figure: tiled multicore; each tile contains a processor, memory, and switch connected to its neighbors, with DRAM controllers at the chip edges.]
• Examples: MIT Raw, Tilera TILEPro64, Intel Terascale Prototype
• Neighboring tiles are connected
• Distributed communication resources
• Non-uniform costs: latency depends on distance
• Encourages direct communication
• More energy efficient than a bus
• Scalable to hundreds of cores
ATAC Advantages
ATAC Processor Architecture:
• On-chip optical communication
• Wavelength Division Multiplexing (WDM)
– Simultaneously carries multiple independent signals on different wavelengths (e.g., 64 WDM channels ≈ a 64-bit electrical bus)
– Eliminates communication contention
• Low loss → less power
– No periodic repeaters required
• Eliminates multiple hops between cores at large scale
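
To make the contention point concrete, here is a minimal sketch (not from the slides) comparing the time to deliver one broadcast from each of N cores on a shared bus versus a WDM network; the per-broadcast slot time is an assumed parameter.

    # Toy model (not from the slides): a shared bus serializes broadcasts
    # through arbitration, while WDM gives each sender its own wavelength
    # so all broadcasts proceed in parallel. `slot_ns` is assumed.

    def bus_broadcast_time(n_cores, slot_ns=1.0):
        # One broadcast at a time on the shared medium.
        return n_cores * slot_ns

    def wdm_broadcast_time(n_cores, slot_ns=1.0):
        # One dedicated wavelength per sender: no contention.
        return slot_ns

    for n in (8, 64, 1024):
        print(f"{n:4d} cores: bus {bus_broadcast_time(n):6.0f} ns, "
              f"WDM {wdm_broadcast_time(n):3.0f} ns")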
ATAC Optical Building Blocks
• Light source (optical power supply)
– Generated by an off-chip laser (power ~1.5 W)
– Coupled into an on-chip waveguide
• Waveguide
– On-chip channel for light transmission (made of Si)
– Manufactured with a standard CMOS process
• Optical filter (ring resonator) & modulator
– Couples a specific wavelength from the power-supply waveguide to a data waveguide
– Translates an electrical signal into an optical signal
– Places the signal onto the waveguide
• Photodetector
– Absorbs photons and outputs an electrical signal
10
Optical Bit Transmission
• Putting it together: optical data transmission from one core to another
[Figure: an electrical bit is modulated onto the sender's wavelength, travels along the waveguide, and is filtered and detected at the receiver.]
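
A toy Python model of this transmit chain, under assumed wavelength values; the ring filter is modeled as an exact wavelength match and all device physics is abstracted away.

    # Toy model: a modulator imprints a bit onto the sender's wavelength;
    # a tuned ring filter plus photodetector recover it at the receiver.

    from dataclasses import dataclass

    @dataclass
    class OpticalSignal:
        wavelength_nm: float  # carrier wavelength (one per sender)
        bit: int              # payload bit riding on that wavelength

    def modulate(bit, wavelength_nm):
        # Ring modulator: electrical bit -> light on one wavelength.
        return OpticalSignal(wavelength_nm, bit)

    def filter_and_detect(waveguide, wavelength_nm):
        # Ring filter selects one wavelength; photodetector outputs the bit.
        for s in waveguide:
            if s.wavelength_nm == wavelength_nm:
                return s.bit
        return None

    # Two senders share one waveguide on different wavelengths (WDM).
    waveguide = [modulate(1, 1550.0), modulate(0, 1551.0)]
    assert filter_and_detect(waveguide, 1550.0) == 1
    assert filter_and_detect(waveguide, 1551.0) == 0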
ATAC Architecture
[Figure: a tiled multicore combining an electrical mesh interconnect (per-tile switches) with an optical broadcast WDM interconnect consisting of optical waveguides threading through all tiles.]
The 1000-Core ATAC
[Figure: 64 optically connected clusters; within each cluster, electrical networks (ENet, BNet) connect 16 cores (each with processor, cache, and directory) to an optical hub on the ONet.]
• A purely optical design scales to about 64 cores
– Global optical interconnect: ANet
• Beyond that, clusters of cores share optical hubs
– ENet and BNet move data to/from the optical hub
– Dedicated, special-purpose electrical networks
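
A minimal sketch of the clustered addressing this implies, assuming 1024 cores in 64 clusters of 16; the route() helper and its hop names are hypothetical.

    # Sketch (assumed layout): map a core id to its cluster and list the
    # legs of a message's path.

    CORES, CLUSTERS = 1024, 64
    CORES_PER_CLUSTER = CORES // CLUSTERS  # 16

    def cluster_of(core):
        return core // CORES_PER_CLUSTER

    def route(src, dst):
        # Intra-cluster traffic stays on the local electrical network;
        # inter-cluster traffic goes ENet -> optical hub -> ONet -> BNet.
        if cluster_of(src) == cluster_of(dst):
            return ["local electrical network"]
        return [f"ENet: core {src} -> hub {cluster_of(src)}",
                f"ONet: hub {cluster_of(src)} -> hub {cluster_of(dst)}",
                f"BNet: hub {cluster_of(dst)} -> core {dst}"]

    print(route(3, 900))   # crosses from cluster 0 to cluster 56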
Optical Broadcast Network
• Waveguide passes through every core
• Multiple wavelengths (WDM) eliminate contention
– Each core sends data on a different wavelength → no contention
• Signal reaches all cores in < 3 ns → low latency
• Same signal can be received by all cores
– Data is sent once; any or all cores can receive it → efficient broadcast
[Figure: optical waveguide looping through all cores.]
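
A back-of-envelope check of the < 3 ns claim; the serpentine waveguide length (~20 cm) and group index (~4) are assumptions, not slide data.

    # Propagation time of light along a serpentine waveguide that visits
    # every tile, under assumed length and group index.

    C = 3.0e8          # speed of light in vacuum, m/s
    GROUP_INDEX = 4.0  # assumed group index of the Si waveguide
    LENGTH_M = 0.20    # assumed total serpentine length

    t_prop_ns = LENGTH_M * GROUP_INDEX / C * 1e9
    print(f"propagation time ~ {t_prop_ns:.1f} ns")  # ~2.7 ns, under 3 ns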
Optical Broadcast Network (N cores)
• Electronic-photonic integration using a standard CMOS process
• Cores communicate via an optical WDM broadcast-and-select network
• Each core sends on its own dedicated wavelength using modulators
• Cores can receive from some set of senders using optical filters
Core-to-Core Communication
• 32-bit data words are transmitted across several parallel waveguides
• Each core contains receive filters and a FIFO buffer for every sender
• Data is buffered at the receiver until needed by the processing core
• Receiver can screen data by sender (i.e., wavelength) or message type
[Figure: sending cores A and B modulate 32-bit words onto a bundle of waveguides (ONet); the receiving core's per-sender filters deposit words into per-sender FIFOs feeding the processor core.]
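
A sketch of the receive-side buffering described above, assuming one FIFO per sender and a core that drains them in any order; the class and method names are illustrative.

    # Sketch: per-sender FIFOs let the core screen data by sender.

    from collections import deque

    class Receiver:
        def __init__(self, n_senders):
            # One filter + FIFO pair per sender wavelength.
            self.fifos = [deque() for _ in range(n_senders)]

        def on_optical_word(self, sender, word):
            # Called when the filter tuned to `sender`'s wavelength fires.
            self.fifos[sender].append(word)

        def receive_from(self, sender):
            # Core pulls the next buffered word from a chosen sender.
            return self.fifos[sender].popleft() if self.fifos[sender] else None

    rx = Receiver(n_senders=64)
    rx.on_optical_word(sender=5, word=0xCAFE)
    rx.on_optical_word(sender=9, word=42)
    assert rx.receive_from(9) == 42      # screening by sender
    assert rx.receive_from(5) == 0xCAFE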
ATAC Bandwidth
64 cores, 32 lines, 1 Gb/s per wavelength:
• Transmit BW: 64 cores × 1 Gb/s × 32 lines = 2 Tb/s
• Receive-weighted BW: 2 Tb/s × 63 receivers = 126 Tb/s
– A good metric for broadcast networks: it reflects WDM
ATAC allows better utilization of computational resources because less time is spent performing communication.
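
The bandwidth arithmetic, reproduced in code (the slide rounds 2.048 Tb/s down to 2 Tb/s before multiplying by 63, hence 126 rather than 129 Tb/s):

    # Transmit and receive-weighted bandwidth, computed.
    cores, lines, rate_gbps = 64, 32, 1

    transmit_tbps = cores * lines * rate_gbps / 1000      # 2.048 Tb/s
    receive_weighted = transmit_tbps * (cores - 1)        # x 63 receivers

    print(f"transmit: {transmit_tbps:.3f} Tb/s")          # quoted as 2 Tb/s
    print(f"receive-weighted: {receive_weighted:.0f} Tb/s")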
ATAC Efficiency
• Cores can directly communicate with any other core in one hop (< 3 ns)
• Broadcasts require just one send; no complicated routing on the network required
– Cheap broadcast enables frequent global communication
• Energy: ANet ~300 fJ/bit; electrical signaling ~94 fJ/bit/mm
• At 1000 cores (1 mm/hop):
– Electrical signaling when the destination is less than four hops away
– Optical signaling for broadcasts and long unicast messages
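
A sketch of the electrical/optical crossover implied by these numbers: electrical energy grows linearly with distance while ANet's cost is flat, so the break-even point is about 3.2 mm, i.e., roughly four 1-mm hops.

    # Crossover between distance-proportional electrical signaling and
    # flat-cost optical signaling, using the figures above.

    E_ELECTRICAL_PER_MM = 94.0   # fJ/bit/mm
    E_OPTICAL = 300.0            # fJ/bit, distance-independent
    MM_PER_HOP = 1.0

    crossover_mm = E_OPTICAL / E_ELECTRICAL_PER_MM
    print(f"crossover at ~{crossover_mm:.1f} mm")   # ~3.2 mm => ~3-4 hops

    def cheaper_network(hops):
        # Pick the lower-energy network for a unicast of `hops` hops.
        electrical = hops * MM_PER_HOP * E_ELECTRICAL_PER_MM
        return "electrical" if electrical < E_OPTICAL else "optical"

    assert cheaper_network(2) == "electrical"
    assert cheaper_network(5) == "optical"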
ATAC Performance Simulation & Evaluation
• PARSEC and SPLASH-2 benchmarks
– Nine applications from SPLASH-2
– Three applications from PARSEC
• ANet vs. EMesh
• Cache coherence protocols
– ACKwise
– DirB
– DirNB
• Performance evaluation and comparison of ANet and EMesh combinations (enumerated in the sketch below)
– Each type of network coupled with a coherence protocol
– Six combinations evaluated: (a) ANet-ACKwisek, (b) ANet-DirkB, (c) ANet-DirkNB, (d) EMesh-ACKwisek, (e) EMesh-DirkB, and (f) EMesh-DirkNB
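
The six combinations are simply the cross product of the two networks and three protocols (identifier spellings here are illustrative):

    # Enumerate the six evaluated network/protocol combinations.
    from itertools import product

    networks = ["ANet", "EMesh"]
    protocols = ["ACKwise", "DirB", "DirNB"]

    for label, (net, proto) in zip("abcdef", product(networks, protocols)):
        print(f"({label}) {net}-{proto}")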
ATAC Performance Simulation & Evaluation
• 64-core simulation: ANet64 compared to a 64-bit-wide EMesh network
• 1024-core simulation: ANet1024 compared to a 256-bit-wide EMesh network
ATAC Performance Simulation & Evaluation
[Figures: simulation results comparing the DirB and ACKwise protocols across the evaluated network/protocol combinations.]
System Capabilities and Performance
Baseline: Raw multicore chip (leading-edge tiled multicore, 64-core system, 65 nm process):
• Peak performance: 64 GOPS
• Chip power: 24 W
• Theoretical power efficiency: 2.7 GOPS/W
• Effective performance: 7.3 GOPS
• Effective power efficiency: 0.3 GOPS/W
• Total system power: 150 W
ATAC multicore chip (future optical-interconnect multicore, 64-core system, 65 nm process):
• Peak performance: 64 GOPS
• Chip power: 25.5 W
• Theoretical power efficiency: 2.5 GOPS/W
• Effective performance: 38.0 GOPS
• Effective power efficiency: 1.5 GOPS/W
• Total system power: 153 W
Optical communication requires a small amount of additional system power but allows much better utilization of computational resources.
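
A quick check of the efficiency figures (they divide by chip power, not total system power):

    # Verify the GOPS/W numbers from the two columns above.
    raw  = dict(peak=64.0, effective=7.3,  chip_w=24.0)
    atac = dict(peak=64.0, effective=38.0, chip_w=25.5)

    for name, c in (("Raw", raw), ("ATAC", atac)):
        print(f"{name}: theoretical {c['peak'] / c['chip_w']:.1f} GOPS/W, "
              f"effective {c['effective'] / c['chip_w']:.1f} GOPS/W")
    # Raw: 2.7 / 0.3    ATAC: 2.5 / 1.5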
Communication-centric Computing
• A view of extended global memory can be enabled cheaply with on-chip distributed cache memory and the ATAC network
• ATAC reduces off-chip memory accesses, and hence energy and latency

Operation              Energy    Latency
Network transfer       3 pJ      3 cycles
ALU add operation      2 pJ      1 cycle
32 KB cache read       50 pJ     1 cycle
Off-chip memory read   500 pJ    250 cycles

[Figure: a bus-based multicore pays 500 pJ per off-chip memory access, while ATAC serves the same data from a remote on-chip cache at 3 pJ per network transfer.]
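
A sketch of the energy/latency argument using the table's per-operation costs; modeling a remote cache read as a request plus a reply transfer is an assumption.

    # Remote on-chip cache read vs. off-chip DRAM read, from the table.
    E_NET, E_CACHE, E_DRAM = 3.0, 50.0, 500.0   # pJ
    L_NET, L_CACHE, L_DRAM = 3, 1, 250          # cycles

    # Remote read: request + reply network transfers plus a cache read.
    remote_energy  = 2 * E_NET + E_CACHE        # 56 pJ
    remote_latency = 2 * L_NET + L_CACHE        # 7 cycles

    print(f"remote cache: {remote_energy:.0f} pJ, {remote_latency} cycles")
    print(f"off-chip:     {E_DRAM:.0f} pJ, {L_DRAM} cycles")
    # ~9x less energy and ~35x lower latency than going off-chip.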
What Does the Future Look Like?
Corollary of Moore's law: the number of cores will double every 18 months.

Cores per chip:   '02    '05    '08    '11    '14
Research           16     64    256   1024   4096
Industry            4     16     64    256   1024

1K cores by 2014! Are we ready?
(Cores minimally big enough to run a self-respecting OS!)
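
The table follows directly from the doubling rule, computed here from the 2002 baselines:

    # Cores double every 18 months, i.e. quadruple every 3 years.
    def cores(start_cores, start_year, year):
        doublings = (year - start_year) / 1.5
        return start_cores * 2 ** doublings

    for year in (2002, 2005, 2008, 2011, 2014):
        print(year, int(cores(16, 2002, year)), "(research)",
              int(cores(4, 2002, year)), "(industry)")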
ATAC is an Efficient Network
• Modulators are the primary source of power consumption
– Receive power: requires only ~2 fJ/bit even with -5 dB link loss
– Modulator power: Ge-Si EA design ~75 fJ/bit (assuming 50 fJ/bit for the modulator driver)
• Example: 64-core communication
(i.e., N = 64 cores = 64 wavelengths; for a 32-bit word: 2048 drop filters/core and 32 modulators/core)
– Receive power: 2 fJ/bit × 1 Gb/s × 32 bits × N² = 262 mW
– Modulator power: 75 fJ/bit × 1 Gb/s × 32 bits × N = 153 mW
– Total energy/bit = 75 fJ/bit + 2 fJ/bit × (N-1) = 201 fJ/bit
• Comparison: electrical broadcast across 64 cores
– Requires 64 × 150 fJ/bit ≈ 10 pJ/bit (~50× more power)
(Assumes 150 fJ/mm/bit and 1-mm-spaced tiles)
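
The power arithmetic from this slide, reproduced in code (the slide's ~50× is this ~48× rounded):

    # Femtojoules-per-bit budgets converted to watts.
    FJ = 1e-15
    N, BITS, RATE = 64, 32, 1e9        # cores, word width, bit rate per line

    E_RX, E_MOD = 2 * FJ, 75 * FJ      # J/bit
    receive_w   = E_RX  * RATE * BITS * N ** 2   # every core hears all N senders
    modulator_w = E_MOD * RATE * BITS * N

    optical_fj    = (E_MOD + E_RX * (N - 1)) / FJ   # 201 fJ/bit
    electrical_fj = 64 * 150                        # 9600 fJ/bit ~= 10 pJ/bit

    print(f"receive {receive_w * 1e3:.0f} mW, "
          f"modulators {modulator_w * 1e3:.0f} mW")   # 262 mW, ~154 mW
    print(f"optical {optical_fj:.0f} fJ/bit vs electrical {electrical_fj} fJ/bit "
          f"(~{electrical_fj / optical_fj:.0f}x)")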
Summary
• ATAC uses optical networks to enable multicore programming and performance scaling
• ATAC encourages a communication-centric architecture, which helps multicore performance and power scalability
• ATAC simplifies programming with a contention-free all-to-all broadcast network
• ATAC is enabled by recent advances in CMOS integration of optical components