
On-Chip Optical Communication for Multicore Processors

Jason Miller Carbon Research Group

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LAB

“Moore’s Gap”

[Chart: Performance (GOPS, log scale) vs. time, 1992-2010, comparing pipelining, OOO, superscalar, SMT/FGMT/CGMT, multicore, and tiled multicore; the shortfall is labeled "The GOPS Gap"]

Causes of the gap:
- Diminishing returns from single-CPU mechanisms (pipelining, caching, etc.)
- Wire delays
- Power envelopes

Multicore Scaling Trends

Today

- A few large cores on each chip
- Diminishing returns prevent cores from getting more complex
- Only option for future scaling is to add more cores
- Still some shared global structures: bus, L2 caches

Tomorrow

- 100s to 1000s of simpler cores [S. Borkar, Intel, 2007]
- Simple cores are more power- and area-efficient
- Global structures do not scale; all resources must be distributed

[Diagram: today's multicore, with a few cores sharing a bus and L2 cache, versus tomorrow's tiled array of simple cores (e.g., MIT Raw), each with local memory and an on-chip network switch]

The Future of Multicore

- Number of cores doubles every 18 months
- Parallelism replaces clock-frequency scaling and core complexity
- Resulting challenges: scalability, programming, power

[Chip photos: IBM XCell 8i, Sun UltraSPARC T2, Tilera TILE64]

Multicore Challenges

- Scalability
  - How do we turn additional cores into additional performance?
  - Must accelerate single apps, not just run more apps in parallel
  - Efficient core-to-core communication is crucial
  - Architectures that grow easily with each new technology generation
- Programming
  - Traditional parallel programming techniques are hard
  - Parallel machines were rare and used only by rocket scientists
  - Multicores are ubiquitous and must be programmable by anyone
- Power
  - Already a first-order design constraint
  - More cores and more communication
  - Previous tricks (e.g., lowering Vdd) are running out of steam

Multicore Communication Today

[Diagram: two cores with private caches sharing a bus, an L2 cache, and off-chip DRAM]

Bus-based Interconnect

- Single shared resource
- Uniform communication cost
- Communication through memory
- Doesn't scale to many cores due to contention and long wires
- Scalable up to about 8 cores

Multicore Communication Tomorrow

Point-to-Point Mesh Network

[Diagram: tiled array of cores, each with local memory and a switch, connected to its neighbors in a 2D mesh]

- Examples: MIT Raw, Tilera TILEPro64, Intel Terascale Prototype
- Neighboring tiles are connected
- Distributed communication resources
- Non-uniform costs: latency depends on distance
- Encourages direct communication
- More energy efficient than a bus
- Scalable to hundreds of cores
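
Latency on such a mesh grows with the number of hops between tiles. The sketch below is a minimal illustration of that non-uniform cost; the 8x8 mesh size and one-cycle-per-hop figure are assumptions for the example, not parameters of any specific chip.

```python
# Illustrative only: on a 2D mesh, hop count (and thus latency) grows with
# the Manhattan distance between tiles. Per-hop cost and mesh size are assumed.
def mesh_hops(src, dst):
    """Hop count between two tiles given as (x, y) coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

CYCLES_PER_HOP = 1  # assumed cost of one router/link traversal

for dst in [(0, 1), (3, 4), (7, 7)]:   # a neighbor vs. the far corner of an 8x8 mesh
    hops = mesh_hops((0, 0), dst)
    print(f"(0,0) -> {dst}: {hops} hops, ~{hops * CYCLES_PER_HOP} cycles")
```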

Multicore Programming Trends

Meshes and small cores solve the physical scaling challenge, but programming remains a barrier. Parallelizing applications to thousands of cores is hard:

- Task and data partitioning
- Communication becomes critical as latencies increase
- Increasing contention for distant communication
  - Degraded performance, higher energy
- Inefficient broadcast-style communication
  - Major source of contention
  - Expensive to distribute a signal electrically

Multicore Programming Trends

For high performance, communication and locality must be managed

- Tasks and data must be both partitioned and placed
  - Analyze communication patterns to minimize latencies
  - Place data near the code that needs it most
  - Place certain code near critical resources (e.g., DRAM, I/O)
- Dynamic, unpredictable communication is impossible to optimize
- Orchestrating communication and locality increases programming difficulty exponentially

Improving Programmability

Observations:

- A cheap broadcast communication mechanism can make programming easier
  - Enables convenient programming models (e.g., shared memory)
  - Reduces the need to carefully manage locality
- On-chip optical components enable cheap, energy-efficient broadcast

ATAC Architecture

Electrical Mesh Interconnect

[Diagram: tiled cores connected by an electrical mesh of switches]

Optical Broadcast Network

- Waveguide passes through every core
- Multiple wavelengths (WDM) eliminate contention
- Signal reaches all cores in <2 ns
- Same signal can be received by all cores

[Diagram: optical waveguide threading through every tile of the mesh]

Optical Broadcast Network

[Diagram: N cores attached to a shared optical waveguide]

- Electronic-photonic integration using a standard CMOS process
- Cores communicate via an optical WDM broadcast-and-select network
- Each core sends on its own dedicated wavelength using modulators
- Cores can receive from some set of senders using optical filters

Optical bit transmission

- Each core sends data using a different wavelength → no contention
- Data is sent once; any or all cores can receive it → efficient broadcast

[Diagram: at the sending core, a flip-flop drives a modulator driver and modulator that imprint data onto light from a multi-wavelength source on the waveguide; at the receiving core, a filter, photodetector, transimpedance amplifier, and flip-flop recover the bit]

Core-to-core communication

- 32-bit data words are transmitted across several parallel waveguides
- Each core contains receive filters and a FIFO buffer for every sender
- Data is buffered at the receiver until needed by the processing core
- Receiver can screen data by sender (i.e., wavelength) or message type

[Diagram: sending cores A and B each place 32-bit words on the waveguides; the receiving core captures them into per-sender FIFOs feeding its processor core]
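
As a purely behavioral illustration of this receive path (not the ATAC hardware or its actual interface), the sketch below models per-sender FIFOs and screening by sender or message type; all class, method, and parameter names are invented for the example.

```python
# Behavioral sketch of an ATAC-style receiver: one FIFO per sender, with
# optional screening by sender (wavelength) or message type. Names and the
# message format are illustrative assumptions, not the real hardware interface.
from collections import deque

class OpticalReceiver:
    def __init__(self, num_senders, accepted_senders=None, accepted_types=None):
        self.fifos = {s: deque() for s in range(num_senders)}  # one FIFO per sender
        self.accepted_senders = accepted_senders  # None = accept all senders
        self.accepted_types = accepted_types      # None = accept all message types

    def on_word(self, sender, msg_type, word):
        """Called for every 32-bit word observed on the broadcast waveguides."""
        if self.accepted_senders is not None and sender not in self.accepted_senders:
            return  # filter is tuned away from this wavelength
        if self.accepted_types is not None and msg_type not in self.accepted_types:
            return
        self.fifos[sender].append((msg_type, word))  # buffer until the core needs it

    def receive(self, sender):
        """Processing core pulls the next buffered word from a given sender."""
        return self.fifos[sender].popleft() if self.fifos[sender] else None

# Example: a core that listens only to senders 0 and 1
rx = OpticalReceiver(num_senders=64, accepted_senders={0, 1})
rx.on_word(sender=0, msg_type="data", word=0xDEADBEEF)
rx.on_word(sender=7, msg_type="data", word=0x1234)   # screened out
print(rx.receive(0), rx.receive(7))
```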

ATAC Bandwidth

Assumptions: 64 cores, 32 lines, 1 Gb/s per line

- Transmit BW: 64 cores × 1 Gb/s × 32 lines = 2 Tb/s
- Receive-weighted BW: 2 Tb/s × 63 receivers = 126 Tb/s
  - A good metric for broadcast networks; reflects the benefit of WDM
- ATAC allows better utilization of computational resources because less time is spent performing communication
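
A quick arithmetic check of these figures (the slide rounds the transmit bandwidth to 2 Tb/s before multiplying by the 63 receivers):

```python
# Check of the slide's aggregate-bandwidth arithmetic (64 cores, 32 lines, 1 Gb/s per line).
cores, lines, gbps_per_line = 64, 32, 1

transmit_tbps = cores * lines * gbps_per_line / 1000   # 2.048 Tb/s (slide: ~2 Tb/s)
receive_weighted_tbps = transmit_tbps * (cores - 1)    # ~129 Tb/s exact; slide uses 2 x 63 = 126 Tb/s

print(f"Transmit BW:         {transmit_tbps:.3f} Tb/s")
print(f"Receive-weighted BW: {receive_weighted_tbps:.0f} Tb/s")
```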

System Capabilities and Performance

Baseline: Raw multicore chip (leading-edge tiled multicore) vs. ATAC multicore chip (future optical-interconnect multicore), both 64-core systems in a 65 nm process:

                               Raw (baseline)   ATAC
Peak performance               64 GOPS          64 GOPS
Chip power                     24 W             25.5 W
Theoretical power efficiency   2.7 GOPS/W       2.5 GOPS/W
Effective performance          7.3 GOPS         38.0 GOPS
Effective power efficiency     0.3 GOPS/W       1.5 GOPS/W
Total system power             150 W            153 W

Optical communications require a small amount of additional system power but allow for much better utilization of computational resources.
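
The figures above are internally consistent; dividing performance by chip power reproduces the stated efficiencies:

```python
# Recomputing the slide's power-efficiency figures from its own numbers.
raw_peak_gops, raw_eff_gops, raw_chip_w    = 64, 7.3, 24.0
atac_peak_gops, atac_eff_gops, atac_chip_w = 64, 38.0, 25.5

print(f"Raw  theoretical: {raw_peak_gops / raw_chip_w:.1f} GOPS/W")    # ~2.7
print(f"Raw  effective:   {raw_eff_gops / raw_chip_w:.1f} GOPS/W")     # ~0.3
print(f"ATAC theoretical: {atac_peak_gops / atac_chip_w:.1f} GOPS/W")  # ~2.5
print(f"ATAC effective:   {atac_eff_gops / atac_chip_w:.1f} GOPS/W")   # ~1.5
```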


Programming ATAC

- Cores can directly communicate with any other core in one hop (<2 ns)
- Broadcasts require just one send
- No complicated routing on the network is required
- Cheap broadcast enables frequent global communications
  - Broadcast-based cache-update / remote-store protocol
  - All "subscribers" are notified when a writing core issues a store ("publish")
- Uniform communication latency simplifies scheduling
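
A minimal software sketch of the publish/subscribe remote-store idea described above; the classes and methods are invented for illustration and do not represent the actual ATAC protocol implementation.

```python
# Sketch of a broadcast-based cache-update / remote-store ("publish") protocol.
# A write is published once on the broadcast network and every subscribed core
# updates its local copy. All names here are illustrative assumptions.

class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.local_cache = {}          # address -> value

    def on_publish(self, address, value):
        # Called when a broadcast store for a subscribed address arrives.
        self.local_cache[address] = value

class BroadcastNetwork:
    def __init__(self):
        self.subscribers = {}          # address -> set of subscribed cores

    def subscribe(self, core, address):
        self.subscribers.setdefault(address, set()).add(core)

    def publish(self, address, value):
        # One send on the optical network; all subscribers are notified.
        for core in self.subscribers.get(address, ()):
            core.on_publish(address, value)

net = BroadcastNetwork()
cores = [Core(i) for i in range(4)]
for c in cores[1:]:
    net.subscribe(c, address=0x100)

net.publish(address=0x100, value=42)               # a single broadcast store
print([c.local_cache.get(0x100) for c in cores])   # [None, 42, 42, 42]
```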

Communication-centric Computing

- ATAC reduces off-chip memory calls, and hence energy and latency
- A view of extended global memory can be enabled cheaply with on-chip distributed cache memory and the ATAC network

[Diagram: a bus-based multicore pays ~500 pJ per transfer through off-chip memory; ATAC pays ~3 pJ per on-chip network transfer]

Operation              Energy   Latency
Network transfer       3 pJ     3 cycles
ALU add operation      2 pJ     1 cycle
32 KB cache read       50 pJ    1 cycle
Off-chip memory read   500 pJ   250 cycles
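
Using only the per-operation costs in the table above, a rough back-of-the-envelope comparison of fetching a shared word from another core's on-chip cache over the network versus reading it from off-chip memory (control overheads ignored):

```python
# Back-of-the-envelope comparison using the table's per-operation costs.
CACHE_READ_PJ, NET_TRANSFER_PJ, DRAM_READ_PJ   = 50, 3, 500
CACHE_READ_CYC, NET_TRANSFER_CYC, DRAM_READ_CYC = 1, 3, 250

on_chip_pj  = CACHE_READ_PJ + NET_TRANSFER_PJ      # remote cache read + network transfer
on_chip_cyc = CACHE_READ_CYC + NET_TRANSFER_CYC

print(f"On-chip remote read:  {on_chip_pj} pJ, {on_chip_cyc} cycles")
print(f"Off-chip memory read: {DRAM_READ_PJ} pJ, {DRAM_READ_CYC} cycles")
print(f"Energy ratio: ~{DRAM_READ_PJ / on_chip_pj:.0f}x")   # roughly 9x
```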

Summary

ATAC uses optical networks to enable multicore programming and performance scaling:

- ATAC encourages a communication-centric architecture, which helps multicore performance and power scalability
- ATAC simplifies programming with a contention-free all-to-all broadcast network
- ATAC is enabled by recent advances in CMOS integration of optical components

Backup Slides

What Does the Future Look Like?

Corollary of Moore’s law: Number of cores will double every 18 months

Year        '02    '05    '08    '11    '14
Research     16     64    256   1024   4096
Industry      4     16     64    256   1024

1K cores by 2014! Are we ready?

(Cores minimally big enough to run a self-respecting OS!)
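
The table follows directly from the corollary stated above (doubling every 18 months, i.e., 4x every three years), starting from the 2002 values; a quick reproduction:

```python
# Reproducing the core-count projection: doubling every 18 months,
# starting from the slide's 2002 values (16 research, 4 industry).
def cores(start_count, start_year, year):
    doublings = (year - start_year) / 1.5
    return int(start_count * 2 ** doublings)

for year in (2002, 2005, 2008, 2011, 2014):
    print(year, cores(16, 2002, year), cores(4, 2002, year))  # research, industry
# 2014 -> 4096 (research), 1024 (industry)
```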

Scaling to 1000 Cores

[Diagram: 64 optically-connected clusters; within each cluster, dedicated electrical networks (ENet, BNet) connect 16 cores to an optical hub on the ONet, alongside a directory and caches]

- A purely optical design scales to about 64 cores
- Beyond that, clusters of cores share optical hubs
  - ENet and BNet move data to/from the optical hub
  - Dedicated, special-purpose electrical networks

ATAC is an Efficient Network

- Modulators are the primary source of power consumption
  - Receive power: requires only ~2 fJ/bit even with -5 dB link loss
  - Modulator power: Ge-Si EA design ~75 fJ/bit (assumes 50 fJ/bit for the modulator driver)
- Example: 64-core communication (N = 64 cores = 64 λs; for a 32-bit word: 2048 drops/core and 32 adds/core)
  - Receive power: 2 fJ/bit × 1 Gbit/s × 32 bits × N² = 262 mW
  - Modulator power: 75 fJ/bit × 1 Gbit/s × 32 bits × N = 153 mW
  - Total energy/bit = 75 fJ/bit + 2 fJ/bit × (N-1) = 201 fJ/bit
- Comparison: electrical broadcast across 64 cores
  - Requires 64 × 150 fJ/bit ≈ 10 pJ/bit (~50× more power)
  - (Assumes 150 fJ/mm/bit, 1 mm-spaced tiles)
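
These figures can be reproduced from the per-bit energies stated on the slide; small rounding differences are noted in the comments.

```python
# Reproducing the slide's 64-core optical vs. electrical broadcast figures.
N, WORD_BITS, GBPS = 64, 32, 1
RX_FJ, MOD_FJ = 2, 75                      # fJ/bit (modulator figure includes its driver)
ELEC_FJ_PER_MM, TILE_MM = 150, 1           # electrical broadcast assumptions from the slide

bps = GBPS * 1e9
receive_w   = RX_FJ * 1e-15 * bps * WORD_BITS * N**2    # ~0.262 W  (slide: 262 mW)
modulator_w = MOD_FJ * 1e-15 * bps * WORD_BITS * N      # ~0.154 W  (slide rounds to 153 mW)
optical_fj    = MOD_FJ + RX_FJ * (N - 1)                # 201 fJ/bit
electrical_fj = N * ELEC_FJ_PER_MM * TILE_MM            # 9600 fJ/bit, i.e. ~10 pJ/bit

print(f"Receive power:        {receive_w * 1e3:.0f} mW")
print(f"Modulator power:      {modulator_w * 1e3:.1f} mW")
print(f"Optical broadcast:    {optical_fj} fJ/bit")
print(f"Electrical broadcast: {electrical_fj} fJ/bit "
      f"(~{electrical_fj / optical_fj:.0f}x; slide quotes ~50x)")
```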