Thousand Core Chips
A Technology Perspective
Shekhar Borkar
Intel Corp.
June 7, 2007

Outline

Technology outlook
Evolution from multi-core to thousands of cores?
How do you feed thousands of cores?
Future challenges: variations and reliability
Resiliency
Summary

Technology Outlook

High Volume Manufacturing:  2004   2006   2008   2010   2012   2014   2016   2018
Technology Node (nm):       90     65     45     32     22     16     11     8
Integration Capacity (BT):  2      4      8      16     32     64     128    256
Delay = CV/I scaling:       0.7    ~0.7   >0.7   (delay scaling will slow down)
Energy/Logic Op scaling:    >0.35  >0.5   >0.5   (energy scaling will slow down)
Bulk Planar CMOS:           High Probability, trending to Low Probability
Alternate, 3G etc.:         Low Probability, trending to High Probability
Variability:                Medium, then High, then Very High
ILD (K):                    ~3     <3     (reduce slowly towards 2-2.5)
RC Delay:                   1      1      1      1      1      1      1      1
Metal Layers:               6-7    7-8    8-9    (0.5 to 1 layer per generation)

Terascale Integration Capacity

[Chart: transistors (millions, log scale) on a 300mm2 die vs. year, 2001-2017, showing total transistor count, ~100MB of cache, and ~1.5B logic transistors]

100+B transistor integration capacity

Scaling Projections

[Charts: frequency (GHz), Vdd (volts), and power (watts) projected for a 300mm2 die, 2001-2017; ideal 1.5X frequency scaling per generation vs. a realistic ~1.25X, and ideal 0.7X Vdd scaling per generation vs. a slower realistic trend]

Frequency scaling will slow down
Vdd scaling will slow down
Power will be too high

Why Multi-core? Performance

[Chart: Pollack's Rule - relative performance vs. area (X) or power (X) on a log-log scale, slope ~0.5, i.e. 2X power = 1.4X performance]
[Chart: relative performance, 2001-2017 - single-core performance flattens while multi-core has >10X potential]
Ever-larger single cores yield diminishing performance within a power envelope
Multi-core provides the potential for near-linear performance speedup
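
A minimal sketch (Python; not from the slides) of the arithmetic behind Pollack's Rule as stated above: single-core performance grows roughly as the square root of area or power, so doubling the power of one core buys only ~1.4X performance, while spending the same budget on two cores offers ~2X potential throughput if the workload parallelizes.

# Pollack's Rule: performance ~ sqrt(area or power)
def pollack_perf(area_or_power):
    return area_or_power ** 0.5

print(round(pollack_perf(2.0), 2))      # one 2X-power core:  ~1.41X performance
print(round(2 * pollack_perf(1.0), 2))  # two 1X-power cores: ~2.0X potential throughput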

Why Dual-core? Power

Rule of thumb:
  1% reduction in voltage   ->  ~3% reduction in power
  1% reduction in frequency ->  ~0.66% reduction in performance

In the same process technology...
  Single core (cache + core):   Voltage = 1,    Freq = 1,    Area = 1,  Power = 1,  Perf = 1
  Dual core (cache + 2 cores):  Voltage = -15%, Freq = -15%, Area = 2,  Power = 1,  Perf = ~1.8
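
A minimal sketch (Python; not from the slides) applying this rule of thumb to the dual-core case. The ~1.8X figure assumes the workload runs in parallel across both cores, and the linear rule is only an approximation for small voltage and frequency changes.

# Rule of thumb: 1% voltage cut -> ~3% power cut; 1% frequency cut -> ~0.66% perf cut
def scaled_core(voltage_cut_pct, freq_cut_pct):
    power = 1.0 - 0.03 * voltage_cut_pct
    perf = 1.0 - 0.0066 * freq_cut_pct
    return power, perf

power, perf = scaled_core(15, 15)  # each core at -15% voltage and -15% frequency
print(f"dual-core power ~ {2 * power:.1f}, perf ~ {2 * perf:.1f}")  # ~1.1 and ~1.8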

From Dual to Multi...

[Diagram: one large core vs. four small cores (C1-C4) sharing a cache; relative to the large core, a small core has Power = 1/4 and Performance = 1/2]

Multi-core:
Power efficient
Better power and thermal management

Future Multi-core Platform

[Diagram: an array of general purpose cores (GPC) and special purpose cores (SPC) connected by an interconnect fabric]

General purpose cores
Special purpose HW
Interconnect fabric
Heterogeneous multi-core platform (SOC)

Fine Grain Power Management

[Diagram: an array of cores, each running at full Vdd and frequency f, at 0.7xVdd and f/2, or shut down]

Cores with critical tasks: Freq = f at Vdd; TPT = 1, Power = 1
Non-critical cores: Freq = f/2 at 0.7xVdd; TPT = 0.5, Power = 0.25
Cores shut down: TPT = 0, Power = 0
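
A minimal sketch (Python; not from the slides) of why the non-critical cores land at ~0.25x power: assuming dynamic power dominates and scales as C*V^2*f (leakage ignored), 0.7x Vdd at half frequency gives 0.7^2 * 0.5 ~ 0.25.

# Relative dynamic power for a core, normalized to (Vdd, f) = (1, 1)
def relative_power(v_scale, f_scale):
    return v_scale ** 2 * f_scale

print(relative_power(1.0, 1.0))   # critical core at Vdd, f:           1.0
print(relative_power(0.7, 0.5))   # non-critical core at 0.7xVdd, f/2: ~0.25
print(relative_power(0.0, 0.0))   # core shut down:                    0.0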

Performance Scaling

Amdahl's Law: Parallel Speedup = 1 / (Serial% + (1 - Serial%) / N)

[Chart: performance vs. number of cores; with Serial% = 6.7%, 16 cores deliver Perf = 8; with Serial% = 20%, 6 cores deliver Perf = 3]

Parallel software key to Multi-core success
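
A minimal sketch (Python; not from the slides) reproducing the Amdahl's Law numbers plotted above.

# Parallel speedup = 1 / (serial fraction + (1 - serial fraction) / N)
def speedup(serial_fraction, n_cores):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

print(round(speedup(0.067, 16), 1))  # 6.7% serial, 16 cores -> ~8
print(round(speedup(0.20, 6), 1))    # 20% serial, 6 cores   -> ~3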

From Multi to Many...

13mm, 100W, 48MB cache, 4B transistors, in 22nm:
Large cores: 12, Medium cores: 48, Small cores: 144

[Charts: relative single-core performance for large, medium, and small cores, and system throughput (TPT) when running one, two, four, and eight applications on each configuration]

From Many to Too Many...

13mm, 100W, 96MB cache, 8B transistors, in 16nm:
Large cores: 24, Medium cores: 96, Small cores: 288

[Charts: relative single-core performance for large, medium, and small cores, and system throughput (TPT) when running one, two, four, and eight applications on each configuration]

On Die Network Power

[Charts: relative throughput and on-die network power (W) projected 2001-2017 for a 300mm2 die; with 4B-wide links and 4 links per core, ~1000 small (1.5MT) cores reach ~150W of network power vs. ~15W for ~100 large (15MT) cores]

A careful balance of:
1. Throughput performance
2. Single thread performance (core size)
3. Core and network power
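
A back-of-envelope sketch (Python; the per-link clock and energy-per-bit values are my assumptions, not the slides') estimating on-die network power as cores x links/core x link width x clock x energy per bit; it lands in the same range as the ~150W vs. ~15W comparison above.

# Rough estimate of on-die network power
def network_power_watts(n_cores, links_per_core=4, link_bytes=4,
                        clock_ghz=3.0, pj_per_bit=0.3):
    bits_per_sec_per_core = links_per_core * link_bytes * 8 * clock_ghz * 1e9
    return n_cores * bits_per_sec_per_core * pj_per_bit * 1e-12

print(round(network_power_watts(1000)))  # ~1000 small cores -> on the order of 100+ W
print(round(network_power_watts(100)))   # ~100 large cores  -> on the order of 10+ W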

Observations

Scaling from multi- to many-core demands more parallelism every generation
• Thread level, task level, application level

Many (or too many) cores do not always mean...
• The highest performance
• The highest MIPS/Watt
• The lowest power

If on-die network power is significant, then power is even worse
Now software, too, must follow Moore's Law

Memory BW Gap

Busses have become wider to deliver necessary memory BW (10 to 30 GB/sec)

[Chart: core clock vs. bus clock (MHz), 1985-2010, showing a widening gap]

Yet, memory BW is not enough
Many-core systems will demand 100 GB/sec memory BW

How do you feed the beast?

IO Pins and Power

[Chart: I/O power (mW/Gbps) vs. signaling rate (Gbit/sec); state-of-the-art links sit near 25 mW/Gbps, research designs well below]

State of the art:
100 GB/sec ~ 1 Tb/sec = 1,000 Gb/sec x 25 mW/Gb/sec = 25 Watts
Bus-width = 1,000/5 = 200, about 400 pins (differential)

Too many signal pins, too much power

Solution

High speed busses (chip to chip, > 5mm):
Busses are transmission lines (L-R-C effects)
Need signal termination
Signal processing consumes power

Solutions (chip to chip, < 2mm, an R-C bus):
Reduce distance to << 5mm
Reduce signaling speed (~1 Gb/sec)
Increase pins to deliver BW
1-2 mW/Gbps

100 GB/sec ~ 1 Tb/sec = 1,000 Gb/sec x 2 mW/Gb/sec = 2 Watts
Bus-width = 1,000/1 = 1,000 pins
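
A minimal sketch (Python; not from the slides) of the I/O power and pin-count arithmetic on this slide and the previous one, using the slides' ~1 Tb/sec figure for 100 GB/sec; the long-reach case doubles the pin count for differential signaling.

def io_budget(total_gbps, gbps_per_pin, mw_per_gbps, differential):
    power_watts = total_gbps * mw_per_gbps / 1000.0
    pins = total_gbps / gbps_per_pin * (2 if differential else 1)
    return power_watts, pins

# State of the art: 5 Gb/sec per pin at 25 mW/Gbps -> (25.0 W, 400 pins)
print(io_budget(1000, gbps_per_pin=5, mw_per_gbps=25, differential=True))
# Short R-C busses: 1 Gb/sec per pin at 2 mW/Gbps  -> (2.0 W, 1000 pins)
print(io_budget(1000, gbps_per_pin=1, mw_per_gbps=2, differential=False))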

Anatomy of a Silicon Chip

[Diagram: a silicon chip on its package; heat flows up into the heat-sink, while power and signals come in through the package]

System in a Package

[Diagram: two silicon chips side by side on one package]

Limited pins: 10mm / 50 micron = 200 pins
Signal distance is large (~10 mm), so power is higher
Complex package

DRAM on Top

[Diagram: DRAM die stacked on top of the CPU, under the heat-sink; heat-sink at 85°C, CPU junction temperature at 100+°C]

High temp and hot spots: not good for DRAM

DRAM at the Bottom

[Diagram: CPU stacked on top of a thin DRAM die on the package, under the heat-sink]

Power and IO signals go through the DRAM to the CPU
Thin DRAM die, with through-DRAM vias

The most promising solution to feed the beast

Reliability

[Charts: Vt distributions widening with each technology generation (extreme device variations); soft error FIT per chip (logic and memory) rising by generation; drive current (Ion) and gate-oxide leakage (Jox) degrading over time, even with Hi-K?]

Extreme device variations
Time dependent device degradation
Burn-in may phase out...?

Implications to Reliability

Extreme variations (static & dynamic) will result in unreliable components
Impossible to design reliable systems as we know them today
• Transient errors (soft errors)
• Gradual errors (variations)
• Time dependent errors (degradation)

Reliable systems with unreliable components: resilient microarchitectures

Implications to Test

One-time factory testing will be out
Burn-in to catch chip infant-mortality will not be practical
Test HW will be part of the design
Dynamically self-test, detect errors, reconfigure, & adapt

In a Nut-shell...

100 billion transistor (100 BT) integration capacity
Billions will be unusable (variations)
Some will fail over time
Intermittent failures
Yet, deliver high performance in the power & cost envelope

Resiliency with Many-Core

[Diagram: an array of cores under dynamic test and management]

Dynamic on-chip testing
Performance profiling
Binning strategy
Dynamic, fine grain, performance and power management
Cores in reserve (spares)
Coarse-grain redundancy checking
Dynamic error detection & reconfiguration
Decommission aging cores, swap with spares

Dynamically...
1. Self test & detect
2. Isolate errors
3. Confine
4. Reconfigure, and
5. Adapt
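
An illustrative sketch (Python; the class and its structure are mine, not from the slides) of the dynamic detect / isolate / reconfigure flow above: a core pool that removes cores failing self-test and swaps in spares so the chip keeps delivering performance.

class ResilientCorePool:
    def __init__(self, active, spares):
        self.active = set(active)   # core IDs currently doing useful work
        self.spares = list(spares)  # reserve cores held back as spares

    def self_test_cycle(self, failed_cores):
        # Isolate and confine cores that fail dynamic self-test,
        # then reconfigure by promoting spares.
        for core in failed_cores:
            if core in self.active:
                self.active.discard(core)
                if self.spares:
                    self.active.add(self.spares.pop())

pool = ResilientCorePool(active=range(16), spares=[16, 17])
pool.self_test_cycle(failed_cores=[3])
print(len(pool.active))  # still 16 usable cores after swapping in a spare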

Summary

Moore's Law with terascale integration capacity will allow integration of thousands of cores
Power continues to be the challenge
On-die network power could be significant
Optimize for power with the size of the core and the number of cores
3D memory technology needed to feed the beast
Many-cores will deliver the highest performance in the power envelope, with resiliency