VLSI Design Challenges for Gigascale Integration
Shekhar Borkar
Intel Corp.
October 25, 2005
Outline
• Technology scaling challenges
• Circuit and design solutions
• Microarchitecture advances
• Multi-everywhere
• Summary
Goal: 10 TIPS by 2015
[Chart: MIPS (log scale, 0.01 to 10,000,000) vs. year (1970-2020), tracing the 8086, 286, 386, and 486 through the Pentium®, Pentium® Pro, and Pentium® 4 architectures toward the 2015 target.]
How do you get there?
Technology Scaling
[Diagram: MOS transistor cross-sections before and after scaling, labeling gate, source, drain, body, junction depth Xj, oxide thickness Tox, and effective channel length Leff.]
• Dimensions scale down by 30%
• Oxide thickness scales down
• Doubles transistor density
• Faster transistor, higher performance
• Vdd & Vt scaling
• Lower active power
Scaling will continue, but with challenges!
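These bullets follow from the classic per-generation scaling arithmetic. A minimal sketch of that arithmetic, using only the 30% shrink factor the slide cites (the derived numbers are idealized, not measured):

```python
# Classic per-generation scaling arithmetic; the 0.7x linear shrink is the
# slide's rule of thumb, and the derived figures are idealized.

linear_shrink = 0.7                      # dimensions scale down by ~30%

area_shrink = linear_shrink ** 2         # ~0.49x area per transistor
density_gain = 1 / area_shrink           # ~2x transistor density
delay_scale = linear_shrink              # ideal Delay = CV/I scaling of ~0.7x

print(f"area per transistor: {area_shrink:.2f}x")
print(f"transistor density:  {density_gain:.2f}x")
print(f"gate delay:          {delay_scale:.2f}x (slowing in practice)")
```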
Technology Outlook
High Volume Manufacturing   2004   2006   2008   2010   2012   2014   2016   2018
Technology Node (nm)        90     65     45     32     22     16     11     8
Integration Capacity (BT)   2      4      8      16     32     64     128    256
Delay = CV/I scaling        0.7    ~0.7   >0.7   (delay scaling will slow down)
Energy/Logic Op scaling     >0.35  >0.5   >0.5   (energy scaling will slow down)
Bulk Planar CMOS            High Probability  →  Low Probability
Alternate, 3G etc.          Low Probability   →  High Probability
Variability                 Medium    High       Very High
ILD (K)                     ~3        <3         reduce slowly towards 2-2.5
RC Delay                    1      1      1      1      1      1      1      1
Metal Layers                6-7    7-8    8-9    (0.5 to 1 layer per generation)
The Leakage(s)…
[Image: a 90nm MOS transistor with a 1.2 nm SiO2 gate oxide and 50nm gate length.]
[Chart: sub-threshold leakage Ioff (nA/µm) vs. temperature (30-130 °C) across process generations.]
[Charts: gate leakage (W) and source-drain leakage (W) vs. technology (0.25u, 0.18u, 0.13u, 90nm, 65nm, 45nm) for 1.5X and 2X transistor-count growth per generation; source-drain leakage shown during burn-in at 1.4X Vdd.]
Must Fit in Power Envelope
[Chart: power (W) and power density (W/cm²) for a 10 mm die, 0 to 1400, broken into active power, SD leakage, and SiO2 gate leakage across the 90nm, 65nm, 45nm, 32nm, 22nm, and 16nm technologies.]
It takes technology, circuits, and architecture together to constrain the power.
Solutions
• Move away from frequency alone to deliver performance
• More on-die memory
• Multi-everywhere
  – Multi-threading
  – Chip-level multi-processing
• Throughput-oriented designs
• Valued performance by higher level of integration
  – Monolithic & polylithic
Leakage Solutions
[Diagrams: a planar transistor with a 1.2 nm SiO2 gate dielectric and SiGe regions on a silicon substrate, beside a tri-gate transistor with a 3.0 nm high-k gate dielectric wrapped by the gate electrode.]
Active Power Reduction
Multiple supply voltages: feed fast paths from a high supply voltage and slow paths from a low supply voltage.
Throughput-oriented designs: replace one logic block running at Vdd with two parallel blocks running at Vdd/2.
  One block:   Freq = 1,   Vdd = 1,   Throughput = 1, Power = 1,    Area = 1, Power Density = 1
  Two blocks:  Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Power Density = 0.125
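A minimal sketch of the throughput-oriented trade quantified above, assuming switching power goes as C·V²·f and frequency scales roughly linearly with supply voltage (an idealization):

```python
# Dynamic power of n parallel logic blocks under the C*V^2*f model.
def switching_power(vdd, freq, n_blocks=1, c=1.0):
    return n_blocks * c * vdd**2 * freq

# Baseline: one block at full Vdd and full frequency.
base_power = switching_power(vdd=1.0, freq=1.0, n_blocks=1)

# Throughput-oriented: two blocks at Vdd/2 and half frequency,
# so combined throughput stays at 1.
par_power = switching_power(vdd=0.5, freq=0.5, n_blocks=2)

print(base_power)                  # 1.0
print(par_power)                   # 0.25, at the cost of 2x area
print(par_power / 2 / base_power)  # ~0.125 power density on twice the area
```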
Leakage Control
• Body bias (forward/reverse bias Vbp and Vbn on the wells): 2-10X reduction
• Stack effect (series transistors with equal loading): 5-10X reduction
• Sleep transistor (gating Vdd off the logic block): 2-1000X reduction
Pipeline & Performance
[Charts: performance vs. relative frequency (deeper pipelining) shows diminishing returns; power and power efficiency vs. relative frequency show an optimum frequency, since sub-threshold leakage increases exponentially; performance vs. relative pipeline depth shows an optimum pipeline depth.]
Maximum performance comes with:
• Optimum pipeline depth
• Optimum frequency
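The optimum arises from two competing effects the charts plot: deeper pipelines raise frequency but add hazard stalls, while power grows with frequency plus an exponentially rising leakage term. A toy model, not from the slide; all constants are illustrative assumptions chosen only to reproduce the shape of the curves:

```python
# Toy pipeline-depth trade-off model; constants are illustrative assumptions.
import math

def perf(depth, stall_per_stage=0.03):
    freq = depth                          # frequency ~ pipeline depth
    cpi = 1 + stall_per_stage * depth     # hazard stalls grow with depth
    return freq / cpi

def power(depth, leak_scale=0.3, leak_growth=0.25):
    dynamic = depth                       # dynamic power ~ frequency
    leakage = leak_scale * math.exp(leak_growth * depth)
    return dynamic + leakage

# Performance keeps rising with diminishing returns; perf/W peaks early.
for d in range(1, 11):
    print(f"depth {d:2d}: perf {perf(d):.2f}, perf/W {perf(d) / power(d):.2f}")
```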
Memory Latency
CPU → cache (small, ~few clocks) → memory (large, 50-100ns)
[Chart: memory latency in clocks vs. core frequency (100 MHz to 10,000 MHz), assuming a 50ns memory latency.]
A cache miss hurts performance, and hurts more at higher frequency.
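A quick check of the chart's assumption: a fixed 50ns miss penalty costs more core clocks as frequency rises.

```python
# Convert the slide's fixed 50ns memory latency into core clocks.
MEM_LATENCY_NS = 50  # assumption stated on the slide

for freq_mhz in (100, 1000, 3000, 10000):
    clocks = MEM_LATENCY_NS * 1e-9 * freq_mhz * 1e6
    print(f"{freq_mhz:>6} MHz -> cache miss costs ~{clocks:.0f} clocks")
```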
Increase On-Die Memory
[Chart: power density (W/cm², log scale) of logic vs. memory, and cache as a % of total die area (0% to 100%), across the 1u, 0.5u, 0.25u, 0.13u, and 65nm generations, with points for the 486, Pentium®, Pentium® III, Pentium® 4, and Pentium® M.]
Large on-die memory provides:
1. Increased data bandwidth & reduced latency
2. Hence, higher performance for much lower power
Multi-threading
Thermals & power delivery are designed for full HW utilization.
[Chart: single-thread performance as a % of full HW utilization vs. cache hit rate (100%, 98%, 96%) at 1, 2, and 3 GHz; utilization falls sharply as the hit rate drops, and falls faster at higher frequency.]
[Diagram: a single thread (ST) spends much of its time waiting for memory; with multi-threading, threads MT1-MT3 fill those wait-for-memory gaps.]
Multi-threading improves performance without impacting thermals & power delivery.
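A toy utilization model (not from the slide) that reproduces this behavior; the cycle counts are illustrative assumptions:

```python
# Each memory access costs a few compute cycles; a miss stalls the thread for
# a fixed penalty. Extra hardware threads fill stall cycles with their own work.
def hw_utilization(hit_rate, threads=1, cycles_per_access=2, miss_penalty=200):
    miss_rate = 1.0 - hit_rate
    per_thread = cycles_per_access / (cycles_per_access + miss_rate * miss_penalty)
    return min(1.0, threads * per_thread)

for hit in (1.00, 0.98, 0.96):
    st = hw_utilization(hit, threads=1)
    mt = hw_utilization(hit, threads=4)
    print(f"hit {hit:.0%}: single thread {st:.0%}, 4 threads {mt:.0%}")
```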
Single Core Power/Performance
[Charts: relative area and performance increase, and relative power and Mips/W change, across single-core microarchitecture generations (pipelined, superscalar, OOO speculative, deep pipeline).]
Moore's Law provides more transistors for advanced architectures, which deliver higher peak performance.
But… power grows faster than performance, so power efficiency drops.
Chip Multi-Processing
[Diagram: four cores (C1-C4) sharing a cache. Chart: relative performance vs. die area and power (1X to 4X); the multi-core curve keeps scaling toward ~3.5X while the single-core curve flattens.]
• Multi-core, each core multi-threaded
• Shared cache and front side bus
• Each core has different Vdd & Freq
• Core hopping to spread hot spots
• Lower junction temperature
Dual Core
Rule of thumb: a 1% change in voltage allows about a 1% change in frequency, changes power by about 3%, and changes performance by about 0.66%.
In the same process technology…
  Single core (core + cache):  Voltage = 1,     Freq = 1,     Area = 1, Power = 1, Perf = 1
  Dual core (2 cores + cache): Voltage = -15%,  Freq = -15%,  Area = 2, Power = 1, Perf = ~1.8
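Applying the rule of thumb to the dual-core row (the perfect two-way parallel scaling is the slide's idealization):

```python
# The slide's linearized rule of thumb: 1% voltage ~ 1% frequency ~ 3% power
# ~ 0.66% performance.
VOLTAGE_DROP = 0.15   # run both cores 15% below nominal voltage

freq_per_core  = 1 - VOLTAGE_DROP         # 0.85
power_per_core = 1 - 3 * VOLTAGE_DROP     # ~0.55
perf_per_core  = 1 - 0.66 * VOLTAGE_DROP  # ~0.90

print(f"dual-core power: {2 * power_per_core:.2f}x")  # ~1.1x, roughly the 1x budget
print(f"dual-core perf:  {2 * perf_per_core:.2f}x")   # ~1.8x
```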
Multi-Core
[Diagram: one large core compared with four small cores (C1-C4) sharing a cache; a small core has roughly 1/4 the power and 1/2 the performance of the large core.]
Multi-core is:
• Power efficient
• Better for power and thermal management
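The small-core arithmetic behind "power efficient", using the slide's 1/4-power, 1/2-performance ratio as the only assumption:

```python
# Compare perf/W of one large core against four small cores in the same budget.
large = {"power": 1.0, "perf": 1.0}
small = {"power": large["power"] / 4, "perf": large["perf"] / 2}

n = 4  # four small cores fit in the large core's power budget
multi = {"power": n * small["power"], "perf": n * small["perf"]}

print(f"large core:    perf/W = {large['perf'] / large['power']:.1f}")
print(f"4 small cores: perf/W = {multi['perf'] / multi['power']:.1f}")  # 2x better
```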
Special Purpose Hardware
[Chart: MIPS (log scale) vs. year (1995-2015) comparing general-purpose MIPS at 75W against TCP/IP offload engine (TOE) MIPS at ~2W; the TOE measures 2.23 mm x 3.54 mm and uses 260K transistors.]
Opportunities: network processing engines, MPEG encode/decode engines, speech engines
Special purpose HW provides the best Mips/Watt
Performance Scaling
Amdahl's Law: Parallel Speedup = 1 / (Serial% + (1 - Serial%)/N)
[Chart: performance vs. number of cores (up to 30) for two serial fractions.]
• Serial% = 6.7%: 16 cores deliver a speedup of only 8
• Serial% = 20%: 6 cores deliver a speedup of only 3
Parallel software is key to multi-core success
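The slide's formula, checked against both examples:

```python
# Amdahl's Law as stated on the slide.
def speedup(serial_fraction, n_cores):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

print(f"{speedup(0.067, 16):.1f}")  # ~8: 16 cores, 6.7% serial code
print(f"{speedup(0.20, 6):.1f}")    # ~3: 6 cores, 20% serial code
```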
From Multi to Many…
A 13mm, 100W, 22nm die with 48MB of cache and 4B transistors could hold 12 large cores, 24 medium cores, or 144 small cores.
[Charts: relative single-core performance for the large, medium, and small cores, and system performance on throughput (TPT) and one-, two-, four-, and eight-app workloads for each configuration.]
Future Multi-core Platform
[Diagram: a die mixing general purpose cores (GPC), special purpose cores (SPC), and special purpose HW over an interconnect fabric.]
Heterogeneous Multi-Core Platform
The New Era of Computing
[Chart: MIPS (log scale, 0.01 to 1,000,000) vs. year (1970-2010+), from the 8086, 286, 386, and 486 through pipelined, superscalar, speculative OOO, and multi-threaded architectures to multi-threaded multi-core and special purpose HW; the era of instruction-level parallelism gives way to the era of thread & processor-level parallelism.]
Multi-everywhere: MT, CMP
Summary
• Business as usual is not an option
  – Performance at any cost is history
• Must make a Right Hand Turn (RHT)
  – Move away from frequency alone
• Future architectures and designs
  – More memory (larger caches)
  – Multi-threading
  – Multi-processing
  – Special purpose hardware
  – Valued performance with higher integration