UC Davis Seminar - University of Sydney

Microprocessor and DSP
Technologies for the
Nanoscale Era
Seminar 1
Ram Kumar Krishnamurthy
Microprocessor Research Labs
Intel Corporation, Hillsboro, OR
[email protected]
Intel Labs
July 5, 2005
About Circuits Research Lab
• Established 1996
• Belongs under Microprocessor Technology Labs
• Located in Hillsboro, Oregon, USA (primary) and
Bangalore, India
• 75 researchers
• Charter:
• High-performance & low-power digital circuits
• Off-chip I/O signaling circuits
• Power delivery circuits
• >50 patents, >25 papers per year
Motivation: Higher performance at
lower power and cost
[Figure: MIPS (log scale, 0.01 to 1,000,000) versus year, 1970 to 2010, for the 8086, 286, 386, 486, Pentium® Architecture, Pentium® Pro Architecture, and Pentium® 4 Architecture]
Strong demand for > 1 TIPS performance beyond this decade
How do you get there?
Our Research Agenda Outlook
Year: 2004 | 2006 | 2008 | 2010 | 2012 | 2014 | 2016 | 2018
Technology Node (nm): 90 | 65 | 45 | 32 | 22 | 16 | 11 | 8
Integration Capacity (BT): 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64
Delay = CV/I scaling: 0.7 | ~0.7 | >0.7 | Delay scaling will slow down
Energy/Logic Op scaling: >0.35 | >0.5 | >0.5 | Energy scaling will slow down
Bulk Planar CMOS: High Probability, moving to Low Probability
Alternate, 3G etc: Low Probability, moving to High Probability
Variability: Medium | High | Very High
ILD (K): ~3 | <3 | Reduce slowly towards 2-2.5
RC Delay: 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
Metal Layers: 6-7 | 7-8 | 8-9 | 0.5 to 1 layer per generation
FCRP(MARCO)
Internal
University
Intel’s Research Focus
Technology Leadership
[Figure: Gate Length (nm, log scale) versus year, 1990 to 2010, comparing Intel with the industry]
Complete solution stack
Technology
Arch & Design
Platforms
Software
Architectures & Designs
Segment: Back End Server | Server | Desktop | Mobile | Handheld
Family: Itanium® | Itanium®, Xeon® | Pentium®, Celeron | Centrino®, Pentium® | XScale®
Architecture: IA64, VLIW | IA64/IA32 | IA32 | IA32 | ARM
Word: 64 bit | 64 bit Itanium, 32 bit Xeon | 32 bit | 32 bit | 32 bit
Address Space: Huge | Huge/4 GB | 4 GB | 4 GB | 4 GB
Cache: 6 MB | 6 MB, 2 MB | 1 MB | 1 MB | 512 KB
Performance: High | High | High | Medium | Low
Power: ~130 W | ~100 W | < 100 W | ~25 W | < 1 W
Power Metric: Watts/sq ft | Watts/cu ft | Watts | Watt-hours | Battery Life
Cost: High | High | Med | Med | Low
Our research agenda addresses all these platforms
Is Transistor a Good Switch?
[Figure: ideal switch versus transistor. Ideal switch: on current I = ∞, off current I = 0. Transistor: on current I = 1 mA/µ, off current I ≠ 0]
Sub-threshold Leakage
[Figure: MOS Transistor Characteristics: Ids (log) versus Vgs, showing the exponential increase in Ioff for a threshold shift ΔVt]
[Figure: Ioff (nA/µ, log scale, 1 to 10000) versus Temp (30 to 130 °C), with curves labelled 0.25, 90, and 45]
[Figure: SD Leakage (Watts, log scale, 0.1 to 1000) versus Technology (0.25u, 0.18u, 0.13u, 90nm, 65nm, 45nm) for 2X and 1.5X Tr Growth]
Transistors will not be switches, but dimmers
Leakage Power
[Figure: Leakage Power (% of Total, 0% to 50%) versus Technology (µm, 1.5 down to 0.05); annotation: must stop at 50%]
A. Grove, IEDM 2002
Leakage power limits Vt scaling
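High Leakage Impacts Functionality
[Figure: dynamic gate schematic with Clock, keeper Pk0, output inverter (INV_out), dynamic node (Dyn_out), leakage current I Leak, and pulldown transistors M11 to M1K and M21 to M2K]
[Figure: Keeper / pulldown ratio (0 to 1.6) versus Subthreshold + gate leakage (1X to 20X), Sub-70nm]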
M. Anders, R. Krishnamurthy et al, 2001 Symp. VLSI Circuits

Sub-65nm Dynamic Circuit Active Leakage Tolerance:
• Cache, RF, Arrays, Bitlines most affected
• Keeper sizes > 50% of pulldown strength
• High contention → degraded performance
• Slow keeper shutoff → high short-circuit power
Power Will be the Limiter
[Figure: transistor count (millions, log scale) versus year, 1970 to 2020, for the 8080, 8086, 386, Pentium® proc, and Pentium® 4 proc, extrapolated to 1 billion transistors]
1B transistor integration capacity will exist
[Figure: frequency (MHz, log scale) versus year, 1970 to 2020, for the same processors, extrapolated to 15-30 GHz]
[Figure: MIPS (log scale) versus year, 1970 to 2020, for the same processors, extrapolated to 1 TIPS]
Applications will demand TIPS performance
[Figure: Power (Watts, log scale) versus year, 1970 to 2020, for the same processors; annotation: 1000's of Watts?]
But the Power…
Challenge: Highest performance in the power envelope
Power Trend
[Figure: Power (W, log scale, 1 to 100) versus year, 1985 to 2000, for the 386, 486, Pentium® processor, Pentium® II processor, and Pentium® 4 processor, against the cooling capacity of a conventional system]
“Business As Usual” is Not an Option
C scales by 30% per generation…
…but Vcc scales by 10-15% only!
Must maintain or reduce power in future
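A rough sketch of why the slower Vcc scaling matters, assuming switching energy per operation is proportional to C·V² (a standard first-order model that the slide does not state explicitly). The function name and the choice of scaling factors are illustrative only.

```python
def energy_scale(c_scale: float, v_scale: float) -> float:
    """Per-generation scaling of switching energy, modeled as C * V^2."""
    return c_scale * v_scale ** 2

# C shrinks by 30% per generation (factor 0.7) in all three cases.
ideal = energy_scale(0.7, 0.70)     # Vcc also shrinks by 30%
slow_15 = energy_scale(0.7, 0.85)   # Vcc shrinks by only 15%
slow_10 = energy_scale(0.7, 0.90)   # Vcc shrinks by only 10%

print(f"ideal: {ideal:.3f}, 15% Vcc: {slow_15:.3f}, 10% Vcc: {slow_10:.3f}")
```

Under this model the energy per operation falls to about 0.34 of its previous value with full voltage scaling, but only to about 0.51 to 0.57 when Vcc scales by 10-15%.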
Gate Oxide is Near Limit
[Figure: cross-section of a 130nm transistor (70 nm scale marker): CoSi2, Si3N4, Poly Si Gate Electrode, 1.5 nm Gate Oxide, Si Substrate]
[Figure: Gate Leakage (Watts, log scale, 1.E-03 to 1.E+06) versus Technology (0.25u, 0.18u, 0.13u, 90nm, 65nm, 45nm), with curves labelled 1.5X and 2X; annotation: During Burn-in, 1.4X Vdd]
Intel’s High K leadership is crucial for the industry
Power Density Will Get Even Worse
[Figure: Power Density (W/cm2, log scale, 1 to 10,000) versus year, ’70 to ’10, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, and Pentium® processors]
Need to Keep the Junctions Cool
• Performance (Higher Frequency)
• Lower leakage (Exponential)
• Better reliability (Exponential)
Pat Gelsinger, ISSCC 2001
Active Power Reduction
Multiple Supply Voltages
[Figure: fast and slow logic blocks, with high and low supply voltages]
Replicated Designs
One Logic Block at Vdd: Freq = 1, Vdd = 1, Throughput = 1, Power = 1, Area = 1, Pwr Den = 1
Two Logic Blocks at Vdd/2: Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Pwr Den = 0.125
Need high-speed multi-supply level converter techniques
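The replicated-design numbers can be reproduced with a small normalized model. This is a sketch that assumes dynamic power per block is proportional to Vdd²·f and that leakage is ignored. The function name and dictionary keys are hypothetical.

```python
def replicated_design(n_blocks: int, vdd: float, freq: float) -> dict:
    """Normalized metrics for n identical logic blocks at a given Vdd and frequency."""
    throughput = n_blocks * freq
    power = n_blocks * vdd ** 2 * freq   # dynamic power ~ C * V^2 * f per block
    area = float(n_blocks)
    return {
        "throughput": throughput,
        "power": power,
        "area": area,
        "power_density": power / area,
    }

baseline = replicated_design(1, vdd=1.0, freq=1.0)
replicated = replicated_design(2, vdd=0.5, freq=0.5)
print(baseline)
print(replicated)
```

With two blocks at half the voltage and half the frequency, the model gives throughput 1, power 0.25, area 2, and power density 0.125, which matches the figures on the slide.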
Leakage Control
Body Bias | Stack Effect | Sleep Transistor
[Figure: circuit sketches labelled Vbp (+Ve), Vbn (-Ve), Vdd, Equal Loading, and Logic Block]
Reductions listed: 2-10X reduction | 2-1000X reduction | 2-200X reduction
Need low leakage and leakage tolerant techniques
Dual Vt Design for Active Leakage Reduction
Technology provides two Vt:
• High Vt with nominal Ioff (lower performance)
• Low Vt with ~10X higher Ioff (higher performance)
[Figure: three histograms of number of paths versus delay, for logic paths between latch boundaries: High Vt, Low Vt, Dual Vt]
Employing high Vt everywhere yields lower performance and lower leakage (1X).
Employing low Vt everywhere yields higher performance but higher leakage (10X).
Selective usage of low and high Vt yields higher performance, yet low leakage, between 1X and <<10X.
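A minimal sketch of the leakage trade-off, assuming every device contributes equally and a low-Vt device leaks 10X as much as a high-Vt one. The 20% low-Vt fraction below is a made-up illustration, not a figure from the talk.

```python
def relative_leakage(low_vt_fraction: float, low_vt_penalty: float = 10.0) -> float:
    """Leakage relative to an all-high-Vt design (1X), weighting devices equally."""
    if not 0.0 <= low_vt_fraction <= 1.0:
        raise ValueError("low_vt_fraction must be between 0 and 1")
    return (1.0 - low_vt_fraction) + low_vt_fraction * low_vt_penalty

for fraction in (0.0, 0.2, 1.0):
    print(f"{fraction:.0%} low Vt -> {relative_leakage(fraction):.1f}X leakage")
```

Using low Vt only on the timing-critical fraction of devices keeps total leakage far below the 10X of an all-low-Vt design.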
Chip Multi-Processing
[Figure: four cores, C1 to C4, around a shared Cache]
[Figure: Relative Performance (1 to 3.5) versus Die Area, Power (1 to 4) for CMP and ST]
• Multi-core, each core Multi-threaded
• Shared cache and front side bus
• Each core has different Vdd & Freq
• Core hopping to spread hot spots
• Lower junction temperature
Memory Latency
[Figure: CPU, Cache (small, ~few clocks), Memory (large, 50-100ns)]
[Figure: Memory Latency (Clocks, log scale, 1 to 1000) versus Freq (MHz, 100 to 10000), assuming 50ns memory latency]
Cache miss hurts performance
Worse at higher frequency
Need power efficient high-speed I/O techniques
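The relationship between a fixed memory latency and the clock frequency is simple arithmetic. The sketch below uses the 50 ns assumption from the slide. The function name is hypothetical.

```python
def latency_in_clocks(latency_ns: float, freq_mhz: float) -> float:
    """Clock cycles that elapse during one memory access."""
    return latency_ns * freq_mhz / 1000.0   # ns * MHz = 1e-3 cycles

for freq_mhz in (100, 1000, 10000):
    print(freq_mhz, "MHz ->", latency_in_clocks(50, freq_mhz), "clocks")
```

A 50 ns miss costs 5 clocks at 100 MHz, 50 clocks at 1 GHz, and 500 clocks at 10 GHz, which is why a cache miss hurts more at higher frequency.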
Increase on-die Memory
[Figure: Power Density (Watts/cm2, log scale, 1 to 100) for Logic versus Memory]
[Figure: Cache % of full chip area (0% to 100%) across the Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4 generations, with the future trend marked "?"]
Large on die memory provides:
1. Increased Data Bandwidth & Reduced Latency
2. Hence, higher performance for much lower power
Special Purpose Hardware Acceleration
TCP Offload Engine
[Figure: MIPS (log scale, 1.E+02 to 1.E+06) versus year, 1995 to 2015: GP MIPS @75W versus TOE MIPS @~2W]
2.23 mm X 3.54 mm, 260K transistors
Opportunities for acceleration:
Network processing engines
MPEG Encode/Decode engines
Speech engines
Wireless communication/baseband
Special purpose HW—Best MIPS/Watt
Energy-efficient Data-path Circuits
[Figure: processor thermal map, Temp (°C), showing the Cache and the Execution core with Integer and FP ALUs and MACs]
• ALUs: performance and peak-current limiters
• High activity → thermal hotspots
• Goal: high-performance energy-efficient design
130nm 9GHz 32-bit Integer ALU (ISSCC’02)
[Figure: die microphotograph of the 32-bit integer exec core: Input FIFO, Output FIFO, RF, ALU, Clock, Scan Ctl, BB Ctl, Misc]
Die Size: 1.61 x 1.44 mm
Process: 130nm CMOS
Interconnect: 1 poly, 6 metal
Transistors: 160K
Maximum VCC: 1.5 V
[Figure: Fmax (GHz, 0 to 9), Power (mW, 0 to 450), and Leakage Power (mW, 0 to 100) versus Supply Voltage (V, 0.8 to 1.5)]
Design target: 6.5GHz at 120mW
M. Anders, R. Krishnamurthy et al,
Intl. Solid-state Circuits Conf. 2002 &
IEEE Journal of Solid-state Circuits 11/02
90nm 7GHz 64-bit Integer ALU (ISSCC’04)
[Figure: die microphotograph: Upper-order 32-bit ALU, Lower-order 32-bit ALU, Clock Generator and Drivers, I/O Circuits]
Process: 90nm Dual-Vt CMOS, 7 Metal
Die area: 0.474mm²
64-bit ALU layout area: 0.073mm²
Total transistor count: 6100
64-bit ALU average switching power (a=0.3): 89mW at 4GHz, 1.3V, 25°C
64-bit ALU active leakage power: 9.6mW at 1.3V, 25°C
64-bit ALU maximum frequency: 7GHz at 2.1V, 25°C
32-bit ALU average switching power (a=0.3): 71mW at 7GHz, 1.3V, 25°C
32-bit ALU active leakage power: 4.4mW at 1.3V, 25°C
64-bit ALU die microphotograph and measured performance summary
• 7GHz single-cycle 64-bit integer ALU (measured in 90nm CMOS)
• Simultaneous 9GHz single-cycle 32-bit integer ALU mode
• Fastest reported single-cycle 64-bit integer ALU performance
S. Mathew, R. Krishnamurthy et al,
Intl. Solid-state Circuits Conf. 2004 &
IEEE Journal of Solid-state Circuits 01/05
90nm 1GHz 9mW 16*16b Multiplier (ISSCC’05)
[Figure: die microphotograph: 16x16b Multiplier, R-PLA, Registers, Clock Generator and Drivers, I/O Circuits]
Process: 90nm Dual-Vt CMOS
Die area: 0.474mm²
16b Multiplier and PLA layout area: 0.03mm²
16b Multiplier worst-case power: 9mW at 1GHz, 1.3V, 50°C (nominal)
16b Multiplier active leakage power: 540μW at 1.3V, 50°C (nominal)
16b Multiplier peak performance: 1.5GHz, 32mW at 1.95V, 50°C
16b Multiplier low-voltage mode performance: 50MHz, 79μW at 0.57V, 50°C
Reconfigurable PLA peak performance: 2.3GHz, 4.2mW at 1.3V, 50°C
Reconfigurable PLA worst-case power: 2mW at 1GHz, 1.3V, 50°C (nominal)
Stand-by mode power: 75μW (7X reduction vs. active leakage)
16*16-bit Multiplier die microphotograph and measured performance summary
• 1GHz single-cycle 16*16-bit DSP multiplier (measured in 90nm CMOS)
• Reconfigurable PLA control engine
• 9pJ/Op or 110GOPS/Watt
• Highest reported GOPS/Watt for single-cycle 16-bit multiply
S. Hsu, R. Krishnamurthy et al,
Intl. Solid-state Circuits Conf. 2005
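The energy-per-operation and GOPS/Watt figures follow from the measured power and frequency. This is a sketch of the unit conversion, assuming one multiply completes per clock cycle (the multiplier is described as single-cycle). The function names are hypothetical.

```python
def energy_per_op_pj(power_mw: float, freq_ghz: float) -> float:
    """Energy per operation in pJ, assuming one operation per clock cycle."""
    return power_mw / freq_ghz   # mW / GHz = pJ

def gops_per_watt(energy_pj: float) -> float:
    """Operations per second per watt, expressed in GOPS/W."""
    return 1000.0 / energy_pj    # 1 op/pJ = 1e12 ops/J = 1000 GOPS/W

energy = energy_per_op_pj(power_mw=9.0, freq_ghz=1.0)
print(f"{energy:.1f} pJ/op, {gops_per_watt(energy):.0f} GOPS/W")
```

For 9 mW at 1 GHz this gives 9 pJ per operation and roughly 111 GOPS/W, in line with the rounded 110 GOPS/Watt quoted on the slide.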
32-bit ALU architecture
[Figure: 32-bit ALU block diagram: two 6:1 muxes for external operands, adder core, 5:1 mux, 2:1 mux, output (O/p) mux producing Sum, and a loopback bus, with mux, shift, and sign control signals]
Multiple ALUs clustered together in the execution core → High power density
Full-Adder Intro
[Figure: full adder block with inputs A, B, Cin and outputs Sum, Cout]
The Binary Adder
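[Figure: full adder block with inputs A, B, Cin and outputs Sum, Cout]
$S = A \oplus B \oplus C_i = A\,\overline{B}\,\overline{C_i} + \overline{A}\,B\,\overline{C_i} + \overline{A}\,\overline{B}\,C_i + A\,B\,C_i$
$C_o = AB + BC_i + AC_i$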
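The full-adder equations can be checked exhaustively over all eight input combinations. This sketch takes the sum-of-products form with the usual complemented terms (one input true, or all three true). The function names are hypothetical.

```python
from itertools import product

def full_adder(a: int, b: int, ci: int) -> tuple:
    """Return (sum, carry_out) using the XOR and majority forms."""
    s = a ^ b ^ ci
    co = (a & b) | (b & ci) | (a & ci)
    return s, co

def sum_of_products(a: int, b: int, ci: int) -> int:
    """Sum bit as four minterms: exactly one input set, or all three set."""
    na, nb, nc = 1 - a, 1 - b, 1 - ci
    return (a & nb & nc) | (na & b & nc) | (na & nb & ci) | (a & b & ci)

for a, b, ci in product((0, 1), repeat=3):
    s, co = full_adder(a, b, ci)
    assert s == sum_of_products(a, b, ci)   # both forms of S agree
    assert a + b + ci == s + 2 * co         # outputs encode the arithmetic sum
print("full adder equations verified for all 8 input combinations")
```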
The Ripple-Carry Adder
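[Figure: 4-bit ripple-carry adder: four full adders (FA) with inputs A0 to A3 and B0 to B3, carry-in Ci,0, carries Co,0 (= Ci,1), Co,1, Co,2, Co,3, and sums S0 to S3]
Worst case delay linear with the number of bits
$t_d = O(N)$
$t_{adder} = (N-1)\,t_{carry} + t_{sum}$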
Goal: Make the fastest possible carry path circuit
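A minimal sketch of a ripple-carry adder and its worst-case delay formula. The unit delays passed to `adder_delay` are placeholder values, not measurements, and the helper names are hypothetical.

```python
def ripple_carry_add(a_bits, b_bits, carry_in=0):
    """Add two little-endian bit lists with a chain of full adders."""
    sums, carry = [], carry_in
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | (b & carry) | (a & carry)   # carry ripples to the next bit
    return sums, carry

def adder_delay(n_bits, t_carry, t_sum):
    """Worst-case delay: (N-1) carry stages plus one sum stage."""
    return (n_bits - 1) * t_carry + t_sum

def to_bits(value, n_bits):
    """Little-endian bit list of an unsigned integer."""
    return [(value >> i) & 1 for i in range(n_bits)]

sums, carry_out = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
print(sums, carry_out)                           # 11 + 6 = 17: sum bits plus carry-out
print(adder_delay(32, t_carry=1.0, t_sum=1.0))   # delay in units of one gate stage
```

Because the delay grows linearly with N, a 32-bit ripple-carry adder with unit stage delays needs 32 stage delays in the worst case, which motivates the faster carry schemes that follow.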
Static CMOS Full Adder
[Figure: static CMOS full adder schematic: inputs A, B, Ci; internal node X; outputs S and Co]
28 Transistors
Carry Look-ahead
$Sum_i = A_i \oplus B_i \oplus Carry_{i-1}$, where $A_i \oplus B_i$ is the partial sum
$Carry_i = A_iB_i + (A_i + B_i)\,Carry_{i-1}$
Generate: $G_i = A_iB_i$
Propagate: $P_i = A_i + B_i$
$Carry_i = G_i + P_i \cdot Carry_{i-1}$
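The generate/propagate recurrence can be checked against ordinary integer addition. This is a sketch that takes Generate as A AND B and Propagate as A OR B, and evaluates the recurrence bit by bit. The function name is hypothetical.

```python
def carries_from_gp(a: int, b: int, n_bits: int, carry_in: int = 0):
    """Carry out of each bit using Carry_i = G_i + P_i * Carry_{i-1}."""
    carries, carry = [], carry_in
    for i in range(n_bits):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        g = a_i & b_i            # generate: this bit creates a carry by itself
        p = a_i | b_i            # propagate: this bit passes an incoming carry on
        carry = g | (p & carry)
        carries.append(carry)
    return carries

# Exhaustive check for 4-bit operands against integer arithmetic.
for a in range(16):
    for b in range(16):
        for cin in (0, 1):
            carries = carries_from_gp(a, b, 4, cin)
            for i in range(4):
                mask = (1 << (i + 1)) - 1
                assert carries[i] == ((a & mask) + (b & mask) + cin) >> (i + 1)
print("carry recurrence matches integer addition")
```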
High-performance Adders:
Kogge Stone
[Figure: 7-stage adder structure for even and odd input bits: (1) PG Gen., (2 to 6) carry-merge stages CM1 to CM5, (7) XOR, producing Sum_even and Sum_odd]
$GG = G_i + P_iG_{i-1}$
$GP = P_iP_{i-1}$
• Generate all 32 carries: full-blown binary tree → energy-inefficient
• # Carry-merge stages = log2(32) → 5 stages
Kogge-Stone Adder
[Figure: 32-bit Kogge-Stone adder, bits 31 to 0, with PG, carry-merge gate, and XOR stages]
Critical path = PG+5+XOR = 7 gate stages
• Generate, Propagate fanout of 2,3
• Maximum interconnect spans 16b
Energy inefficient
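A behavioural sketch of the Kogge-Stone carry computation, using the group merge rules GG = Gi + Pi·Gi-1 and GP = Pi·Pi-1 with a zero carry-in. It models logic only, not circuit timing. The merge count it reports is a property of this sketch (one cell per merged bit per stage), and the function name is hypothetical.

```python
def kogge_stone_carries(a: int, b: int, n_bits: int = 32):
    """Return (carry out of each bit, number of carry-merge cells), carry-in = 0."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n_bits)]      # bit generate
    p = [((a >> i) | (b >> i)) & 1 for i in range(n_bits)]    # bit propagate
    merges, stages, span = 0, 0, 1
    while span < n_bits:
        new_g, new_p = g[:], p[:]
        for i in range(span, n_bits):
            new_g[i] = g[i] | (p[i] & g[i - span])   # GG = Gi + Pi * Gi-1
            new_p[i] = p[i] & p[i - span]            # GP = Pi * Pi-1
            merges += 1
        g, p = new_g, new_p
        span *= 2
        stages += 1
    assert stages == n_bits.bit_length() - 1         # log2(32) = 5 for 32 bits
    return g, merges

carries, merges = kogge_stone_carries(0x12345678, 0x0FEDCBA9)
print("carry-merge cells in this sketch:", merges)
```

Every bit position computes its own carry in parallel, which is what makes the tree fast but also what makes it gate- and wire-hungry.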
Sparse-tree Adder Architecture
[Figure: 32-bit sparse-tree adder, bits 31 to 0, generating carries C3, C7, C11, C15, C19, C23, and C27]
• Generate every 4th carry in parallel
• Side-path: 4-bit conditional sum generator
• 73% fewer carry-merge gates → energy-efficient
Non-critical Sum Generator
[Figure: 4-bit conditional sum generator: inputs Pi, Gi through Pi+3, Gi+3; carry-merge (CM) and XOR gates form the conditional sums Sumi,0 and Sumi,1 for carry-in 0 and 1; 2:1 muxes driven by the sparse-tree Carry select Sumi through Sumi+3]
• Non-critical path: ripple carry chain
• Reduced area, energy consumption, leakage
• Generate conditional sums for each bit
• Sparse-tree carry selects appropriate sum
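A behavioural sketch of the conditional-sum idea: each 4-bit group precomputes its sum for carry-in 0 and carry-in 1, and the carry into the group selects one of them. In this sketch the group carries are produced serially as a stand-in for the parallel sparse tree, so it illustrates the selection logic only. The function name is hypothetical.

```python
def sparse_tree_add(a: int, b: int, n_bits: int = 32, group: int = 4) -> int:
    """Add two n-bit numbers (carry-in 0) and return the n-bit sum."""
    mask = (1 << group) - 1
    result, carry = 0, 0
    for lo in range(0, n_bits, group):
        a_g = (a >> lo) & mask
        b_g = (b >> lo) & mask
        # Side path: conditional sums for both possible carry-ins.
        sum0 = a_g + b_g          # assumes carry-in = 0
        sum1 = a_g + b_g + 1      # assumes carry-in = 1
        # The carry into this group (C3, C7, ... in hardware) acts as the mux select.
        chosen = sum1 if carry else sum0
        result |= (chosen & mask) << lo
        carry = chosen >> group   # stand-in for the sparse-tree carry
    return result

print(hex(sparse_tree_add(0x12345678, 0x0FEDCBA9)))
```

The design point is that the conditional sums sit off the critical path, so they can use a slow, small ripple chain while only every 4th carry needs the fast tree.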
Adder Core Critical Path
[Figure: critical path from the adder inputs through PG, GG1, GG3, GG7, GG15, and GG27 to C27 (single-rail dynamic sparse-tree path, clocked by clk, clk2, clk3), followed by a latch and the static sum generator (CM0, CM1, XOR), which selects Sum31 from Sum31_0 and Sum31_1]
• Critical path: 7 gate stages → same as KS
• Sparse-tree: single-rail dynamic
• Exploit non-criticality of sum generator
• Convert to static logic → Semi-dynamic design
Sparse-tree Architecture
• Performance impact: (20% speedup)
  33-50% reduced G/P fanouts
  80% reduced wiring complexity
  30% reduction in maximum interconnect
• Power impact: (56% reduction)
  73% fewer carry-merge gates
  50% reduction in average transistor size
Energy-delay Space
130nm CMOS, 1.2V, 110°C
[Figure: Worst-case Energy (pJ, 0 to 100) versus Delay (ps, 140 to 280) for Dynamic Kogge-Stone and Semi-dynamic Sparse-Tree; annotations: 4GHz Design, 20%, 56%]
• 20% speedup over Kogge-Stone
• 56% worst-case energy reduction
• Scales with activity factor
Semi-dynamic Design
[Figure: Average Energy (pJ, 0 to 40) versus Activity factor (0 to 0.5) for Dynamic Kogge-Stone and Semi-dynamic Sparse-Tree; annotation: 71%]
• Static sum generators: low switching activity
• 71% lower average energy at 10% activity
So, How Do We Get There?
[Figure: MIPS (log scale, 0.01 to 1,000,000) versus year, 1970 to 2010, from the 8086, 286, 386, and 486 through the Era of Pipelined Architecture, the Era of Instruction Level Parallelism (Super Scalar; Speculative, OOO), and the Era of Thread & Processor Level Parallelism (Multi Threaded; Multi-Threaded, Multi-Core), plus Special Purpose HW]
Significant Challenges Ahead
Can only be solved with joint industry-university collaboration
Thank You for Your Attention
Q&A
Our publications can be found in:
• IEEE Intl. Solid-State Circuits Conference, 2001
• IEEE Journal of Solid-State Circuits, 2001
• Symposium on VLSI Circuits, 1999
• Intl. Symposium on Low-power Design, 1999
• Custom Integrated Circuits Conference, SOCC, etc., 1999
Backup
Optimized First-level Carry-merge
Conditional Carry for Cin=0
[Figure: first-level carry-merge (CM) gate with inputs Pi, Gi and Cin=0, producing C#_0]
• Carry-merge stage reduces to inverter
• Conditional carry_0 = Gi#
Optimized First-level Carry-merge
Conditional carry for Cin=1
[Figure: first-level carry-merge (CM) gate with inputs Pi, Gi and Cin=1, producing C#_1]
Ai | Bi | Pi | Gi | C#_1
0 | 0 | 0 | 0 | 1
0 | 1 | 1 | 0 | 0
1 | 0 | 1 | 0 | 0
1 | 1 | 1 | 1 | 0
• Pi & Gi correlated
• Conditional carry_1 = Pi#
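Both first-level simplifications can be confirmed by enumerating the four input combinations. This sketch takes Pi as Ai OR Bi and Gi as Ai AND Bi, and reads the `#` suffix as logical complement. The function name is hypothetical.

```python
from itertools import product

def carry_out(a: int, b: int, cin: int) -> int:
    """First-level carry-merge: G + P * Cin."""
    g, p = a & b, a | b
    return g | (p & cin)

for a, b in product((0, 1), repeat=2):
    g, p = a & b, a | b
    assert not (g and not p)                # G and P are correlated: G implies P
    assert carry_out(a, b, 0) == g          # Cin = 0: carry is just G
    assert carry_out(a, b, 1) == p          # Cin = 1: G + P collapses to P
    print(a, b, "C#_0 =", 1 - g, "C#_1 =", 1 - p)
```

With the carry-in fixed, the merge gate is not needed at all: the complemented conditional carries are simply the inverses of Gi and Pi.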
Optimized Sum Generator
[Figure: optimized 4-bit sum generator: inputs Pi, Gi+1, Pi+1 through Pi+3, Gi+3; optimized 1st-level carry-merge; CM and XOR gates form Sumi,0 and Sumi,1; 2:1 muxes driven by Carry select Sumi through Sumi+3]
• Optimized non-critical path: 4 stages
