
EE4271 VLSI Design
Interconnect Optimizations
Buffer Insertion
Moore’s law
Twice the number of transistors approximately every two years, so clock frequency doubles accordingly
Interconnects Dominate
[Figure: delay (psec, 0 to 300) versus technology generation (0.8, 0.5, 0.35, 0.25 μm); curves: interconnect delay and transistor/gate delay]
Source: Gordon Moore, Chairman Emeritus, Intel Corp.
This is why Moore’s law is not true anymore.
Objectives
• What have we learned?
– Compute circuit delay on wires and gates
– Gate delay optimization
• What are we going to learn?
– Interconnect delay optimization: buffer insertion
• Why reduce delay
• How to perform it
– This is the most important optimization in circuit design
Why this trend?
[Figure: same plot as above, delay (psec) versus technology generation (μm) for interconnect delay and transistor/gate delay. Source: Gordon Moore, Chairman Emeritus, Intel Corp.]
A scaling primer
[Figure: a transistor with gate G, source S and drain D; wires of width w, spacing S, height h and length l, scaled to ws, Ss, hs and ls]
• Ideal process scaling:
– Device geometries shrink by s (= 0.7x)
• Device delay shrinks by s
– Wire geometries shrink by s
• Unit resistance: r/(ws·hs) = r/s^2
• Unit coupling capacitance Cc: proportional to (hs)/(Ss) = h/S
• Resistance doubled (1/s^2 ≈ 2 for s = 0.7), capacitance roughly unchanged for unit length
• How about the change in wire length?
Technology scaling
• Global (long) interconnect lengths don’t shrink
– Global interconnects link cells far apart
• Local (short) interconnect lengths shrink by s
– Local interconnects link cells nearby
Interconnect delay scaling
• Delay of a wire of length l:
t_int = (rl)(cl) = rcl^2
(a quadratic function of length)
• Local interconnects:
t_int: (r/s^2)(c)(ls)^2 = rcl^2
– Local interconnect delay unchanged
• Global interconnects:
t_int: (r/s^2)(c)(l)^2 = (rcl^2)/s^2
– Global interconnect delay doubled (1/s^2 ≈ 2 for s = 0.7) – unsustainable!
• Interconnect delay increasingly more dominant
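The scaling argument above can be checked numerically. The following is a minimal Python sketch, assuming the ideal-scaling model from the slides (unit resistance grows by 1/s^2, unit capacitance stays the same, only local wires shrink by s). The function name and the sample values r = c = 1, l = 10 are illustrative assumptions, not part of the lecture.

```python
def scaled_wire_delay(r, c, length, s, is_local):
    """Wire delay r*c*l^2 after one ideal scaling step with factor s."""
    r_scaled = r / s**2                             # unit resistance grows by 1/s^2
    c_scaled = c                                    # unit capacitance roughly unchanged
    l_scaled = length * s if is_local else length   # only local wires shrink
    return r_scaled * c_scaled * l_scaled**2

# Hypothetical values, chosen only to show the ratios.
r, c, length, s = 1.0, 1.0, 10.0, 0.7
base = r * c * length**2
local_ratio = scaled_wire_delay(r, c, length, s, is_local=True) / base
global_ratio = scaled_wire_delay(r, c, length, s, is_local=False) / base
print(local_ratio, global_ratio)
```

The local ratio stays at about 1, while the global ratio is 1/s^2, roughly 2 for s = 0.7.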
Buffer Insertion For Delay Reduction
Elmore Delay for Wire
[Figure: a wire of length x with unit wire resistance r and unit wire capacitance c, driving load capacitance C]
Elmore Delay for Buffer
[Figure: a buffer between nodes u and v driving load C, modeled by its input capacitance and driving resistance]
Elmore Delay for A Circuit
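• Delay = Σ_{all Ri} Σ_{all Cj downstream from Ri} Ri*Cj
• Elmore delay to n1 = R(B)*(C1+C2)
• Elmore delay to n2 = R(B)*(C1+C2)+R(w)*C2
[Figure: buffer B with driving resistance R(B) drives node n1 (capacitance C1); a wire of resistance R(w) connects n1 to node n2 (capacitance C2)]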
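The Elmore sum can be written as a short Python sketch. The function name, the list-based representation of the path, and the numeric values are assumptions made for illustration; only the formula itself comes from the slide.

```python
def elmore_delay(path_resistances, downstream_caps):
    """Sum of R_i times the total capacitance downstream of R_i.

    path_resistances: resistances on the path from the source to the target
    downstream_caps:  for each resistance, the capacitances it has to charge
    """
    return sum(r * sum(caps) for r, caps in zip(path_resistances, downstream_caps))

# Hypothetical values for the buffer-and-wire circuit above.
R_B, R_w, C1, C2 = 2.0, 3.0, 1.0, 4.0

# Delay to n1: only R(B) is on the path, and it charges C1 and C2.
delay_n1 = elmore_delay([R_B], [[C1, C2]])
# Delay to n2: R(B) charges C1 and C2, R(w) charges only C2.
delay_n2 = elmore_delay([R_B, R_w], [[C1, C2], [C2]])
print(delay_n1, delay_n2)
```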
Buffers Reduce Wire Delay
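[Figure: a wire of length x split by a buffer into two segments of length x/2, each modeled with resistance rx/2 and capacitance cx/4 at each end; the driver and the buffer have resistance R, the buffer input and the sink have capacitance C; ∆t marks the delay difference]
t_unbuf = R( cx + C ) + rx( cx/2 + C )
t_buf = 2R( cx/2 + C ) + rx( cx/4 + C )
t_buf – t_unbuf = RC – rcx^2/4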
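The difference t_buf – t_unbuf = RC – rcx^2/4 is negative only when the wire is long enough, namely x > 2·sqrt(RC/(rc)). The Python sketch below evaluates both expressions from the slide; the function names and the values R = C = r = c = 1 are illustrative assumptions.

```python
import math

def t_unbuf(R, C, r, c, x):
    """Elmore delay of the unbuffered wire of length x."""
    return R * (c * x + C) + r * x * (c * x / 2 + C)

def t_buf(R, C, r, c, x):
    """Elmore delay with one buffer in the middle of the wire."""
    return 2 * R * (c * x / 2 + C) + r * x * (c * x / 4 + C)

def break_even_length(R, C, r, c):
    """Length at which RC - r*c*x^2/4 = 0; longer wires gain from a buffer."""
    return 2 * math.sqrt(R * C / (r * c))

R, C, r, c = 1.0, 1.0, 1.0, 1.0          # hypothetical values
x0 = break_even_length(R, C, r, c)
gain_short = t_buf(R, C, r, c, 1.0) - t_unbuf(R, C, r, c, 1.0)
gain_long = t_buf(R, C, r, c, 4.0) - t_unbuf(R, C, r, c, 4.0)
print(x0, gain_short, gain_long)
```

A positive difference means the buffer makes the wire slower, which is what happens for the short wire here.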
Buffered global interconnects: Intuition
[Figure: a wire of length l, and the same wire divided by buffers into segments l1, l2, l3, ..., ln]
Unbuffered interconnect delay = r.c.l^2/2
Buffered interconnect delay = Σ r.c.li^2/2 < r.c.l^2/2 (where l = Σ lj)
since Σ (lj^2) < (Σ lj)^2
(Of course, we need to consider buffer delay as well)
Optimal Buffer Insertion on A
Wire
• Delay before buffer insertion = rcL^2/2
[Figure: a wire of total length L divided by identical buffers into segments of length l]
Rd – On resistance of inverter
Cg – Gate input capacitance
r,c – unit resistance and capacitance
• Assume N identical buffers with equal inter-buffer length l (L = Nl)
$$T = N\left[R_d(C_g + cl) + rl\left(C_g + \frac{cl}{2}\right)\right] = L\left[\frac{rcl}{2} + (rC_g + R_d c) + \frac{1}{l}R_d C_g\right]$$
• For minimum delay, dT/dl = 0:
$$L\left[\frac{rc}{2} - \frac{R_d C_g}{l_{opt}^2}\right] = 0 \quad\Rightarrow\quad l_{opt} = \sqrt{\frac{2R_d C_g}{rc}}$$
Optimal interconnect delay
• Substituting lopt back into the interconnect delay
expression:
$$T_{opt} = L\left[\frac{rc\,l_{opt}}{2} + (rC_g + R_d c) + \frac{1}{l_{opt}}R_d C_g\right]$$
$$= L\left[\frac{rc}{2}\sqrt{\frac{2R_d C_g}{rc}} + (rC_g + R_d c) + \frac{R_d C_g}{\sqrt{2R_d C_g/(rc)}}\right]$$
$$T_{opt} = L\left[\sqrt{2R_d C_g rc} + (rC_g + R_d c)\right]$$
Delay grows linearly with L (instead of quadratically)
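The closed-form result can be cross-checked against the per-stage delay expression. This is a minimal Python sketch: the function names and the parameter values (r = c = 1, Rd = 2, Cg = 1, L = 20) are assumptions chosen to give round numbers, and N = L/l is treated as a real number, as in the derivation.

```python
import math

def buffered_delay(r, c, Rd, Cg, L, l):
    """T = N [ Rd (Cg + c l) + r l (Cg + c l / 2) ] with N = L / l."""
    n = L / l
    return n * (Rd * (Cg + c * l) + r * l * (Cg + c * l / 2))

def optimal_buffering(r, c, Rd, Cg, L):
    """Closed-form optimal spacing and the resulting delay."""
    l_opt = math.sqrt(2 * Rd * Cg / (r * c))
    t_opt = L * (math.sqrt(2 * Rd * Cg * r * c) + r * Cg + Rd * c)
    return l_opt, t_opt

r, c, Rd, Cg, L = 1.0, 1.0, 2.0, 1.0, 20.0     # hypothetical values
l_opt, t_opt = optimal_buffering(r, c, Rd, Cg, L)
t_unbuffered = r * c * L**2 / 2
print(l_opt, t_opt, t_unbuffered)
```

For these values the optimal spacing is 2 and the buffered delay is half the unbuffered delay; a spacing of 1 or 4 gives a larger delay than the optimum.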
Total buffer count
[Figure: % cells used to buffer nets (0 to 80) at the 90nm, 65nm, 45nm and 32nm nodes; series: clk-buf, buf, tot-buf]
• Ever-increasing fractions of total cell count will be buffers
– 70% in 32nm
– 25% is widely observed
ITRS projections
[Figure: relative delay (log scale, 0.1 to 100) versus feature size (250, 180, 130, 90, 65, 45, 32 nm); curves: gate delay, local interconnect (M1,2), global interconnect with repeaters, global interconnect without repeaters]
Source: ITRS, 2003
Exercise 1
• Given a wire of length 10 with r=2, c=2,
what is its delay?
• Given a buffer with Rd =10, Cg=20, after
optimally buffering the wire, what is the
delay?
• What if wire length is 100?
• Any conclusion?
Exercise 2
• Relationship with gate sizing
– If we can size the buffer, what is the best
buffer size?
– Let R0 denote the unit size buffer driving
resistance, and C0 denote the unit size buffer
input capacitance. Thus, Rd=R0/h and Cg=C0h
– What is the best h leading to the smallest delay?
Analogy
• Advancing technology = period of city
expansion, more transistors = larger city
• Interconnects = streets
• Buffers = gas stations
• Signal delay (timing) = time to cross the
city
• Buffer insertion = gas station construction
Previous Result is Only Theoretical: Discrete Buffer Locations
Candidate buffer locations
RAT: Required Arrival Time
AT: Arrival Time
Slack: RAT - AT
[Figure: a driver and a sink connected by a wire with delay 80. At the sink, RAT = 100, AT = 80, slack = 20. At the driver, AT = 0, RAT = 20, slack = 20]
Minimizing circuit delay = maximizing RAT at driver = maximizing slack at driver
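The bookkeeping behind these numbers is a one-line subtraction in each direction: RAT propagates backward from the sink, AT propagates forward from the driver. A minimal Python sketch using the slide's values (sink RAT = 100, driver AT = 0, wire delay = 80); the variable and function names are illustrative assumptions.

```python
def slack(rat, at):
    """Slack = RAT - AT."""
    return rat - at

sink_rat, driver_at, wire_delay = 100, 0, 80

driver_rat = sink_rat - wire_delay   # RAT moves backward: subtract the delay
sink_at = driver_at + wire_delay     # AT moves forward: add the delay

driver_slack = slack(driver_rat, driver_at)
sink_slack = slack(sink_rat, sink_at)
print(driver_rat, sink_at, driver_slack, sink_slack)
```

Both ends see the same slack, so with AT = 0 at the driver, maximizing the driver's RAT is the same as maximizing its slack.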
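Motivation for Problem Formulation
RAT = Required Arrival Time
Slack = RAT - AT
[Figure: a driver with two sinks. Without a buffer, the sinks have (RAT = 300, AT = 350, slack = -50) and (RAT = 700, AT = 400, slack = 300), so the slack at the driver is -50. With a buffer that decouples the capacitive load from the critical path, the sinks have (RAT = 300, AT = 250, slack = 50) and (RAT = 700, AT = 600, slack = 100), so the slack at the driver is 50]
We need to maximize slack or RAT at the driver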
Timing Driven Buffering
Problem Formulation
• Given
– A Steiner tree
– RAT at each sink
– A buffer type
– RC parameters
– Candidate buffer locations
• Find buffer insertion solution such that the
slack (or RAT) at the driver is maximized
An Example for Buffer Insertion
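• r = 1, c = 1
• Rb = 1, Cb = 1
• Rd = 1
[Figure: sink v1 connected to the driver through two wire segments of length 2, with nodes v2 and v3; candidates are written as (node, capacitance, RAT)]
• Start at the sink: (v1, 1, 20)
• Add wire: (v2, 3, 16)
• Insert buffer: (v2, 1, 13)
• Add wire: (v2, 3, 16) becomes (v3, 5, 8); (v2, 1, 13) becomes (v3, 3, 9)
• Add driver: (v3, 5, 8) gives slack = 3; (v3, 3, 9) gives slack = 6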
Candidate Buffering Solution
• Definition
• Each candidate solution is associated with
– vi: a node
– ci: downstream capacitance
– qi: RAT
• If vi is a sink, ci is the sink capacitance
• Otherwise vi is an internal node
Van Ginneken’s Algorithm
Candidate solutions are propagated toward the source
Solution Propagation: Add Wire
[Figure: a wire of length x from upstream node v2 to downstream node v1 turns (v1, c1, q1) into (v2, c2, q2)]
• c2 = c1 + cx
• q2 = q1 – rcx^2/2 – rxc1
• r: wire resistance per unit length
• c: wire capacitance per unit length
Solution Propagation: Insert Buffer
[Figure: a buffer inserted at node v1 turns (v1, c1, q1) into (v1, c1b, q1b)]
• c1b = Cb
• q1b = q1 – Rbc1
• Cb: buffer capacitance
• Rb: buffer resistance
Solution Propagation: Add Driver
(v0, c0d, q0d)
(v0, c0, q0)
• q0d = q0 – Rdc0
• Rd: driver resistance
• Pick solution with max slack
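The three propagation rules (add wire, insert buffer, add driver) translate directly into code. The sketch below is a minimal Python version that replays the earlier buffer insertion example with r = c = Rb = Cb = Rd = 1 and sink (v1, 1, 20). The `Candidate` type, the function names, and the segment length of 2 (read from the figure labels) are assumptions for illustration.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    c: float  # downstream capacitance
    q: float  # required arrival time (RAT)

def add_wire(cand, x, r, c):
    """c2 = c1 + c*x, q2 = q1 - r*c*x^2/2 - r*x*c1."""
    return Candidate(cand.c + c * x, cand.q - r * c * x**2 / 2 - r * x * cand.c)

def insert_buffer(cand, Rb, Cb):
    """c1b = Cb, q1b = q1 - Rb*c1."""
    return Candidate(Cb, cand.q - Rb * cand.c)

def add_driver(cand, Rd):
    """q0d = q0 - Rd*c0; with AT = 0 at the driver this is the slack."""
    return cand.q - Rd * cand.c

r = c = Rb = Cb = Rd = 1
sink = Candidate(1, 20)                       # (v1, 1, 20)
at_v2 = add_wire(sink, 2, r, c)               # unbuffered candidate at v2
at_v2_buf = insert_buffer(at_v2, Rb, Cb)      # buffered candidate at v2
at_v3 = add_wire(at_v2, 2, r, c)              # unbuffered path continued to v3
at_v3_buf = add_wire(at_v2_buf, 2, r, c)      # buffered path continued to v3
slack_unbuffered = add_driver(at_v3, Rd)
slack_buffered = add_driver(at_v3_buf, Rd)
print(at_v2, at_v2_buf, at_v3, at_v3_buf, slack_unbuffered, slack_buffered)
```

The computed candidates match the tuples in the example: (3, 16), (1, 13), (5, 8) and (3, 9), with slacks 3 and 6 at the driver.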
Exercise
[Figure: a wire with three segments of length 2 driving a sink with (capacitance, RAT) = (20,400)]
Unit Wire Cap = 5
Unit Wire Res = 3
Buffer C=5, R=1
Perform buffer insertion to maximize the slack at driver
Exponential Runtime
[Figure: the number of candidate solutions doubles at each candidate buffer location moving toward the source: 2, 4, 8, 16 solutions]
n candidate buffer locations lead to 2^n solutions
Solution Pruning
• Two candidate solutions
– (v, c1, q1)
– (v, c2, q2)
• Solution 1 is inferior if
– c1 ≥ c2 : larger load
– and q1 ≤ q2 : tighter timing
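One common way to apply this rule to a whole candidate list is to sort by load and sweep once, keeping a candidate only if its RAT beats everything with smaller or equal load. The Python sketch below assumes candidates at the same node are stored as (c, q) pairs; the function name and the sample list are illustrative.

```python
def prune(candidates):
    """Drop every (c, q) pair that is inferior to another pair at the same node."""
    kept = []
    best_q = float("-inf")
    # Increasing load; for equal load the larger q comes first.
    for c, q in sorted(candidates, key=lambda cq: (cq[0], -cq[1])):
        if q > best_q:          # strictly better timing than all lighter loads
            kept.append((c, q))
            best_q = q
    return kept

# Hypothetical candidate list at one node.
survivors = prune([(40, 40), (5, 0), (15, 160), (5, 145)])
print(survivors)
```

Here (5, 0) loses to (5, 145) and (40, 40) loses to (15, 160), so two candidates remain. After pruning, q increases strictly with c along the list.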
An Analogy - 1
LOAD
Faster -> Smaller Delay -> Larger RAT
(since RAT = RAToutput - Delay)
Larger Load -> Larger Capacitance
An Analogy - 2
• Faster & smaller load (larger RAT, smaller capacitance): Good
• Slower & larger load (smaller RAT, larger capacitance): Inferior
An Analogy - 3
END
Who will be the winner?
Cannot tell at this moment,
so keep both of them.
An Analogy - 4
END
Who will be the winner?
Cannot tell at this moment,
so keep both of them.
Pruning When Insert Buffer
All buffered candidates at a node have the same load capacitance Cb, so only the one with the maximum q is kept
Generating Candidates
(1)
(2)
(3)
From Dr. Charles Alpert
Pruning Candidates
(3)
(a)
(b)
Both (a) and (b) “look” the same to the source.
Throw out the one with the worse slack
(4)
Candidate Example Continued
(4)
(5)
Candidate Example Continued
After pruning
(5)
At driver, compute which candidate maximizes
slack. Result is optimal.
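Putting generation and pruning together for the simplest case, a single wire with one sink, gives the sketch below. It is a minimal Python illustration, not the lecture's implementation: it handles no branching, and the function name, the candidate layout, and the driver resistance Rd = 1 are assumptions. The other values come from the exercise above (sink (20,400), three segments of length 2, unit wire cap 5, unit wire res 3, buffer C = 5, R = 1).

```python
def van_ginneken_wire(sink, segments, r, c, Rb, Cb, Rd):
    """Return (best slack at driver, buffer positions) for a two-pin wire.

    sink:     (capacitance, RAT) at the sink
    segments: wire lengths listed from the sink toward the driver; the
              upstream end of each segment is a candidate buffer location
    """
    # Each candidate is (load capacitance, RAT, indices of buffered locations).
    cands = [(sink[0], sink[1], ())]
    for i, x in enumerate(segments):
        # Add wire: c2 = c1 + c*x, q2 = q1 - r*c*x^2/2 - r*x*c1
        cands = [(cl + c * x, q - r * c * x * x / 2 - r * x * cl, bufs)
                 for cl, q, bufs in cands]
        # Insert buffer: every buffered candidate has load Cb, so only the
        # one with the largest resulting q is generated.
        cl, q, bufs = max(cands, key=lambda s: s[1] - Rb * s[0])
        cands.append((Cb, q - Rb * cl, bufs + (i,)))
        # Prune: scan by increasing load and keep a candidate only if its
        # RAT beats every candidate with smaller or equal load.
        kept, best_q = [], float("-inf")
        for s in sorted(cands, key=lambda s: (s[0], -s[1])):
            if s[1] > best_q:
                kept.append(s)
                best_q = s[1]
        cands = kept
    # Add driver: q0d = q0 - Rd*c0, then pick the maximum.
    return max(((q - Rd * cl, bufs) for cl, q, bufs in cands),
               key=lambda t: t[0])

best_slack, buffer_positions = van_ginneken_wire(
    sink=(20, 400), segments=[2, 2, 2], r=3, c=5, Rb=1, Cb=5, Rd=1)
print(best_slack, buffer_positions)
```

Because at most one new candidate survives per location, the list grows by at most one entry per step instead of doubling. With the assumed Rd = 1, the best choice buffers the first two locations counted from the sink.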
Example
Unit Wire Cap = 5
Unit Wire Res = 3
Buffer C=5, R=1
[Figure: a wire with three segments of length 2 and a sink (20,400); candidates are written as (load capacitance, RAT)]
• Sink: (20,400)
• After the first segment: (30,250) without a buffer, (5,220) with a buffer
• After the second segment: (30,250) gives (40,40) and (5,0); (5,220) gives (15,160) and (5,145)
Example Cont’d
• (5,0) is inferior to (5,145). (40,40) is inferior to (15,160)
• After the third segment, inserting a buffer gives (5,15) from (15,160) and (5,70) from (5,145)
Pick solution with largest slack, follow arrows to get solution
Exercise
• Without pruning, there will be an exponential number of candidate solutions (i.e., given n candidate buffer locations, there will be 2^n solutions). With pruning, how many solutions will we have?
Exercise
• Continue the following buffer insertion
process. Assume that all partial candidate
buffering solutions are as shown.
2
Unit Wire Cap = 1
Unit Wire Res = 1
Buffer C=1, R=1
2
(10,40)
(8,50)
(5,10)
(15,40)
(7,10)
(9,30)
(12,20)
Summary
• Interconnect delay increases with technology scaling
• Linear interconnect delay with buffer insertion
• Buffer insertion with candidate buffer locations
• Pruning for accelerating the buffer insertion technique