Transcript EE5900

Interconnect Optimizations

A scaling primer
• Ideal process scaling:
  – Device geometries shrink by $s$ (= 0.7x)
    • Device delay shrinks by $s$
  – Wire geometries shrink by $s$
    • R/µm: $\rho/(ws \cdot hs) = r/s^2$
    • Cc/µm: $(hs)\,\varepsilon/(Ss) = C_c$
    • C/µm: similar
    • R/µm doubles, C/µm and Cc/µm unchanged

[Figure: wire cross-sections with width w, height h, spacing S and length l, scaled to ws, hs and Ss; transistor with gate G, source S and drain D]

Interconnect role
• Short (local) interconnect
  – Used to connect nearby cells
  – Minimize wire C, i.e., use short min-width wires
• Medium to long-distance (global) interconnect
  – Size wires to trade off area vs. delay
  – Increasing width ⇒ capacitance increases, resistance decreases
  – Need to find an acceptable tradeoff: the wire sizing problem
• “Fat” wires
  – Thicker cross-sections in higher metal layers
  – Useful for reducing delays for global wires
  – Inductance issues, sharing of limited resource

Cross-Section of A Chip

Block scaling
• Block area often stays the same
  – # cells, # nets doubles
  – Wiring histogram shape invariant
• Global interconnect lengths don’t shrink
• Local interconnect lengths shrink by $s$

Interconnect delay scaling
• Delay of a wire of length $l$: $t_{int} = (rl)(cl) = rcl^2$ (first order)
• Local interconnects: $t_{int}$: $(r/s^2)(c)(ls)^2 = rcl^2$
  – Local interconnect delay unchanged (compare to faster devices)
• Global interconnects: $t_{int}$: $(r/s^2)(c)(l)^2 = (rcl^2)/s^2$
  – Global interconnect delay doubles: unsustainable!
• Interconnect delay increasingly more dominant
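The scaling arithmetic above can be checked numerically. The following is a minimal sketch, assuming one ideal scaling step with s = 0.7 and normalized values r = c = l = 1; the function names are illustrative and not from the slides.

```python
def wire_delay(r, c, length):
    """First-order wire delay t_int = r * c * length**2."""
    return r * c * length ** 2


def scaled_delays(r, c, length, s=0.7):
    """Return (local, global) wire delays after one ideal scaling step.

    Resistance per unit length becomes r / s**2 and capacitance per unit
    length stays c. Local wires shrink to length * s; global wires keep
    their length.
    """
    r_scaled = r / s ** 2
    local = wire_delay(r_scaled, c, length * s)
    global_ = wire_delay(r_scaled, c, length)
    return local, global_


base = wire_delay(1.0, 1.0, 1.0)
local, global_ = scaled_delays(1.0, 1.0, 1.0)
print(local / base)    # ratio for a local wire (stays at about 1)
print(global_ / base)  # ratio for a global wire (about 1 / s**2, roughly 2)
```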

Buffer Insertion For Delay Reduction

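Analysis of Simple RC Circuit

[Figure: source $v_T(t)$ driving capacitor C through resistor R, with current $i(t)$ and capacitor voltage $v(t)$]

$R\,i(t) + v(t) = v_T(t)$

$i(t) = \frac{d(Cv(t))}{dt} = C\frac{dv(t)}{dt}$

$RC\frac{dv(t)}{dt} + v(t) = v_T(t)$

$v(t)$: state variable; $v_T(t)$: input waveform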

Analysis of Simple RC Circuit
Step-input response:

[Figure: input step $v_0 u(t)$ and output $v_0(1 - e^{-t/RC})u(t)$]

$RC\frac{dv(t)}{dt} + v(t) = v_0 u(t)$

$v(t) = Ke^{-t/RC} + v_0 u(t)$

match initial state: $v(0) = 0 \Rightarrow K + v_0 u(t) = 0$

output response for step-input: $v(t) = v_0(1 - e^{-t/RC})u(t)$

Delays of Simple RC Circuit
• $v(t) = v_0(1 - e^{-t/RC})$: waveform under step input $v_0 u(t)$
• $v(t) = 0.5v_0 \Rightarrow t = 0.69RC$
  – i.e., delay = 0.69RC (50% delay)
• $v(t) = 0.1v_0 \Rightarrow t = 0.1RC$; $v(t) = 0.9v_0 \Rightarrow t = 2.3RC$
  – i.e., rise time = 2.2RC (if defined as time from 10% to 90% of Vdd)
• Commonly used metric $T_D = RC$ (= Elmore delay)
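The threshold times above follow from inverting the step response: $t = -RC\ln(1 - f)$ for a fraction $f$ of $v_0$. A small sketch, with RC normalized to 1 and a hypothetical helper name:

```python
import math


def time_to_fraction(fraction, rc=1.0):
    """Time for v(t) = v0 * (1 - exp(-t / RC)) to reach fraction * v0."""
    return -rc * math.log(1.0 - fraction)


t50 = time_to_fraction(0.5)   # 50% delay, in units of RC
t10 = time_to_fraction(0.1)
t90 = time_to_fraction(0.9)
rise = t90 - t10              # 10%-90% rise time, in units of RC
print(t50, t10, t90, rise)
```

The slide values 0.69, 0.1, 2.3 and 2.2 are these numbers rounded.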


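Elmore Delay
• Driver is modeled as R
• Driver intrinsic gate delay t(B)
• Delay = $\sum_{\text{all } R_i} \sum_{\text{all } C_j \text{ downstream from } R_i} R_i C_j$
• Elmore delay at n2: R(B)*(C1+C2) + R(w)*C2
• Elmore delay at n1: R(B)*(C1+C2)

[Figure: driver B with resistance R(B) driving node n1 (capacitance C1), followed by wire resistance R(w) to node n2 (capacitance C2)]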
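The double sum can be evaluated in two passes over an RC tree: one to accumulate downstream capacitance, one to accumulate delay along each path. This is a minimal sketch, not taken from the slides; the dictionary-based tree representation and the numeric values are assumptions, and the intrinsic delay t(B) is left out.

```python
def elmore_delays(parent, resistance, cap):
    """Elmore delay at every node of an RC tree.

    parent[n]     -> parent of node n (None for the driver output node)
    resistance[n] -> resistance between parent[n] and n
                     (for the root, the driver resistance)
    cap[n]        -> capacitance lumped at node n
    Nodes must appear parent-before-child in `parent`.
    """
    nodes = list(parent)
    # Downstream capacitance: own cap plus the caps of all descendants.
    down = dict(cap)
    for n in reversed(nodes):
        if parent[n] is not None:
            down[parent[n]] += down[n]
    # Each resistor contributes R_i times the capacitance downstream of it.
    delay = {}
    for n in nodes:
        upstream = delay[parent[n]] if parent[n] is not None else 0.0
        delay[n] = upstream + resistance[n] * down[n]
    return delay


# Two-node circuit from the slide, with made-up values.
R_B, R_w, C1, C2 = 2.0, 3.0, 1.0, 4.0
d = elmore_delays(
    parent={"n1": None, "n2": "n1"},
    resistance={"n1": R_B, "n2": R_w},
    cap={"n1": C1, "n2": C2},
)
print(d["n1"], d["n2"])
```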

Elmore Delay
• For uniform wire
  – Unit wire capacitance c
  – Unit wire resistance r

[Figure: wire of length x driving load C]

• No matter how the wire is lumped, the Elmore delay is the same
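One way to see the lumping claim is to split the wire into any number of pi-sections and compare the result. The sketch below assumes pi-section lumping (half of each section's capacitance at each end) and ignores the driver; the function name and values are illustrative. Every section count gives the same value, $rx(cx/2 + C)$.

```python
def pi_model_delay(r, c, x, load, segments):
    """Elmore delay across a wire of length x split into pi-sections.

    Each section has resistance r*x/segments, with half of its
    capacitance c*x/segments lumped at each end.
    """
    seg_r = r * x / segments
    seg_c = c * x / segments
    delay = 0.0
    for i in range(segments):
        # Capacitance downstream of section i's resistor: its far-end
        # half, every later section in full, and the load.
        downstream = seg_c / 2 + (segments - 1 - i) * seg_c + load
        delay += seg_r * downstream
    return delay


for n in (1, 2, 10, 100):
    print(n, pi_model_delay(r=1.0, c=1.0, x=4.0, load=2.0, segments=n))
```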

Delay for Buffer

[Figure: buffer between nodes u and v, with load C at v and input capacitance C(b) seen at u]

• Input capacitance
• Driver resistance
• Intrinsic buffer delay

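Buffers Reduce Wire Delay

[Figure: driver R and load C connected by a wire of length x, unbuffered and with a buffer at x/2; each half is modeled as rx/2 with cx/4 at both ends; ∆t plotted against x]

$t_{unbuf} = R(cx + C) + rx(cx/2 + C)$

$t_{buf} = 2R(cx/2 + C) + rx(cx/4 + C) + t_b$

$t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$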
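Setting the difference $RC + t_b - rcx^2/4$ to zero gives the wire length beyond which a mid-point buffer helps, $x = 2\sqrt{(RC + t_b)/(rc)}$. A short sketch with illustrative unit values (the function names are not from the slides):

```python
import math


def t_unbuf(R, C, r, c, x):
    """Delay of the unbuffered wire of length x."""
    return R * (c * x + C) + r * x * (c * x / 2 + C)


def t_buf(R, C, r, c, x, t_b):
    """Delay with one buffer at x/2 (buffer has resistance R, input cap C)."""
    return 2 * R * (c * x / 2 + C) + r * x * (c * x / 4 + C) + t_b


def break_even_length(R, C, r, c, t_b):
    """Length where t_buf equals t_unbuf; longer wires gain from the buffer."""
    return 2 * math.sqrt((R * C + t_b) / (r * c))


R, C, r, c, t_b = 1.0, 1.0, 1.0, 1.0, 1.0
x0 = break_even_length(R, C, r, c, t_b)
for x in (1.0, x0, 4.0):
    print(x, t_buf(R, C, r, c, x, t_b) - t_unbuf(R, C, r, c, x))
```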

Combinational Logic Delay

[Figure: primary input → register → combinational logic → register → primary output, with both registers driven by the clock]

Combinational logic delay <= clock period

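Buffered global interconnects: Intuition

[Figure: a wire of length $l$, and the same wire split by buffers into segments $l_1, l_2, l_3, \ldots, l_n$]

Interconnect delay = $r \cdot c \cdot l^2$

Now, interconnect delay = $\sum_i r \cdot c \cdot l_i^2 < r \cdot c \cdot l^2$, since $\sum_j (l_j^2) < (\sum_j l_j)^2$ (where $l = \sum_j l_j$)

(Of course, account for buffer delay also)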

Optimal inter-buffer length
• First order (lumped parasitic, Elmore delay) analysis

[Figure: line of total length $L$ with buffers spaced $l$ apart]

  – $R_d$: on resistance of inverter
  – $C_g$: gate input capacitance
  – $r$, $c$: resistance, cap. per micron
• Assume $N$ identical buffers with equal inter-buffer length $l$ ($L = Nl$)

$T = N\left[R_d(C_g + cl) + rl(C_g + cl/2)\right] = L\left[\frac{rcl}{2} + (rC_g + R_d c) + \frac{1}{l}R_d C_g\right]$

• For minimum delay, $\frac{dT}{dl} = 0$:

$L\left[\frac{rc}{2} - \frac{R_d C_g}{l_{opt}^2}\right] = 0$

$l_{opt} = \sqrt{\frac{2R_d C_g}{rc}}$
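The closed form for $l_{opt}$ can be sanity-checked by evaluating $T$ on either side of it. The sketch below uses made-up parameter values and, as a simplifying assumption, lets $N = L/l$ be non-integer. It also prints $T/L$ at $l_{opt}$ for several lengths: the value stays constant, so the buffered delay scales linearly with $L$.

```python
import math


def buffered_delay(L, l, R_d, C_g, r, c):
    """T = N * [R_d*(C_g + c*l) + r*l*(C_g + c*l/2)] with N = L / l."""
    n = L / l
    return n * (R_d * (C_g + c * l) + r * l * (C_g + c * l / 2))


def optimal_spacing(R_d, C_g, r, c):
    """l_opt = sqrt(2 * R_d * C_g / (r * c))."""
    return math.sqrt(2 * R_d * C_g / (r * c))


R_d, C_g, r, c = 100.0, 2.0, 0.5, 0.2   # illustrative values only
l_opt = optimal_spacing(R_d, C_g, r, c)

# Delay is higher for spacings slightly below or above l_opt.
for factor in (0.9, 1.0, 1.1):
    print(factor, buffered_delay(1000.0, factor * l_opt, R_d, C_g, r, c))

# Delay per unit length at l_opt does not depend on L.
for L in (1000.0, 2000.0, 4000.0):
    print(L, buffered_delay(L, l_opt, R_d, C_g, r, c) / L)
```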

Optimal interconnect delay
• Substituting $l_{opt}$ back into the interconnect delay expression:

$T_{opt} = L\left[\frac{rc\,l_{opt}}{2} + (rC_g + R_d c) + \frac{1}{l_{opt}}R_d C_g\right]$

$= L\left[\frac{rc}{2}\sqrt{\frac{2R_d C_g}{rc}} + (rC_g + R_d c) + \frac{R_d C_g}{\sqrt{2R_d C_g/(rc)}}\right]$

$T_{opt} = L\left[\sqrt{2R_d C_g rc} + (rC_g + R_d c)\right]$

• Delay grows linearly with $L$ (instead of quadratically)

Total buffer count

[Figure: bar chart of clk-buf, buf and tot-buf (vertical axis 0 to 80) at 90nm, 65nm, 45nm and 32nm]

• Ever-increasing fractions of total cell count will be buffers
  – 70% in 32nm

ITRS projections

[Figure: relative delay (log scale, 0.1 to 100) vs. feature size (250, 180, 130, 90, 65, 45, 32 nm) for gate delay (fanout 4), local interconnect (M1,2), global interconnect with repeaters, and global interconnect without repeaters]

Source: ITRS, 2003

Buffers Improve Slack
• RAT = Required Arrival Time
• Slack = RAT - Delay
• Before buffering: slack_min = -50
  – RAT = 300, Delay = 350, Slack = -50
  – RAT = 700, Delay = 600, Slack = 100
• After buffering: slack_min = 50
  – RAT = 300, Delay = 250, Slack = 50
  – RAT = 700, Delay = 400, Slack = 300
• Decouple capacitive load from critical path
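The slack numbers on this slide can be reproduced directly from Slack = RAT - Delay. A tiny sketch (the helper name and the before/after grouping of the sinks are assumptions):

```python
def min_slack(sinks):
    """sinks: list of (RAT, delay) pairs; slack = RAT - delay per sink."""
    return min(rat - delay for rat, delay in sinks)


before = [(300, 350), (700, 600)]   # delays without the buffer
after = [(300, 250), (700, 400)]    # delays with the buffer
print(min_slack(before), min_slack(after))
```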

Timing Driven Buffering Problem Formulation
• Given
  – A Steiner tree
  – RAT at each sink
  – A buffer type
  – RC parameters
  – Candidate buffer locations
• Find buffer insertion solution such that the slack at the driver is maximized

Candidate Buffering Solutions

Candidate Solution Characteristics
• Each candidate solution is associated with
  – $v_i$: a node
  – $c_i$: downstream capacitance
  – $q_i$: RAT

[Figure: when $v_i$ is a sink, $c_i$ is the sink capacitance; otherwise $v$ is an internal node]

Van Ginneken’s Algorithm
• Candidate solutions are propagated toward the source
• Dynamic Programming

Solution Propagation: Add Wire

[Figure: wire of length x from $(v_2, c_2, q_2)$ to $(v_1, c_1, q_1)$]

• $c_2 = c_1 + cx$
• $q_2 = q_1 - rcx^2/2 - rxc_1$
• $r$: wire resistance per unit length
• $c$: wire capacitance per unit length

Solution Propagation: Insert Buffer

[Figure: buffer at $v_1$ turning $(v_1, c_1, q_1)$ into $(v_1, c_{1b}, q_{1b})$]

• $c_{1b} = C_b$
• $q_{1b} = q_1 - R_b c_1 - t_b$
• $C_b$: buffer input capacitance
• $R_b$: buffer output resistance
• $t_b$: buffer intrinsic delay

Solution Propagation: Merge

[Figure: branches $(v, c_l, q_l)$ and $(v, c_r, q_r)$ meeting at node $v$]

• $c_{merge} = c_l + c_r$
• $q_{merge} = \min(q_l, q_r)$

Solution Propagation: Add Driver

[Figure: driver at $v_0$ turning $(v_0, c_0, q_0)$ into $(v_0, c_{0d}, q_{0d})$]

• $q_{0d} = q_0 - R_d c_0 = slack_{min}$
• $R_d$: driver resistance
• Pick solution with max $slack_{min}$

Example of Solution Propagation
• r = 1, c = 1
• R_b = 1, C_b = 1, t_b = 1
• R_d = 1
• Without buffer: (v1, 1, 20) → add wire (length 2) → (v2, 3, 16) → add wire (length 2) → (v3, 5, 8) → add driver → slack = 3
• With buffer: (v2, 3, 16) → insert buffer → (v2, 1, 12) → add wire (length 2) → (v3, 3, 8) → add driver → slack = 5
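The four propagation rules translate into small functions, and replaying the example with them gives the same numbers. This is a sketch under one simplification: a candidate is stored as a (c, q) pair and the node label v is dropped. The function names are illustrative.

```python
def add_wire(cand, x, r, c):
    """c2 = c1 + c*x, q2 = q1 - r*c*x^2/2 - r*x*c1."""
    cap, q = cand
    return (cap + c * x, q - r * c * x * x / 2 - r * x * cap)


def insert_buffer(cand, R_b, C_b, t_b):
    """c1b = C_b, q1b = q1 - R_b*c1 - t_b."""
    cap, q = cand
    return (C_b, q - R_b * cap - t_b)


def merge(left, right):
    """c_merge = c_l + c_r, q_merge = min(q_l, q_r)."""
    return (left[0] + right[0], min(left[1], right[1]))


def add_driver(cand, R_d):
    """Slack at the driver: q0 - R_d*c0."""
    cap, q = cand
    return q - R_d * cap


r = c = R_b = C_b = t_b = R_d = 1
at_v1 = (1, 20)
at_v2 = add_wire(at_v1, 2, r, c)                  # (3, 16)
unbuffered = add_driver(add_wire(at_v2, 2, r, c), R_d)
buffered_v2 = insert_buffer(at_v2, R_b, C_b, t_b)  # (1, 12)
buffered = add_driver(add_wire(buffered_v2, 2, r, c), R_d)
print(unbuffered, buffered)                        # slack 3 vs. slack 5
```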

Example of Merging

[Figure: left candidates, right candidates, and merged candidates]

Solution Pruning
• Two candidate solutions
  – $(v, c_1, q_1)$
  – $(v, c_2, q_2)$
• Solution 1 is inferior if
  – $c_1 > c_2$: larger load
  – and $q_1 < q_2$: tighter timing

Pruning When Inserting a Buffer
• All buffered candidates have the same load cap $C_b$, so only the one with max $q$ is kept

Generating Candidates

[Figure: candidate generation, steps (1), (2) and (3)]

From Dr. Charles Alpert

Pruning Candidates

[Figure: step (3) with candidates (a) and (b), followed by step (4)]

Both (a) and (b) “look” the same to the source. Throw out the one with the worst slack.

Candidate Example Continued

[Figure: steps (4) and (5)]

Candidate Example Continued

[Figure: step (5), after pruning]

At driver, compute which candidate maximizes slack. Result is optimal.

Merging Branches

[Figure: left candidates and right candidates]

Pruning Merged Branches

[Figure: merged branches with the critical branch marked, with pruning]

Van Ginneken Example

[Figure: candidates (c, q) propagated from the sink toward the source]

• Sink: (20, 400)
• Wire C=10, d=150: (20, 400) → (30, 250)
• Buffer C=5, d=30: (30, 250) → (5, 220)
• Wire C=15: d=200 gives (30, 250) → (45, 50); d=120 gives (5, 220) → (20, 100)
• Buffer C=5: d=50 gives (45, 50) → (5, 0); d=30 gives (20, 100) → (5, 70)

Van Ginneken Example Cont’d
• (5, 0) is inferior to (5, 70). (45, 50) is inferior to (20, 100)
• Wire C=10: (20, 100) → (30, 10); (5, 70) → (15, -10)
• Pick solution with largest slack, follow arrows to get solution

Basic Data Structure

[Figure: list $(c_1, q_1)$, $(c_2, q_2)$, $(c_3, q_3)$; load cap gets worse and timing gets better along the list]

• Sorted list such that $c_1 < c_2 < c_3$
• If there are no inferior candidates, $q_1 < q_2 < q_3$

Prune Solution List

[Flowchart: candidates $(c_1, q_1)$, $(c_2, q_2)$, $(c_3, q_3)$, $(c_4, q_4)$ visited in order of increasing c]

• $q_1 < q_2$? If N, prune 2
• $q_2 < q_3$? (or $q_1 < q_3$? when 2 was pruned) If N, prune 3
• $q_3 < q_4$? (or $q_2 < q_4$?, $q_1 < q_4$? when earlier candidates were pruned) If N, prune 4
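The flowchart amounts to a single scan over the list in order of increasing c, comparing each candidate with the last one kept. A minimal sketch, assuming candidates are (c, q) pairs; unlike the strict definition of "inferior", this version also drops candidates that tie on c or on q, and the initial sort costs extra when the list is not already ordered.

```python
def prune(cands):
    """Remove inferior candidates from a list of (c, q) pairs.

    After sorting by increasing c (ties: larger q first), a candidate is
    kept only if its q is larger than the q of the last kept candidate.
    """
    kept = []
    for cap, q in sorted(cands, key=lambda cq: (cq[0], -cq[1])):
        if not kept or q > kept[-1][1]:
            kept.append((cap, q))
    return kept


# Candidates from the Van Ginneken example slides.
print(prune([(45, 50), (5, 0), (20, 100), (5, 70)]))
```

The result keeps (5, 70) and (20, 100), in line with the example where (5, 0) and (45, 50) are called inferior.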

Pruning In Merging
• Left candidates: $(c_{l1}, q_{l1})$, $(c_{l2}, q_{l2})$, $(c_{l3}, q_{l3})$
• Right candidates: $(c_{r1}, q_{r1})$, $(c_{r2}, q_{r2})$
• $q_{l1} < q_{l2} < q_{r1} < q_{l3} < q_{r2}$
• Merged candidates: $(c_{l1}+c_{r1}, q_{l1})$, $(c_{l2}+c_{r1}, q_{l2})$, $(c_{l3}+c_{r1}, q_{r1})$, $(c_{l3}+c_{r2}, q_{l3})$
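The merged list above can be produced with a two-pointer walk over the two sorted lists, which is why merging is additive rather than multiplicative. A sketch, assuming both inputs are already pruned and sorted by increasing c and q; the numeric values are invented to match the ordering on the slide.

```python
def merge_lists(left, right):
    """Merge two pruned candidate lists of (c, q) pairs."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        cl, ql = left[i]
        cr, qr = right[j]
        merged.append((cl + cr, min(ql, qr)))
        # Advance the side that limits q: pairing it with a larger c on
        # the other side could only add load without improving q.
        if ql < qr:
            i += 1
        elif qr < ql:
            j += 1
        else:
            i += 1
            j += 1
    return merged


# q_l1 < q_l2 < q_r1 < q_l3 < q_r2, as on the slide.
left = [(1, 10), (2, 20), (3, 40)]
right = [(1, 30), (2, 50)]
print(merge_lists(left, right))
```

At most len(left) + len(right) - 1 merged candidates are produced.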

Van Ginneken Complexity
• Generate candidates from sinks to source
• Quadratic runtime
  – Adding a wire does not change #candidates
  – Adding a buffer adds only one new candidate
  – Merging branches additive, not multiplicative
  – Linear time solution list pruning
• Optimal for Elmore delay model
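Putting the pieces together for the simplest case, a two-pin net with one buffer type, gives a compact dynamic program. This is a sketch under stated assumptions rather than the full algorithm: there is no branching (so no merge step), buffers may only sit at junctions between wire segments, and only the best slack is returned, not the buffer positions. The function name and argument layout are illustrative.

```python
def van_ginneken_path(segments, sink, r, c, R_b, C_b, t_b, R_d):
    """Best driver slack for a two-pin net with one buffer type.

    segments: wire lengths ordered from the sink toward the driver; a
              buffer may be placed at each junction between segments.
    sink:     (capacitance, RAT) of the sink.
    """
    cands = [sink]
    for k, x in enumerate(segments):
        # Add wire: the number of candidates does not change.
        cands = [(cap + c * x, q - r * c * x * x / 2 - r * x * cap)
                 for cap, q in cands]
        if k < len(segments) - 1:
            # Insert buffer: all buffered candidates share load C_b,
            # so only the one with maximum q is added.
            best_q = max(q - R_b * cap - t_b for cap, q in cands)
            cands.append((C_b, best_q))
        # Prune candidates with larger c but no better q.
        kept = []
        for cap, q in sorted(cands, key=lambda cq: (cq[0], -cq[1])):
            if not kept or q > kept[-1][1]:
                kept.append((cap, q))
        cands = kept
    # Add driver and pick the maximum slack.
    return max(q - R_d * cap for cap, q in cands)


# Same parameters as the solution propagation example.
print(van_ginneken_path([2, 2], (1, 20), 1, 1, 1, 1, 1, 1))
```

With one junction available the result matches the buffered slack of 5 from the example; with a single segment of length 4 (no junction) it falls back to the unbuffered slack of 3.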

Multiple Buffer Types
• r = 1, c = 1
• R_b1 = 1, C_b1 = 1, t_b1 = 1
• R_b2 = 0.5, C_b2 = 2, t_b2 = 0.5
• R_d = 1
• (v1, 1, 20) → add wire (length 2) → (v2, 3, 16)
• Candidates at v2: no buffer (v2, 3, 16); buffer 1 (v2, 1, 12); buffer 2 (v2, 2, 14)
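With several buffer types, the insert-buffer step produces one buffered candidate per type. A short sketch, assuming each buffer type is given as an (R_b, C_b, t_b) tuple and a candidate as a (c, q) pair; it reproduces the two buffered candidates at v2.

```python
def buffer_candidates(cand, buffers):
    """One buffered candidate (C_b, q - R_b*c - t_b) per buffer type."""
    cap, q = cand
    return [(C_b, q - R_b * cap - t_b) for (R_b, C_b, t_b) in buffers]


buffers = [(1, 1, 1), (0.5, 2, 0.5)]   # (R_b, C_b, t_b) for types 1 and 2
at_v2 = (3, 16)
print(buffer_candidates(at_v2, buffers))
```

Neither buffered candidate dominates the other here (type 2 has the larger load but the larger q), so both stay in the list alongside the unbuffered candidate.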