MIT 6.375 Lecture 01

Physical Design - 1
Arvind
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology
March 14, 2008
http://csg.csail.mit.edu/6.375/
L15-1
6.375 Standard Cell Design Flow
[Figure: design-flow diagram. Boxes: Bluespec SystemVerilog source, Bluespec Compiler, Blueview, C, Bluespec C sim (cycle accurate), Verilog 95 RTL, Verilog sim, VCD output, Debussy (visualization), RTL synthesis, gates. Legend: files, Bluespec tools, 3rd party tools.]
How do RTL choices affect resulting physical design?
Metrics for Chip “Quality”
Area
• Size affects manufacturing and packaging costs
Performance
• Does chip meet market performance goals?
Power
• Peak power affects packaging cost (current supply, heat removal)
• Energy usage affects battery life
Iron Law of Performance
$$\text{Performance} = \frac{\text{Operations}}{\text{Clock Cycle}} \times \frac{\text{Clock Cycles}}{\text{Second}}$$
Operations per clock cycle: concurrency in RTL design
Clock cycles per second: clock frequency of physical design
These are not independent parameters!
Clock frequency set by delay of circuit components in critical path
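The product in the Iron Law can be tried out numerically. The sketch below is only an illustration: the function name and the two hypothetical designs (their operations per cycle and critical-path delays) are assumptions, not values from the lecture.

```python
def performance(ops_per_cycle, critical_path_delay_s):
    """Operations per second = (operations / cycle) x (cycles / second)."""
    # The clock frequency is set by the critical-path delay.
    clock_frequency_hz = 1.0 / critical_path_delay_s
    return ops_per_cycle * clock_frequency_hz

# Hypothetical designs: B does twice the work per cycle,
# but its extra logic lengthens the critical path.
design_a = performance(ops_per_cycle=1, critical_path_delay_s=2e-9)
design_b = performance(ops_per_cycle=2, critical_path_delay_s=5e-9)
print(design_a > design_b)
```

With these made-up numbers the design with less concurrency wins, which is one way the two factors fail to be independent.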
What is synthesis?
Synthesis tools (e.g., Design Compiler) convert RTL into a gate-level netlist given a gate library
• infer logic and state elements
  - rather straightforward unless the language semantics complicate it
• perform technology-independent optimizations
  - logic simplification, state assignment, …
• map elements to the target technology
• perform technology-dependent optimizations
  - multi-level logic optimization, choose gate strengths to achieve speed goals, …
Logic Synthesis
assign z = (a & b) | c;
[Figure: AND gate on a and b feeding an OR gate with c, output z]
// dataflow
assign z = sel ? a : b;
[Figure: 2-to-1 multiplexer choosing a (input 1) or b (input 0) under sel, and an equivalent gate-level circuit, output z]
wire [3:0] x,y,sum;
wire cout;
assign {cout,sum} = x + y;
As a default, + is implemented as a ripple-carry adder
[Figure: four chained full adders with inputs x[0]y[0], x[1]y[1], x[2]y[2], x[3]y[3], carry-in 0, outputs sum[0] to sum[3] and cout]
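The ripple-carry structure that the slide describes for `x + y` can be modeled bit by bit. This is a minimal behavioral sketch in Python, not the synthesis tool's output; the function names are assumptions chosen for illustration.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x, y, width=4):
    """Chain `width` full adders; the carry ripples from bit 0 upward."""
    carry = 0          # carry-in of the first full adder is 0
    total = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return carry, total   # corresponds to {cout, sum}

cout, total = ripple_carry_add(0b1011, 0b0110)
print(cout, format(total, "04b"))
```

The carry of each stage is an input of the next one, so the worst-case delay grows with the number of bits.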
Technology-independent optimizations
Two-level boolean minimization
• Quine-McCluskey
Optimizing finite state machines
• look for an equivalent FSM that has fewer states
• choose an FSM state encoding that minimizes the size of state storage + size of logic to implement next state and output functions
None of these operations is completely isolated from the target technology. But experience has shown that it’s advantageous to reduce the size of the problem as much as possible before starting the technology-dependent optimizations.
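As a rough illustration of two-level minimization, the sketch below implements only the first phase of Quine-McCluskey: repeatedly merging implicants that differ in one bit until only prime implicants remain. Choosing a minimum cover from the primes is omitted. The function names and the string encoding (`0`, `1`, `-` for a dropped variable) are assumptions for this example.

```python
from itertools import combinations

def combine(a, b):
    """Merge two implicants that differ in exactly one specified bit, else None."""
    diff = [i for i, (p, q) in enumerate(zip(a, b)) if p != q]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]
    return None

def prime_implicants(minterms, nbits):
    """Return the set of prime implicants of the given minterms."""
    current = {format(m, f"0{nbits}b") for m in minterms}
    primes = set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            c = combine(a, b)
            if c is not None:
                merged.add(c)
                used.update((a, b))
        primes |= current - used   # anything that could not be merged is prime
        current = merged
    return primes

# z = (a & b) | c with bit order a, b, c: minterms 001, 011, 101, 110, 111
print(sorted(prime_implicants([1, 3, 5, 6, 7], 3)))
```

For this function the two primes correspond to the terms `c` and `a & b`, matching the expression it was built from.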
Mapping to target technology
Problem statement: find an “optimal” mapping of this circuit:
Into this library:
Popular approach: DAG covering (K. Keutzer)
A library of gates
[Figure: library gates, annotated with the values 8, 13, 13, 10, 11]
Possible implementations
Is there a systematic way
to arrive at the optimal
answer?
Use dynamic programming!
Optimal cover for a tree consists of a best match at the root of the tree plus the optimal cover for the sub-trees starting at each input of the match.
[Figure: two candidate matches at the root of a tree. Best cover for one match uses best covers for P, X & Y; best cover for the other match uses best covers for P & Z.]
Complexity: O(N)
We only need to consider a best-cost match at the root of the tree (constant time in the number of matched cells), plus the optimal cover for the subtrees starting at each input to the match (constant time in the fanin of each match).
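The recurrence above can be sketched as a memoized recursion over a subject tree in NAND2/inverter form. Everything specific here is an assumption made for illustration: the tuple encoding of trees, the four-cell library, and its costs are hypothetical and are not the library from the slides. Input swapping of NAND gates is not considered.

```python
from functools import lru_cache

# Hypothetical library: name -> (pattern in NAND2/INV form, cost). "*" is a pattern input.
LIBRARY = {
    "INV":   (("inv", "*"), 2),
    "NAND2": (("nand", "*", "*"), 3),
    "AND2":  (("inv", ("nand", "*", "*")), 4),
    "NAND3": (("nand", ("inv", ("nand", "*", "*")), "*"), 4),
}

def match(pattern, node):
    """Return the subject subtrees found at the pattern inputs, or None if no match."""
    if pattern == "*":
        return [node]
    if isinstance(node, str) or node[0] != pattern[0]:
        return None
    leaves = []
    for sub_pattern, child in zip(pattern[1:], node[1:]):
        sub = match(sub_pattern, child)
        if sub is None:
            return None
        leaves.extend(sub)
    return leaves

@lru_cache(maxsize=None)
def best_cover(node):
    """Return (cost, gates) of the cheapest cover of the tree rooted at node."""
    if isinstance(node, str):          # primary input: nothing to cover
        return 0, ()
    best = None
    for name, (pattern, cost) in LIBRARY.items():
        leaves = match(pattern, node)
        if leaves is None:
            continue
        total, gates = cost, (name,)
        for leaf in leaves:            # add the best covers of the match inputs
            sub_cost, sub_gates = best_cover(leaf)
            total += sub_cost
            gates += sub_gates
        if best is None or total < best[0]:
            best = (total, gates)
    return best

# Subject tree for a 3-input AND: inv(nand(inv(nand(a, b)), c))
subject = ("inv", ("nand", ("inv", ("nand", "a", "b")), "c"))
print(best_cover(subject))
```

Because INV and NAND2 are in the library, every node of a normal-form tree has at least one match. The memoization means each subtree is solved once, which is where the linear complexity comes from.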
Optimal tree covering example
[Figures: worked example, continued over three slides]
Our final answer matches our earlier intuitive cover
Refinements: timing optimization incorporating load-dependent delays, optimization for low power
DAG Covering
Represent input netlist in normal form (“subject DAG”); the normal form uses 2-input NAND gates + inverters
Represent each library gate in normal form (“primitive DAGs”)
Goal: find a minimum cost covering of the subject DAG by the primitive DAGs
• If the subject and primitive DAGs are trees, use dynamic programming for finding the optimum cover
• Partition subject DAG into a forest of trees (each gate with fanout > 1 becomes root of a new tree), generate optimal solutions for each tree, stitch solutions together
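The partitioning step can be sketched as follows. The netlist encoding (a dict from gate name to its fanin names, with any name that is not a key treated as a primary input) and the example circuit are assumptions for illustration only.

```python
def partition_into_trees(netlist, outputs):
    """Split a DAG into trees rooted at outputs and at gates with fanout > 1."""
    fanout = {}
    for fanins in netlist.values():
        for f in fanins:
            fanout[f] = fanout.get(f, 0) + 1
    roots = {g for g in netlist if g in outputs or fanout.get(g, 0) > 1}

    def collect(gate):
        # Walk toward the inputs, stopping at primary inputs and at other roots.
        members = [gate]
        for f in netlist[gate]:
            if f in netlist and f not in roots:
                members.extend(collect(f))
        return members

    return {root: collect(root) for root in roots}

# Hypothetical netlist: n1 feeds both n2 and n3, so it becomes the root of its own tree.
netlist = {
    "n1": ["a", "b"],
    "n2": ["n1", "c"],
    "n3": ["n1", "d"],
    "out": ["n2", "n3"],
}
trees = partition_into_trees(netlist, outputs={"out"})
print(trees)
```

Each resulting tree can then be covered independently, with the root of one tree acting as a leaf input of the trees that use it.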
Technology-dependent
optimizations
Additional library components: more complex
cells may be slower but will reduce area for logic
off the critical path.
Load buffering: adding buffers/inverters to
improve load-induced delays along the critical
path
Resizing: Resize transistors in gates along
critical path
Retiming: change placement of latches/registers
to minimize overall cycle time
Increase routability over/through cells: reduce
routing congestion.
You are here!
[Figure: flow from Verilog through Logic Synthesis to Gate netlist, then Place & route to Mask, with “You are here!” and “Next time” markers]
Logic Synthesis
• HDL → logic
• map to target library
• optimize speed, area
Place & route
• create floorplan blocks
• place cells in block
• route interconnect
• insert buffers to overcome loading and wire delays
• insert clock & power distribution networks
• optimize (iterate!)
What determines clock cycle
[Figure: combinational logic block and clock]
• Fan-in of gates
• Fan-out of gates
• Wire lengths
• Set up and hold times …
Which gate topology and
transistor sizing is optimal?
Given a logic function, there are many
possible logic gate topologies and
transistor sizings.
1. What is the optimal transistor sizing?
2. What is the optimal number of logic
stages?
Basic CMOS Components
• Gates
• Transistors
• Wires
[Figure: gate with input0, input1 and output]
FET = Field-Effect Transistor
A four terminal device (gate, source, drain, bulk)
[Figure: wafer cross-section with the gate above the surface of the wafer, source and drain diffusions, the bulk toward the reverse side of the wafer, vertical field Ev, horizontal field Eh, and the region under the gate where inversion happens]
Inversion: A vertical field creates a channel between the source and drain.
Conduction: If a channel exists, a horizontal field causes a drift current from the drain to the source.
RC modeling of delay in MOSFET transistors
[Figure: transistor with its Width and Length marked, modeled by Vin, Vout, Cgate, Cdrain and Reff]
Increase Width (W) → Increase current → Decrease Reff
Increase Length (L) → Decrease current → Increase Reff
Cgate proportional to (W x L) and Cdrain proportional to W
The most basic CMOS gate is an inverter
[Figure: inverter with input Vin (A) and output Vout (Y); PMOS (WP/LP, 2α) connected to VDD, NMOS (WN/LN, 1α) connected to GND]
RC model for an inverter
[Figure: inverter between Vin and Vout and its RC model, with Cg at the input, Cd at the output and an Reff in each of the pull-up and pull-down paths]
Reff = Reff,N = Reff,P
Cg = Cg,N + Cg,P
Cd = Cd,N + Cd,P
Charging time (0 → 1)
[Figure: RC model with Vin = “0”, showing Cg, Cd, CL and the two Reff]
Charge RC Time Constant
TPLH = Reff x ( Cd + CL )
Discharging time (1 → 0)
[Figure: RC model with Vin = “1”, showing Cg, Cd, CL and the two Reff]
Discharge RC Time Constant
TPHL = Reff x ( Cd + CL )
Larger gates are faster
decrease Reff (but increase Cd!)
Process gen = 0.25μm
Supply voltage = 5V
Min width NMOS = 0.5μm

Param         Value   Units
Cd,N/μm       1.42    fF/μm
Cd,P/μm       2.40    fF/μm
Cg,N/μm       1.55    fF/μm
Cg,P/μm       1.48    fF/μm
Reff,N x μm   4.93    kΩ·μm
Reff,P x μm   10.83   kΩ·μm

[Figure: driver inverter (PMOS 2, NMOS 1) driving a load inverter (PMOS 2, NMOS 1); sizes in multiples of the minimum width]
Cd = (0.5x1.42) + (1x2.40) = 3.11 fF
CL = (0.5x1.55) + (1x1.48) = 2.26 fF
Cd+CL = 5.37 fF
TPLH = 2.2 x (10.83/1) x 5.37 = 128ps
TPHL = 2.2 x (4.93/0.5) x 5.37 = 116ps
Double size of driver
[Figure: driver inverter (PMOS 4, NMOS 2) driving the same load inverter (PMOS 2, NMOS 1)]
Cd = (1x1.42) + (2x2.40) = 6.22 fF
CL = (0.5x1.55) + (1x1.48) = 2.26 fF
Cd+CL = 8.48 fF
TPLH = 2.2 x (10.83/2) x 8.48 = 101ps
TPHL = 2.2 x (4.93/1) x 8.48 = 92.0ps
Ignores the fact that previous
gate now must drive a bigger
gate capacitance!
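The delay arithmetic on this slide can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function and constant names are made up, the transistor widths in μm are read off the figure labels (multiples of the 0.5 μm minimum width), and the 2.2 factor is taken from the slide's own formulas. Since kΩ times fF gives ps, no unit conversion is needed.

```python
# Per-micron parameters from the slide's table (fF/um and kOhm*um).
CD_N, CD_P = 1.42, 2.40
CG_N, CG_P = 1.55, 1.48
REFF_N, REFF_P = 4.93, 10.83

def inverter_delays(wn_drv, wp_drv, wn_load, wp_load):
    """Return (TPLH, TPHL) in ps for a driver inverter loaded by one inverter."""
    cd = wn_drv * CD_N + wp_drv * CD_P      # driver drain capacitance, fF
    cl = wn_load * CG_N + wp_load * CG_P    # load gate capacitance, fF
    c_total = cd + cl
    t_plh = 2.2 * (REFF_P / wp_drv) * c_total   # PMOS pulls the output up
    t_phl = 2.2 * (REFF_N / wn_drv) * c_total   # NMOS pulls the output down
    return t_plh, t_phl

base = inverter_delays(wn_drv=0.5, wp_drv=1.0, wn_load=0.5, wp_load=1.0)
doubled = inverter_delays(wn_drv=1.0, wp_drv=2.0, wn_load=0.5, wp_load=1.0)
print(base, doubled)
```

Doubling the driver halves Reff but also doubles its own drain capacitance, so the delay drops by less than a factor of two.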
Bigger gates: NAND, NOR
NAND Gate: inputs A, B; output $\overline{A \cdot B}$
NOR Gate: inputs A, B; output $\overline{A + B}$
[Figure: gate symbols and transistor-level schematics of the NAND and NOR gates]
Unit-less delay (d) of gates with equal drive strength (Reff)

Gate       PMOS widths   NMOS widths   Load   Delay
Inverter   4             2             10     2.67
NAND       4, 4          4, 4          10     3.67
NOR        8, 8          2, 2          10     3.67

Less parasitic drain capacitance (Cd) loading output
Unit-less delay (d) of gates with similar area

Gate       PMOS widths   NMOS widths   Load   Delay
Inverter   6             3             10     2.11
NAND       2.5, 2.5      2.5, 2.5      10     4.67
NOR        4, 4          1, 1          10     5.33

PMOS worse than NMOS, series path is limiter
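The unit-less delays quoted for the inverter, NAND and NOR gates can be reproduced with a simple model. Note that the formula below is inferred from the slides' numbers rather than stated in them: drive resistance is normalized to a reference inverter with PMOS width 4 and NMOS width 2, the parasitic capacitance is the sum of the widths of the drains on the output node, and the result is divided by the reference inverter's total width of 6. The per-gate transistor widths are read off the figure labels. All names are assumptions.

```python
REF_P, REF_N = 4.0, 2.0   # reference inverter widths, unit drive resistance

def gate_delay(kind, wp, wn, load=10.0):
    """Unit-less delay; kind is 'inv', 'nand2' or 'nor2', wp/wn are per-transistor widths."""
    series_p = 2 if kind == "nor2" else 1     # NOR stacks its PMOS in series
    series_n = 2 if kind == "nand2" else 1    # NAND stacks its NMOS in series
    r_up = REF_P * series_p / wp
    r_down = REF_N * series_n / wn
    # Parallel devices all put a drain on the output; a series stack contributes one.
    p_drains = 2 if kind == "nand2" else 1
    n_drains = 2 if kind == "nor2" else 1
    c_drain = p_drains * wp + n_drains * wn
    return max(r_up, r_down) * (c_drain + load) / (REF_P + REF_N)

equal_drive = [gate_delay("inv", 4, 2), gate_delay("nand2", 4, 4), gate_delay("nor2", 8, 2)]
similar_area = [gate_delay("inv", 6, 3), gate_delay("nand2", 2.5, 2.5), gate_delay("nor2", 4, 1)]
print([round(d, 2) for d in equal_drive], [round(d, 2) for d in similar_area])
```

Under this model the NOR gate suffers most at similar area because its slower PMOS devices sit in the series path.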
Optimal sizing and delays for example topologies
[Figure: Topology A, Topology B and Topology C, each annotated with its optimal gate sizes]

DOPT = optimal delay for output loading H

Topology   G      N   P   DOPT                    H=1     H=12
A          2.96   4   7   4(2.96H)^(1/4) + 7      12.25   16.77
B          3.33   2   6   2(3.33H)^(1/2) + 6      9.65    18.64
C          3.33   2   9   2(3.33H)^(1/2) + 9      12.65   21.64

[ For more explanation of how these numbers were derived, see Logical Effort link in lab handout ]
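The DOPT column follows the pattern N(GH)^(1/N) + P, so the table entries can be recomputed directly. The sketch below assumes that reading of the table; the function and dictionary names are made up for illustration.

```python
def optimal_delay(g, n, p, h):
    """D_opt = N * (G * H)^(1/N) + P for path logical effort G, N stages, parasitic delay P."""
    return n * (g * h) ** (1.0 / n) + p

# (G, N, P) per topology, taken from the table.
TOPOLOGIES = {"A": (2.96, 4, 7), "B": (3.33, 2, 6), "C": (3.33, 2, 9)}

for name, (g, n, p) in TOPOLOGIES.items():
    print(name, round(optimal_delay(g, n, p, 1), 2), round(optimal_delay(g, n, p, 12), 2))

best_light = min(TOPOLOGIES, key=lambda t: optimal_delay(*TOPOLOGIES[t], 1))
best_heavy = min(TOPOLOGIES, key=lambda t: optimal_delay(*TOPOLOGIES[t], 12))
```

The two-stage topology B is fastest for a light load, while the four-stage topology A overtakes it once the load is heavy.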
How many stages of inverters are required to drive a load?
[Figure: chain of inverters from Cin to Cout]
A lumped π model of a wire
[Figure: driver resistance Rdriver, wire resistance Rw with Cw/2 at each end, and Cload]
$$\text{Delay} = R_{driver} \times \frac{C_w}{2} + (R_{driver} + R_w) \times \left(\frac{C_w}{2} + C_{load}\right)$$
Rw is lumped resistance of the wire
Cw is lumped capacitance
• Partition half of Cw at each end
Estimate the rise time of node A
Process gen = 0.25μm
Supply voltage = 5V
Min width NMOS = 0.5μm

Param          Value   Units
Cd,N / μm      1.42    fF/μm
Cd,P / μm      2.40    fF/μm
Cg,N / μm      1.55    fF/μm
Cg,P / μm      1.48    fF/μm
CA,M2 / μm2    0.016   fF/μm2
CL,M2 / μm     0.084   fF/μm
Reff,N x μm    4.93    kΩ·μm
Reff,P x μm    10.83   kΩ·μm
RM2 / sq       0.07    Ω/sq

[Figure: inverter (PMOS 16, NMOS 8) driving node A through a Metal 2 wire (250µm x 0.250µm) into an inverter (PMOS 2, NMOS 1); the RC model consists of RP, RW, Cd, CW/2, CW/2 and Cg]
Cg = (0.5 x 1.55) + (1 x 1.48) = 2.26 fF
Cd = (4 x 1.42) + (8 x 2.40) = 24.88 fF
Rp = 10.83/8 = 1.35 kΩ
Rw = (250 / 0.25) x 0.07 = 70 Ω
Cw = ((250 x 0.25) x 0.016) + (250 x 0.084) = 22.0 fF
TPLH = 2.2 x (1350 x (22.0/2 + 24.88)
+ (1350 + 70) x (22.0/2 + 2.26)) = 148ps
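The rise-time estimate combines the lumped π wire model with the driver's drain capacitance and the load's gate capacitance. The sketch below recomputes it from the parameter table. It is an illustration under stated assumptions: the names are invented, the area capacitance is taken as the table's 0.016 fF/μm², the 2.2 factor from the inverter delay formulas is kept, and Rp is not rounded, so the result may differ slightly from a hand calculation.

```python
def wire_rise_time_ps(r_driver, r_wire, c_drain, c_wire, c_load):
    """Resistances in ohms, capacitances in fF; returns the 2.2*RC estimate in ps."""
    rc_fs = (r_driver * (c_wire / 2 + c_drain)
             + (r_driver + r_wire) * (c_wire / 2 + c_load))   # ohm * fF = fs
    return 2.2 * rc_fs / 1000.0

length_um, width_um = 250.0, 0.25
c_wire = length_um * width_um * 0.016 + length_um * 0.084   # area + fringe terms, fF
r_wire = (length_um / width_um) * 0.07                      # squares * ohm/sq
c_gate = 0.5 * 1.55 + 1.0 * 1.48                            # load inverter gate, fF
c_drain = 4.0 * 1.42 + 8.0 * 2.40                           # driver drains, fF
r_p = 10.83 / 8.0 * 1000.0                                  # driver PMOS, ohms

t_plh = wire_rise_time_ps(r_p, r_wire, c_drain, c_wire, c_gate)
print(round(c_wire, 2), round(r_wire, 1), round(t_plh, 1))
```

The wire resistance is small next to the driver resistance here, so the wire matters mainly through its capacitance.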
Adding buffers
Process gen = 0.25μm
Supply voltage = 5V
Min width NMOS = 0.5μm

Param          Value   Units
Cd,N / μm      1.42    fF/μm
Cd,P / μm      2.40    fF/μm
Cg,N / μm      1.55    fF/μm
Cg,P / μm      1.48    fF/μm
CA,M2 / μm2    0.016   fF/μm2
CL,M2 / μm     0.084   fF/μm
Reff,N x μm    4.93    kΩ·μm
Reff,P x μm    10.83   kΩ·μm
RM2 / sq       0.07    Ω/sq

[Figure: inverter (16, 8) driving node A through a Metal 2 wire (250u x 0.250u) into an inverter (2, 1), plus two candidate buffer chains with size labels 2, 8, 16 over 1, 2, 8 and 2, 6, 10, 14, 16 over 1, 3, 5, 7, 8]
Should we have a few big stages or many small stages?
A good rule-of-thumb is to target a stage effort around four
[Figure: chain of stages between Cin and Cout]
Minimum delay when:
• Stage effort = logical effort x electrical effort ≈ 3.4-3.8
• Some derivations use e = 2.718.. (this ignores parasitics)
• Broad optimum, stage efforts of 2.4-6.0 within 15-20% of minimum
• Fan-out-of-four (FO4) is convenient design size (~5τ)
FO4 delay: Delay of inverter driving four copies of itself
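The rule of thumb translates into a quick way to pick the number of inverter stages between Cin and Cout. This is a minimal sketch that assumes a plain inverter chain (logical effort 1 per stage), so the stage effort equals the per-stage fan-out; it ignores whether the chain needs an even or odd number of inversions. The function name and example capacitances are assumptions.

```python
import math

def inverter_chain(c_in, c_out, target_effort=4.0):
    """Pick a stage count so that each stage's effort is close to target_effort."""
    path_effort = c_out / c_in
    n = max(1, round(math.log(path_effort) / math.log(target_effort)))
    stage_effort = path_effort ** (1.0 / n)
    # Input capacitance of each stage grows geometrically along the chain.
    sizes = [c_in * stage_effort ** i for i in range(n)]
    return n, stage_effort, sizes

n, effort, sizes = inverter_chain(c_in=1.0, c_out=64.0)
print(n, round(effort, 2), [round(s, 2) for s in sizes])
```

Because the optimum is broad, rounding the stage count to the nearest integer usually keeps the stage effort within the 2.4 to 6.0 range quoted above.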