Transcript clock skew

Introduction to CMOS VLSI Design
Clock Skew-Tolerant Circuits
Outline
 Clock Distribution
 Clock Skew
 Skew-Tolerant Static Circuits
 Traditional Domino Circuits
 Skew-Tolerant Domino Circuits
CMOS VLSI Design
Review Timing Definitions
Clocking
 Synchronous systems use a clock to keep
operations in sequence
– Distinguish the current operation from the previous or next one
– Determine the speed at which the machine operates
 Clock must be distributed to all the sequencing
elements
– Flip-flops and latches
 Also distribute clock to other elements
– Domino circuits and memories
Clock Distribution
 On a small chip, the clock distribution network is just
a wire
– And possibly an inverter for clkb
 On practical chips, the RC delay of the wire
resistance and gate load is very long
– Variations in this delay cause clock to get to
different elements at different times
– This is called clock skew
 Most chips use repeaters to buffer the clock and
equalize the delay
– Reduces but doesn’t eliminate skew
Example
 Skew comes from differences in gate and wire delay
– With right buffer sizing, clk1 and clk2 could ideally
arrive at the same time.
– But power supply noise changes buffer delays
– clk2 and clk3 will always see RC skew
[Figure: gclk buffered to three local clocks over wires labeled 3 mm, 3.1 mm, and 0.5 mm; loads are 1.3 pF on clk1, 0.4 pF on clk2, and 0.4 pF on clk3]
Skew Impact
 Ideally full cycle is available for work
 Skew adds sequencing overhead
 Increases hold time too
$t_{pd} \le T_c - (t_{pcq} + t_{setup} + t_{skew})$ (the parenthesized term is the sequencing overhead)
$t_{cd} \ge t_{hold} - t_{ccq} + t_{skew}$
[Figure: flip-flops F1 and F2, both clocked by clk, with combinational logic (CL) between Q1 and D2; timing diagrams for the setup case (Tc, tpcq, tpd, tsetup, tskew) and the hold case (tccq, tcd, thold, tskew)]
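As a worked illustration of the two flip-flop constraints on this slide, t_pd <= T_c - (t_pcq + t_setup + t_skew) and t_cd >= t_hold - t_ccq + t_skew, the following Python sketch shows how growing skew shrinks the time left for logic and raises the minimum contamination delay. The function names and every numeric value (in picoseconds) are hypothetical, chosen only to show the arithmetic.

```python
def max_logic_delay(t_c, t_pcq, t_setup, t_skew):
    """Setup constraint: t_pd <= T_c - (t_pcq + t_setup + t_skew)."""
    return t_c - (t_pcq + t_setup + t_skew)


def min_contamination_delay(t_hold, t_ccq, t_skew):
    """Hold constraint: t_cd >= t_hold - t_ccq + t_skew."""
    return t_hold - t_ccq + t_skew


# Hypothetical timing values in picoseconds (not taken from the slides).
T_C, T_PCQ, T_CCQ, T_SETUP, T_HOLD = 1000, 80, 40, 60, 20

for skew in (0, 50, 100):
    t_pd = max_logic_delay(T_C, T_PCQ, T_SETUP, skew)
    t_cd = min_contamination_delay(T_HOLD, T_CCQ, skew)
    print(f"skew={skew} ps: t_pd <= {t_pd} ps, t_cd >= {t_cd} ps")
```

With these assumed numbers, each picosecond of skew costs one picosecond of logic time and adds one picosecond to the required contamination delay.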
Cycle Time Trends
 Much of CPU performance comes from higher clock frequency f
– f is improving faster than simple process shrinks alone would give
– Sequencing overhead is a bigger part of the cycle
[Figure: trend plots for the 80386, 80486, Pentium, and Pentium II / III, 1985 to 2000: clock frequency (MHz) and SpecInt95; FO4 inverter delays per cycle; fanout-of-4 (FO4) inverter delay (ps) versus process (2.0 down to 0.25) at VDD = 5, 3.3, and 2.5]
Solutions
 Reduce clock skew
– Careful clock distribution network design
– Plenty of metal wiring resources
 Analyze clock skew
– Only budget actual, not worst case skews
– Local vs. global skew budgets
 Tolerate clock skew
– Choose circuit structures insensitive to skew
Clock Distribution Networks
 Ad hoc
 Grids
 H-tree
 Hybrid
Clock Grids
 Use grid on two or more levels to carry clock
 Make wires wide to reduce RC delay
 Ensures low skew between nearby points
 But possibly large skew across die
Alpha Clock Grids
[Figure: gclk grids of the Alpha 21064, Alpha 21164, and Alpha 21264, with the PLL marked]
H-Trees
 Fractal structure
– Gets clock arbitrarily close to any point
– Matched delay along all paths
 Delay variations cause skew
 A and B might see big skew
[Figure: H-tree clock network with two leaf points labeled A and B]
Itanium 2 H-Tree
 Four levels of buffering:
– Primary driver
– Repeater
– Second-level clock buffer (SLCB)
– Gater
 Route around obstructions
[Figure: Itanium 2 H-tree showing the primary buffer, repeaters, and typical SLCB locations]
Hybrid Networks
 Use H-tree to distribute clock to many points
 Tie these points together with a grid
 Ex: IBM Power4, PowerPC
– H-tree drives 16-64 sector buffers
– Buffers drive total of 1024 points
– All points shorted together with grid
Skew Tolerance
 Flip-flops are sensitive to skew because of hard edges
– Data launches at latest rising edge of clock
– Must set up before earliest next rising edge of clock
– Overhead would shrink if we could soften the edge
 Latches tolerate moderate amounts of skew
– Data can arrive anytime latch is transparent
Skew: Latches
[Figure: latches L1, L2, and L3 clocked by φ1 and φ2, with Combinational Logic 1 between Q1 and D2 and Combinational Logic 2 between Q2 and D3]
2-Phase Latches
$t_{pd} \le T_c - 2t_{pdq}$ (the $2t_{pdq}$ term is the sequencing overhead)
$t_{cd1}, t_{cd2} \ge t_{hold} - t_{ccq} - t_{nonoverlap} + t_{skew}$
$t_{borrow} \le \frac{T_c}{2} - (t_{setup} + t_{nonoverlap} + t_{skew})$
Pulsed Latches
$t_{pd} \le T_c - \max(t_{pdq},\ t_{pcq} + t_{setup} - t_{pw} + t_{skew})$ (the max term is the sequencing overhead)
$t_{cd} \ge t_{hold} + t_{pw} - t_{ccq} + t_{skew}$
$t_{borrow} \le t_{pw} - (t_{setup} + t_{skew})$
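The latch constraints on this slide can be checked numerically. The sketch below is a minimal illustration, assuming the standard forms of the two-phase and pulsed-latch constraints (written out in the docstrings); the function names and all timing values, in picoseconds, are hypothetical.

```python
def two_phase_latch(t_c, t_pdq, t_ccq, t_setup, t_hold, t_nonoverlap, t_skew):
    """Return (max t_pd, min t_cd, max t_borrow) for two-phase latches.

    t_pd     <= T_c - 2*t_pdq
    t_cd     >= t_hold - t_ccq - t_nonoverlap + t_skew
    t_borrow <= T_c/2 - (t_setup + t_nonoverlap + t_skew)
    """
    t_pd_max = t_c - 2 * t_pdq
    t_cd_min = t_hold - t_ccq - t_nonoverlap + t_skew
    t_borrow_max = t_c / 2 - (t_setup + t_nonoverlap + t_skew)
    return t_pd_max, t_cd_min, t_borrow_max


def pulsed_latch(t_c, t_pdq, t_pcq, t_ccq, t_setup, t_hold, t_pw, t_skew):
    """Return (max t_pd, min t_cd, max t_borrow) for pulsed latches.

    t_pd     <= T_c - max(t_pdq, t_pcq + t_setup - t_pw + t_skew)
    t_cd     >= t_hold + t_pw - t_ccq + t_skew
    t_borrow <= t_pw - (t_setup + t_skew)
    """
    overhead = max(t_pdq, t_pcq + t_setup - t_pw + t_skew)
    t_pd_max = t_c - overhead
    t_cd_min = t_hold + t_pw - t_ccq + t_skew
    t_borrow_max = t_pw - (t_setup + t_skew)
    return t_pd_max, t_cd_min, t_borrow_max


# Hypothetical values in picoseconds (not taken from the slides).
print(two_phase_latch(t_c=1000, t_pdq=70, t_ccq=40, t_setup=60,
                      t_hold=20, t_nonoverlap=0, t_skew=50))
print(pulsed_latch(t_c=1000, t_pdq=70, t_pcq=80, t_ccq=40, t_setup=60,
                   t_hold=20, t_pw=150, t_skew=50))
```

In both cases skew does not appear in the two-phase t_pd bound, and it drops out of the pulsed-latch bound while the pulse is wide enough; skew instead eats into the borrowing budget and tightens the hold constraint.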
Dynamic Circuit Review
 Static circuits are slow because wide pMOS transistors load the inputs
 Dynamic gates use precharge to remove pMOS transistors from the inputs
– Precharge: φ = 0, output forced high
– Evaluate: φ = 1, output may pull low
[Figure: static and dynamic versions of a gate with inputs A, B, C, D and output Y; the dynamic gate is clocked by φ]
Domino Circuits
 Dynamic inputs must monotonically rise during
evaluation
– Place inverting stage between each dynamic gate
– Dynamic / static pair called domino gate
 Domino gates can be safely cascaded
[Figure: domino AND gate with inputs A and B, built from a dynamic NAND followed by a static inverter; clock φ, nodes W and X]
Domino Timing
 Domino gates are 1.5 – 2x faster than static CMOS
– Lower logical effort because of reduced Cin
 Challenge is to keep precharge off critical path
 Look at clocking schemes for precharge and eval
– Traditional schemes have severe overhead
– Skew-tolerant domino hides this overhead
Traditional Domino Circuits
 Traditional domino circuits have high sequencing overhead because of a hard edge in each half-cycle.
 The first domino gate does not evaluate until the rising edge of the clock, but the result must set up at the latch before the falling edge of the clock.
 Removing the latch would soften the falling edge and cut the overhead.
 The latch serves two functions:
– Prevent nonmonotonic signals from entering the next domino gate while it evaluates
– Hold the results of the half-cycle while it precharges and the next half-cycle evaluates
Traditional Domino Circuits
 Hide precharge time by ping-ponging between half-cycles
– When clk is high (low), the first half-cycle evaluates (precharges) and the second precharges (evaluates)
– Latches hold results during precharge
– The overhead of each latch is its setup time or its D-to-Q propagation delay. Assuming tpdq is the larger, the time left for computation is tpd:
$t_{pd} = T_c - 2t_{pdq}$
[Figure: clk and its complement over one cycle Tc; each half-cycle is a chain of dynamic and static gates ending in a latch, with tpdq lost at each latch]
Clock Skew
 Skew increases sequencing overhead
– Evaluate at latest rising edge
– Setup at latch by earliest falling edge
– Assume skew plus setup time exceeds the latch propagation delay tpdq
 Time for computation tpd:
$t_{pd} = T_c - 2t_{setup} - 2t_{skew}$
[Figure: clk and its complement with skew; each half-cycle is a chain of dynamic and static gates ending in a latch, with tsetup and tskew lost before each latch]
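To see how quickly this overhead grows, the sketch below evaluates the two traditional-domino expressions, t_pd = T_c - 2 t_pdq without skew and t_pd = T_c - 2 t_setup - 2 t_skew once setup plus skew exceeds t_pdq. Taking the larger of the two per-latch costs is an assumption used here to join the two cases; the function name and the picosecond values are hypothetical.

```python
def domino_compute_time(t_c, t_pdq, t_setup, t_skew):
    """Time left for logic per cycle in traditional domino.

    Each of the two latches costs t_pdq, or t_setup + t_skew when that
    sum is larger (assumption: the worse of the two cases applies).
    """
    per_latch = max(t_pdq, t_setup + t_skew)
    return t_c - 2 * per_latch


# Hypothetical values in picoseconds (not taken from the slides).
T_C, T_PDQ, T_SETUP = 1000, 70, 60

for skew in (0, 30, 65):
    t_pd = domino_compute_time(T_C, T_PDQ, T_SETUP, skew)
    overhead = 1 - t_pd / T_C
    print(f"skew={skew} ps: t_pd = {t_pd} ps, overhead = {overhead:.0%}")
```

Because the overhead is paid twice per cycle, a modest skew relative to the cycle time already removes a large fraction of the time available for logic.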
Time Borrowing
 Logic may not exactly fit half-cycle
– No flexibility to borrow time to balance logic
between half cycles
 Traditional domino sequencing overhead is about
25% of cycle time in fast systems!
[Figure: traditional domino pipeline with a latch at the end of each half-cycle of dynamic and static gates; tsetup and tskew are lost before each latch]
Relaxing the Timing
 Sequencing overhead caused by hard edges
– Data departs dynamic gate on late rising edge
– Must set up at latch on early falling edge
 Latch functions
– Prevent glitches on inputs of domino gates
– Holds results during precharge
 Is the latch really necessary?
– No glitches if inputs come from other domino
– Can we hold the results in another way?
Skew-Tolerant Domino
 Use overlapping clocks to eliminate latches at phase
boundaries.
– Second phase evaluates using results of first
[Figure: two-phase skew-tolerant domino with no latch at the phase boundary; dynamic and static gates clocked by overlapping clocks φ1 and φ2, with nodes a, b, c, d and waveforms for a, b, and c]
Clocks nonoverlapping, circuit fails
1. φ1 falls: node a precharges high and node b goes low.
2. φ2 rises: the input to the first domino gate of the second phase has already fallen (b is low), so node c never discharges and the circuit loses information.
Clocks overlapping, circuit works
φ1 and φ2 overlap: φ2 rises while b still holds the correct value, so the φ2 gate evaluates using the results of φ1.
While φ2 is still evaluating and b is low, node c is floating.
Full Keeper
 After second phase evaluates, first phase precharges
 Input to second phase falls
– Violates monotonicity?
 But we no longer need the value
 Now the second gate has a floating output
– Need full keeper to hold it either high or low
[Figure: dynamic gate clocked by φ with pulldown network f, nodes H and X, and weak full keeper transistors]
Latch is unnecessary
 As long as the clock overlap is long enough that the
second phase can evaluate before the first
precharges, the latch between phases is
unnecessary
Time Borrowing
 Overlap can be used to
– Tolerate clock skew
– Permit time borrowing
 No sequencing overhead
$t_{pd} = T_c$
[Figure: overlapping clocks φ1 and φ2; the overlap toverlap covers tborrow and tskew; Phase 1 and Phase 2 are each a chain of dynamic and static gates clocked by φ1 and φ2]
Multiple Phases
 With more clock phases, each phase overlaps more
– Permits more skew tolerance and time borrowing
[Figure: four overlapping clocks φ1 to φ4; Phase 1 through Phase 4 are each built from dynamic/static gate pairs clocked by the corresponding phase]
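A rough way to see why more phases help is to compute the overlap between consecutive phases. The sketch below assumes N evenly spaced phases that are each high for a fixed fraction of the cycle; neither assumption comes from the slides. It also assumes, following the labels on the Time Borrowing figure, that the overlap is shared between tskew and tborrow, and it ignores hold time. The function names and numbers are hypothetical.

```python
def phase_overlap(t_c, n_phases, duty=0.5):
    """Overlap of consecutive phases: high time minus phase spacing.

    Assumes evenly spaced phases (spacing T_c / N), each high for
    duty * T_c. A non-positive result means the phases do not overlap.
    """
    return duty * t_c - t_c / n_phases


def borrow_budget(t_overlap, t_skew):
    """Overlap left for time borrowing once skew is covered.

    Assumes t_overlap = t_borrow + t_skew and ignores hold time.
    """
    return max(0.0, t_overlap - t_skew)


# Hypothetical cycle time and skew in picoseconds.
T_C, T_SKEW = 1000, 100

for n in (2, 4, 8):
    overlap = phase_overlap(T_C, n)
    print(f"N={n}: overlap = {overlap} ps, "
          f"borrow budget = {borrow_budget(overlap, T_SKEW)} ps")
```

Under these assumptions two phases with a 50% duty cycle do not overlap at all, so a two-phase scheme needs clocks that are high for more than half the cycle, while four phases at 50% duty already overlap by a quarter cycle.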
Clock Generation
[Figure: clock generator with inputs en and clk producing phases φ1, φ2, φ3, and φ4]
Summary
 Clock skew effectively increases setup and hold
times in systems with hard edges
 Managing skew
– Reduce: good clock distribution network
– Analyze: local vs. global skew
– Tolerate: use systems with soft edges
 Flip-flops and traditional domino are costly
 Latches and skew-tolerant domino perform at full
speed even with moderate clock skews.