Digital System Clocking:
Storage Elements in High-Performance and
Low-Power Systems
ISSCC 2002 uP Workshop
Final version of this presentation is available at:
http://www.ece.ucdavis.edu/acsel under Presentations
also look for a book under the same title in Summer 2002
Vojin G. Oklobdzija
University of California Davis
Integration Corp.
Berkeley, CA 94708
http://www.integration-corp.com
Outline
• Why work on Clocked Storage Elements?
• An M-S Latch is not a Flip-Flop!
• How do we compare them?
  - What are the relevant parameters?
  - What is an appropriate setup?
• What do we use in high-performance microprocessors?
  - How do they compare?
• What should we do for low power?
  - How do they compare?
• What next? Ideas, Suggestions, Insights
7/18/2015
Prof. V.G. Oklobdzija, University of California
2
Importance
Trends in high-performance systems: Higher clock frequency
[Figure: nominal clock frequency (MHz, 0-2000) vs. year (1975-2005) for commercial processors, from the Cray-1 S and IBM 3090 through the Alpha, PowerPC, UltraSparc, Pentium, and Crusoe families up to the Athlon, Itanium, and Pentium 4]
[Figure: slide courtesy of Doug Carmean, Hot-Chips-13 presentation]
Processor Frequency Trend
[Figure: processor frequency (MHz, log scale) and gate delays per clock period vs. year (1987-2005) for Intel, IBM PowerPC, and DEC processors; annotation: "Processor freq scales 2X per technology generation". Courtesy of: Intel, S. Borkar]
• Frequency doubles each generation
• Number of gate delays per clock decreases by 25%
Why work on Clocked Storage Elements?
Example:
In a 2.0 GHz processor, T = 500 ps
- Typically, the clocked storage element (CSE) D-Q delay is on the order of 100-150 ps
- If one can design a faster CSE, e.g. 80-100 ps D-Q, this represents a 10-15% performance improvement
- If, in addition, one can absorb 20 ps of clock uncertainties and embed one level of logic, this can yield up to a 20% performance improvement
- Try to achieve a 10-20% performance improvement by introducing new features in the architecture!
- This is sufficient to turn an architect into a circuit designer!
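The arithmetic behind this example can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the sketch computes the share of the cycle spent in the storage element, not the overall performance figures quoted on the slide.

```python
def clock_period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


def cse_share_of_cycle(d_q_ps, period_ps):
    """Fraction of the clock period spent in the storage element (D-Q delay)."""
    return d_q_ps / period_ps


def cycle_time_recovered(old_d_q_ps, new_d_q_ps, period_ps):
    """Fraction of the period handed back to logic by a faster storage element."""
    return (old_d_q_ps - new_d_q_ps) / period_ps


period = clock_period_ps(2.0)                 # 500 ps at 2.0 GHz
print(cse_share_of_cycle(100, period))        # lower end of the quoted D-Q range
print(cse_share_of_cycle(150, period))        # upper end of the quoted D-Q range
print(cycle_time_recovered(150, 80, period))  # slowest quoted CSE replaced by the fastest
```

A D-Q delay of 100-150 ps occupies a fifth to almost a third of a 500 ps cycle, which is why the storage element is worth optimizing.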
Basic Definitions
Clock Generation and Distribution Non-idealities
• Jitter
  - Jitter is a temporal variation of the clock signal, manifested as uncertainty in the position of consecutive edges of a periodic clock signal
  - It is caused by temporal noise events
  - Manifested as:
    - cycle-to-cycle or short-term jitter, tJS
    - long-term jitter, tJL
  - Characteristic of the clock generation system
• Skew
  - Skew is the time difference between temporally equivalent or concurrent edges of two periodic signals
  - Manifests as SE-to-SE (storage element to storage element) variation of clock arrival at the same time instant
  - Characteristic of the clock distribution system
  - Caused by spatial variations in signal propagation
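The definitions above can be made concrete with a small sketch that works on lists of clock-edge arrival times in picoseconds. The function names and the sample edge times are hypothetical, and short-term jitter is simplified here to the deviation of each measured period from the nominal one.

```python
def short_term_jitter(edges_ps, period_ps):
    """Deviation of each measured period from the nominal period."""
    return [(b - a) - period_ps for a, b in zip(edges_ps, edges_ps[1:])]


def long_term_jitter(edges_ps, period_ps):
    """Deviation of the n-th edge from its ideal position, measured from the first edge."""
    first = edges_ps[0]
    return [(e - first) - n * period_ps for n, e in enumerate(edges_ps)]


def skew(edges_a_ps, edges_b_ps):
    """Arrival-time difference between corresponding edges at two storage elements."""
    return [b - a for a, b in zip(edges_a_ps, edges_b_ps)]


# Hypothetical arrival times of the same four clock edges at two storage elements.
clk_at_se_a = [10, 512, 1008, 1511]
clk_at_se_b = [25, 527, 1023, 1526]

print(short_term_jitter(clk_at_se_a, 500))  # temporal variation at one location
print(long_term_jitter(clk_at_se_a, 500))   # accumulated drift from the ideal edge positions
print(skew(clk_at_se_a, clk_at_se_b))       # spatial variation between two locations
```

Jitter is computed from one edge sequence over time, while skew compares the same edges at two places, which mirrors the generation versus distribution split on the slide.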
Clock Skew and Jitter
[Figure: timing diagram of Ref_Clock (tDRVCLK) and Received Clock (tRCVCLK) over the period T, marking tskew and tjit on the received clock edges]
Difference between Latch and Flip-Flop
• Flip-Flop (F-F): after the transition of the clock, data cannot change
• Latch: the latch is “transparent”
[Figure: block symbols and Clock/Data/Q waveforms for a flip-flop and for a latch]
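The behavioral difference can be illustrated with a simple discrete-time model. This is a sketch under stated assumptions: the latch is taken to be transparent while the clock is high, the flip-flop is taken to capture on the rising edge, and the function name and sample waveforms are hypothetical.

```python
def simulate(clock, data):
    """Return (latch_q, ff_q) per time step for the same clock and data inputs."""
    latch_q, ff_q, prev_clk = 0, 0, 0
    trace = []
    for clk, d in zip(clock, data):
        if clk == 1:
            latch_q = d            # transparent: output follows data while clock is high
        if prev_clk == 0 and clk == 1:
            ff_q = d               # edge-triggered: capture only at the rising edge
        prev_clk = clk
        trace.append((latch_q, ff_q))
    return trace


clock = [0, 1, 1, 1, 0, 0, 1, 1]
data = [1, 1, 0, 1, 0, 1, 0, 0]
for step, (ql, qff) in enumerate(simulate(clock, data)):
    print(step, "QL =", ql, "QFF =", qff)
```

With these inputs the latch output changes while the clock is high (steps 2 and 3), whereas the flip-flop output changes only at the rising edges.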
Two-Phase Clocking with Two-Phase Double Latch
[Figure: source master (L1S) and slave (L2S) latches, combinational logic, and destination master (L1D) and slave (L2D) latches, clocked by Clock 1 and Clock 2; timing diagram over Cycle 1 and Cycle 2 showing data arrival, the critical path (t1, t2), the positive overlap between the two clock phases, and the critical race]
Two-Phase Clocking with One-Phase Double Latch
[Figure: source master (L1S) and slave (L2S) latches, combinational logic, and destination master (L1D) and slave (L2D) latches, clocked by Clk1 and Clk2; timing diagram over Cycle 1 and Cycle 2 showing data arrival and t1, t2]
• Critical path: L1 is driving through the L2 and the logic into L1
• Negative overlap between the two clock phases: no race
• Some people refer to this latch arrangement as a “negative edge Flip-Flop”!
Flip-Flop and M-S Latch Arrangement
How can one recognize the difference without knowing what is inside the “black-box”?
[Figure: flip-flop (F-F) and latch symbols with Clock/Data/Q waveforms, and a master latch (L1, Clk1) followed by a slave latch (L2, Clk2)]
F-F and M-S Latch: Difference
Experiment:
[Figure: the same Clock and Data applied to a flip-flop (output QFF) and to a latch (output QL), with the resulting waveforms; annotation: "Failed!"]
F-F and M-S Latch: Difference
Structural Difference:
[Figure: block diagrams. Flip-Flop: a pulse generator (inputs: Input, Clock) producing S and R, followed by a pulse-capturing latch with no clock input. M-S Latch: a master latch (L1, output Q1) followed by a slave latch (L2, output Q2), each with its own clock]
PG Theory of Operation: Sn+1
[Figure: Karnaugh map of Sn+1 as a function of Clk, Rn, Sn, and D, with regions labeled Hold previous state, Capture "1", Capture "0", Hold "1", Hold "0", and Not Allowed]
$S_{n+1} = \overline{Clk} + \overline{R_n} + S_n \cdot \overline{D}$
$\overline{S_{n+1}} = Clk \cdot R_n \cdot (\overline{S_n} + D)$
Flip-Flop: Example-2
[Figure: sense-amplifier flip-flop schematic (inputs D, Clk; pulse outputs S, R; outputs Q, Q) with the pulse waveforms for D=0 and D=1]
SAFF DEC Alpha 21264 (Madden & Bowhill, 1990, Matsui 1994)
F-F Derivation using Delayed Clock
[Figure: pulse generator using Clk and the delayed clock Clk', shown with an equivalent circuit, and Karnaugh maps of Sn+1 in terms of Rn, Sn, D and in terms of Clk, Clk', D]
S n1  Clk  Clk 'D  Clk ' S
Systematically Derived ET FF
[Figure: transistor-level schematic (MP1-MP6, MN1-MN8, Inv1-Inv5; inputs D, Clk; internal node X; outputs Q, Qb)]
N. Nedovic, V. G. Oklobdzija, “Dynamic Flip-Flop with Improved Power”, ICCD 2000, Sept. 2000
Flip-Flop: Example (HLFF, H. Partovi)
[Figure: HLFF schematic with the pulse generator driving node X and the second-stage latch (outputs Q, Q); waveforms of the signal at node X for D=0 and D=1]
[Figure: the same circuit with keepers shown, with Data and Clk inputs, the pulse generator, the second-stage latch, and the signal at node X for D=0 and D=1]
Timing and Power metrics
Delay
• The sum of setup time U and Clk-Q delay is the only true measure of performance with respect to system speed
• T = TClk-Q + TLogic + TSetup + TSkew
• TD-Q = TClk-Q + TSetup
[Figure: two D-Q storage elements separated by a Logic block, with a timing diagram dividing the clock period T into TClk-Q, TLogic, and TSetup]
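The cycle-time relation above translates directly into a timing-budget calculation. The sketch below is illustrative only: the function names and the numeric values (120 ps Clk-Q, 30 ps setup, 20 ps skew) are assumptions, not figures from the presentation.

```python
def d_q_delay_ps(t_clk_q_ps, t_setup_ps):
    """T_D-Q = T_Clk-Q + T_Setup: the storage element's total cost per cycle."""
    return t_clk_q_ps + t_setup_ps


def min_clock_period_ps(t_clk_q_ps, t_logic_ps, t_setup_ps, t_skew_ps):
    """T = T_Clk-Q + T_Logic + T_Setup + T_Skew."""
    return t_clk_q_ps + t_logic_ps + t_setup_ps + t_skew_ps


def logic_budget_ps(period_ps, t_clk_q_ps, t_setup_ps, t_skew_ps):
    """Time left for combinational logic once storage and skew are paid for."""
    return period_ps - (t_clk_q_ps + t_setup_ps + t_skew_ps)


# Hypothetical numbers for a 500 ps cycle.
print(d_q_delay_ps(120, 30))
print(logic_budget_ps(500, 120, 30, 20))
print(min_clock_period_ps(120, 330, 30, 20))
```

Any reduction in Clk-Q or setup time goes straight into the logic budget, which is why their sum is the figure of merit.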
Delay vs. Setup/Hold Times
[Figure: Clk-Output delay [ps] (0-350) vs. Data-Clk [ps] (-200 to 200), marking the setup and hold regions, the sampling window, and the minimum Data-Output delay]
Timing Characteristics
[Figure: Data to Output Delay (Td-q) and Tclk-q vs. Clock to Data Delay, showing the stable region, the metastable region, and failure, with the minimum tDQmin at tSU-OPT]
Absorbing Clock Uncertainties
Hybrid Latch Flip-Flop
Skew absorption
Partovi et al, ISSCC’96
Power Consumption
• All power related to the SE can be divided into:
  - Input power
    - Data power (PD)
    - Clock power (PCLK)
  - Internal power (PINT)
  - Load power (PLOAD)
• PLOAD can be merged into PINT
• Internal power is a function of
  - data activity ratio (α): number of captured data transitions with respect to number of clock transitions (max = 100%)
    - no activity (0000… and 1111…)
    - maximum activity (0101010..)
    - average activity (random sequence)
  - Glitching activity
[Figure: storage element with D and CLK input drivers and Q, Qb outputs, each supplied from VDD, annotated with PD, PCLK, PINT, and PLOAD]
Ptot = Pinternal + Σ Pdriver over the inputs (D, CLK)
Delay is (minimum D-Q): Clk-Q + setup time
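The activity ratio and the total-power sum can be sketched as follows. The function names are hypothetical, activity is simplified to transitions per clock cycle in a captured bit stream, and the power values are the transmission-gate MS latch figures quoted later in this presentation.

```python
def data_activity(bits):
    """Captured data transitions divided by the number of clock cycles observed."""
    if len(bits) < 2:
        return 0.0
    transitions = sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    return transitions / (len(bits) - 1)


def total_power_uw(p_internal_uw, p_data_uw, p_clock_uw):
    """P_tot = P_internal + driver power of the D and CLK inputs (all in uW)."""
    return p_internal_uw + p_data_uw + p_clock_uw


print(data_activity("00000000"))   # no activity
print(data_activity("11111111"))   # no activity
print(data_activity("01010101"))   # maximum activity
print(total_power_uw(80.0, 11.1, 32.1))
```

Reporting internal, data, and clock power separately makes it visible whether a structure is expensive to operate or expensive to drive.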
State Element Performance Metrics
It is always possible to trade power for speed
Common metrics:
• Power-Delay Product (PDP)
  - Misleading measure
  - Good only if measured at constant frequency = EDP
• Energy-Delay Product (EDP)
  - More accurate measure
• Energy-Delay² Product (ED2P)
  - A new measure, being justified by new results (Hofstee, Nowka, IBM)
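The three metrics can be sketched with explicit unit handling. The function names are hypothetical and the operating point (123.2 uW, 300 ps, 500 MHz) is used for illustration only. The last line shows why PDP is meaningful only at a fixed clock frequency: there it differs from EDP just by the constant clock period.

```python
def pdp_fj(power_uw, delay_ps):
    """Power-delay product in fJ (1 uW * 1 ps = 1e-3 fJ)."""
    return power_uw * delay_ps * 1e-3


def energy_per_cycle_fj(power_uw, freq_mhz):
    """Energy per clock cycle in fJ (1 uW / 1 MHz = 1e3 fJ)."""
    return power_uw / freq_mhz * 1e3


def edp_fj_ns(power_uw, freq_mhz, delay_ps):
    """Energy-delay product in fJ*ns."""
    return energy_per_cycle_fj(power_uw, freq_mhz) * delay_ps * 1e-3


def ed2p_fj_ns2(power_uw, freq_mhz, delay_ps):
    """Energy-delay^2 product in fJ*ns^2."""
    return energy_per_cycle_fj(power_uw, freq_mhz) * (delay_ps * 1e-3) ** 2


p_uw, d_ps, f_mhz = 123.2, 300.0, 500.0
print(pdp_fj(p_uw, d_ps))
print(edp_fj_ns(p_uw, f_mhz, d_ps))
print(ed2p_fj_ns2(p_uw, f_mhz, d_ps))
print(pdp_fj(p_uw, d_ps) * (1e3 / f_mhz))  # PDP times the clock period in ns equals EDP
```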
PDP, EDP vs. Process Corners
[Figure: PDP (fJ) and EDP (fJ*ns) series EDP_PC, EDP_SA, EDP_SD and PDP_PC, PDP_SA, PDP_SD across process corners at high voltage (HiV_Slow, HiV_Typ, HiV_Fast) and low voltage (LoV_Slow, LoV_Typ, LoV_Fast)]
Design & optimization tradeoffs
• Opposite Goals
  - Minimal total power consumption
  - Minimal delay
• Power-Delay tradeoff
• Minimize Power-Delay product (PDPtot) @ f=const.
[Figure: three plots of PDPtot [fJ] (0-90) vs. Total Power [uW], vs. Width [um], and vs. Delay [ps], each with the optimum (Opt.) marked]
Clocked Storage Elements:
Examples
Simulation Conditions:
• Power Supply Voltage: VDD=1.8V nominal
• Temperature T=27°C nominal
• Technology: 0.18um Fujitsu
• Fan-Out of 4 Delay = 75ps
• Transistor Widths
  - Minimal 0.36um
  - Maximal 10um
• Load: 14 minimal inverters in the technology used
• Clock frequency: 500MHz (250MHz for Dual-Edge)
• Data/Clock slopes of ideal signal 100ps
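Delays in this presentation are also quoted in fan-out-of-4 (FO4) units. The conversion is a simple normalization by the 75 ps FO4 delay given above; the function name below is hypothetical and the sample delays are ones quoted in the tables of this presentation.

```python
FO4_DELAY_PS = 75.0  # FO4 delay from the simulation conditions


def delay_in_fo4(delay_ps, fo4_ps=FO4_DELAY_PS):
    """Express a delay in units of the technology's fan-out-of-4 inverter delay."""
    return delay_ps / fo4_ps


for delay_ps in (300, 354, 169, 188, 292):
    print(delay_ps, "ps =", round(delay_in_fo4(delay_ps), 2), "FO4")
```

Normalizing to FO4 makes the delays roughly comparable across process technologies.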
Transmission Gate MS Latch
• Two staticized transmission gate transparent latches
• Direct path D-Q consists of two transmission gates and two regenerative inverters
• Two-phase clock
  - Advantage: symmetric high-to-low and low-to-high transitions are achievable
  - Disadvantage: large cost associated with two-phase clock distribution
[Figure: schematic with input D, output Q, and clocks CLK, CLKB]
PowerPC 603 (Gerosa, JSSC 12/94)
tD [ps] {fo4} | PI [uW] | PCLK [uW] | PD [uW] | PTOT [uW] | EDP [fJ/500MHz]
300 {4} | 80.0 | 32.1 | 11.1 | 123.2 | 36.9
Comments:
  - Very low internal power.
  - Large Total Power due to clock and data load
C2MOS MS Latch
• Forward path consists of two clocked inverters - parts of C2MOS latches
• Degradation of speed due to pMOS stacks
• Degradation in speed due to non-ideal 2-phase clock
• Large clock power (if not buffered locally)
[Figure: schematic with input D, output Q, and clocks CLK, CLKB]
tD [ps] {fo4} | PI [uW] | PCLK [uW] | PD [uW] | PTOT [uW] | EDP [fJ/500MHz]
354 {4.7} | 110.8 | 27.5 | 2.8 | 141.1 | 49.9
Y. Suzuki, “Clocked CMOS Calculator Circuitry”, IEEE J. Solid-State Circuits, Dec. 1973
SAFF: Strong Arm 110
• Staticized Sense Amplifier Flip-Flop
• Weak nMOS keeps set/reset signals low
• Second stage: non-clocked SR latch
• Additional nMOS transistor causes slightly increased power consumption and delay degradation
• Bad timing characteristics due to the latching stage. Signal propagates through three stages.
• Unbalanced rising and falling time of the output signals (speed degraded by 40%)
[Figure: schematic with inputs D, D, Clk and outputs Q, Q]
tD [ps] {fo4} | PI [uW] | PCLK [uW] | PD [uW] | PTOT [uW] | PDP [fJ]
323 {4.31} | 79.7 | 4.2 | 0.5 | 84.8 | 27.4
Modified SAFF
• The first stage is unchanged
sense amplifier
• Second stage is sized to
provide maximum switching
speed
• Driver transistors are large
• Keeper transistors are small
and disengaged during
transitions
V. Stojanovic, US Patent No. 6,232,810
Nikolic, Oklobdzija, Stojanovic ISSCC ‘99
Systematically Derived SAFF: Example-2
Nikolic, Oklobdzija, ESSCIRC’99
• New pulse-generating stage
• Inverters decoupling gates from MN3, MN4
• MN5, MN6 provide leakage current paths
• Second stage is unchanged
[Figure: transistor-level schematic (MP1-MP10, MN1-MN13, inverters I1-I4; inputs D, D, Clk; internal signals S, R; outputs Q, Q)]
V. Stojanovic, US Patent No. 6,232,810.
Sense Amplifier-based Flip-Flop (SAbFF)
• Emerged as a workaround for SAFF drawbacks
  - floating nodes (keeping the Sb, Rb nodes low with additional transistors parallel to data-controlled transistors)
  - symmetric second stage (push-pull realization)
• Internal signals still experience transition on every clock cycle
[Figure: schematic with inputs D, D, Clk, internal signals S, R, and outputs Q, Q]
V. Stojanovic, US Patent No. 6,232,810.
tD [ps] {fo4} | PI [uW] | PCLK [uW] | PD [uW] | PTOT [uW] | EDP [fJ/500MHz]
169 {2.25} | 100.8 | 5.8 | 1.3 | 107.9 | 18.2
Comparison with other SAFFs
Nikolic, Oklobdzija, ESSCIRC’99
CMOS, nominal corner, Leff = 0.18um, VDD = 1.8V, T = 25°C, load on both outputs
[Figure: Clk-Output Delay [ps] (0-800) vs. Load [fF] (0-250) for the rising-edge SAFF, rising-edge SAFF w/NAND, falling-edge SAFF w/NOR, and the rising-edge and falling-edge SAFFs of this work]
Conditional Capture Flip-Flop (CCFF)
0.18um Fujitsu; f = 500MHz; VDD = 1.8V; Data activity 50%
• Principle of Operation
  - Suppress any transition in flip-flop if the input to be captured is equal to previous output value
• Double-ended realization
  - FF functionality achieved by producing clock pulse
  - Static operation by use of keepers
  - Second stage is pass-transistor latch
Comments
  - Contention with keepers causes larger first stage
  - Large power consumption despite conditional signaling
[Figure: schematic with inputs D, Db, Clk, internal nodes Sb, Rb, Sbb, Rbb, and outputs Q, Qb]
tD [ps] | PI [uW] | PCLK [uW] | PD [uW] | PTOT [uW] | EDP [fJ/500MHz]
169 | 112.5 | 17.0 | 2.6 | 132.1 | 22.3
B. S. Kong, et al., ISSCC 2000
Partovi’s HLFF
• Hybrid Latch-Flip-Flop combination
• Negative set-up time of -80ps
• Robustness to clock skew and fast clocking
[Figure: schematic with inputs D, Clk and outputs Q, Q]
AMD K-6, Partovi, ISSCC’96
Our simulations show
tD [ps] {fo4} | PI [uW] | PCLK [uW] | PD [uW] | PTOT [uW] | EDP [fJ/500MHz]
188 {2.51} | 161.3 | 18.0 | 4.4 | 183.8 | 34.5
• Gains
  - speed (negative setup time)
  - robustness to clock skew
• Drawbacks
  - sensitivity to clock slope
  - relatively high internal power (due to precharge)
Semi-Dynamic Flip-Flop
• Hybrid combination used in UltraSPARC-III
• Very fast circuit (173ps Clk-Q delay, 0.18um technology, 1.8V, 27°C)
• Problem D=Q=1:
[Figure: schematic with inputs D, Clk and outputs Q, Q]
F. Klass, VLSI Circuits’98
Our simulations show
tD [ps] | PI [uW] | PCLK [uW] | PD [uW] | PTOT [uW] | EDP [fJ/500MHz]
169 | 188.6 | 34.1 | 2.7 | 224.9 | 38.1
• Negative setup time
• Feature of small penalty for embedded logic
• Relatively high internal power consumption and clock load
Transmission Gate Flip-Flop (TGFF)
• Two transmission gates define transparency window
• Time window with non precharge-evaluate structure
  - Low input activity => low output activity
[Figure: schematic with input D, outputs Q, QB, and clocks CLK, CLKB, CLKBB]
tD [ps] {fo4} | PI [uW] | PCLK [uW] | PD [uW] | PTOT [uW] | EDP [fJ/500MHz]
292 {3.89} | 110.5 | 8.7 | 9.3 | 128.5 | 37.5
Comments:
  - Two transmission gates increase delay
  - Noticeable data power
Comparison
Overall Results: Delay Comparison (50% activity)
[Figure: bar chart of Delay [ps] (0-400, with 2 fo4 and 4 fo4 reference levels) for PowPC, C2MOS, HLFF, SDFF, SE CCFF, TGFF, StrongArm, SAbFF, and DE CCFF, grouped as MS Latch, Flip-Flop, and Differential]
Overall Results: Single-Edge Triggered Structures Power Consumption Comparison (50% activity)
[Figure: stacked bar chart of Power Consumption [uW] (0-250), split into Internal Power, Clock Power, and Data Power, for the MS Latch, Single Ended, and Dual Ended structures]
Overall Results: Power Consumption vs. Data Activity
[Figure: bar chart of Power [uW] (0-300) at data activity 0% (0), 0% (1), 33%, 50%, and 100% for the MS Latches, Single Ended, and Dual Ended structures]
Conventional Clk-Q vs. minimum D-Q
[Figure: Total power [uW] (0-400) vs. Delay [ps] (150-650), and Total Power [uW] (0-400) vs. Clk-Q delay [ps] (100-350), for HLFF, SDFF, PowerPC, Strong Arm FF, Alpha 21264 FF, mC2MOS latch, K6 ETL, SSTC, and DSTC]
• Hidden positive setup time
• Degradation of Clk-Q
Older 0.22u comparison results
Internal Power distribution
[Figure: Internal Power [uW] (0-400) by data pattern (random, activity=0.5; …01010101…, activity=1; …11111111…, activity=0; …00000000…, activity=0) for HLFF, SDFF, PowerPC 603 latch, mC2MOS latch, StrongARM FF, Alpha 21264 FF, and K6 ETL]
• Four sequences characterize the boundaries for internal power consumption
  - …010101…: maximum
  - random, equal transition probability: average
  - …111111…: precharge activity
  - …000000…: leakage + internal clock processing
Older 0.22u comparison results
Comparison of Clock power consumption
[Figure: horizontal bar chart of Local Clock power consumption [uW] (0-50) for DSTC MS latch, SSTC MS latch, K6 ETL, StrongArm FF, SA-F/F, mC2MOS, PowerPC MS latch, SDFF, and HLFF]
Older 0.22u comparison results
Design for Low-Power
Conditional Pre-charge / Capture
Techniques
Conditional Capture Flip-Flop
• Use conditional capture idea
  - When Q=1, 1=>0 transition of X is prohibited
• To equalize 1=>0 and 0=>1 set-up times, the signal from the middle of the stack (Y) controls HL transition on Q
  - Y is output of the first stage of domino-like inverter, obtained almost for free
  - Easy logic embedding
• First stage has dynamic behavior only in transparency window
[Figure: schematic with inputs D, CLK, CLKBB, internal nodes X and Y, and outputs Q, QB]
tD [ps] {fo4} | PI [uW] | PCLK [uW] | PD [uW] | PTOT [uW] | EDP [fJ/500MHz]
257 {3.43} | 110.8 | 10.2 | 0.7 | 121.7 | 31.3
(Im-CCFF: Nedovic, Oklobdzija, ICECS 2001)
Power Consumption Comparison vs. Data Activity
[Figure: bar chart of Power [uW] (0-300) at D=0, D=1, 33%, 50%, and 100% data activity for CCFF, CPFF, ACPFF, Im-CCFF, HLFF, TG MS Latch, and C2MOS MS, grouped as Conditional, Non-Conditional, and MS Latches]
Nedovic, Oklobdzija, SBCCI 2000, ICECS 2001
NOTE: Conditional flip-flops behave like MS latches with respect to input data activity
Dual-Edge-Triggered
Clocked Storage Elements
DET-CSE
Dual-Edge Triggered CSE
• Dual-Edge Triggered Clocked Storage Element (DET-CSE) samples the input data on both edges of the clock
• Useful for reducing overall power consumption
  - Uses half of the original clock frequency for the same data throughput
  - Roughly half of clock generation/distribution/SE-clock-related power is saved
• However, an overhead of more complex design may exist
[Figure: DET-CSE symbol and Clk/D/Q waveforms with setup and clock-to-Q times for the rising edge (tSU,r, tCQ,r) and the falling edge (tSU,f, tCQ,f)]
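The frequency argument can be written out as a short sketch. The function names are hypothetical, and the power estimate assumes, to first order, that clock power scales linearly with clock frequency for an unchanged clock load; the slide's "roughly half" reflects the same approximation.

```python
def required_clock_mhz(samples_per_us, edges_per_cycle):
    """Clock frequency needed to capture a given number of data samples per microsecond."""
    return samples_per_us / edges_per_cycle


def scaled_clock_power_uw(single_edge_clock_power_uw, edges_per_cycle):
    """First-order estimate: clock power proportional to clock frequency (assumption)."""
    return single_edge_clock_power_uw / edges_per_cycle


print(required_clock_mhz(500, 1))       # single-edge triggered
print(required_clock_mhz(500, 2))       # dual-edge triggered, same throughput
print(scaled_clock_power_uw(32.1, 2))   # illustrative clock power value in uW
```

The estimate ignores the extra circuitry a dual-edge element needs, which is the overhead noted in the last bullet.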
Dual-Edge Triggered Storage Elements
• Structurally, two different designs are distinguished
  - a) Latch-Mux (LM)
  - b) Flip-Flop (FF)
• Classification very similar to single edge triggered SE
[Figure: a) DET-Latch: two latches clocked by CLK feeding a 0/1 multiplexer, with non-transparency achieved by the MUX; b) DET-Flip-Flop: clocked blocks with S and R outputs driving the outputs Q and Qb]
Transmission Gate Latch-MUX (TGLM)
0.18um Fujitsu; f = 250MHz (500MHz for Single-Edge); VDD = 1.8V; Data Activity 50%
• Dual-edge counterpart of PowerPC MS latch
• Mux: pass-transistor manner
  - Smaller delay compared to Single-Edge TGMS Latch
• Original design has single-phase input clock
  - Second phase is generated locally
  - Better global power savings
  - Degradation in speed
[Figure: transistor-level schematic (Mp1, Mp2, Mn1, Mn2 and their "b" counterparts, INV1, INV1B, INV2, INV3; nodes SWIN, SWOUT, SWINB, SWOUTB; input D, outputs Q, QB, clocks CLK, CLKB)]
   | tD [ps] | PI [uW] | PCLK [uW] | PD [uW] | PTOT [uW] | EDP [fJ/250MHz]
DE | 322 | 83.3 | 20.5 | 7.8 | 111.6 | 35.9
SE | 300 | 80.0 | 32.1 | 11.1 | 123.2 | 36.9
C2MOS Latch-MUX
0.18um Fujitsu; f = 250MHz (500MHz for Single-Edge); VDD = 1.8V; Data activity 50%
• Latches: incomplete C2MOS with shared clocked transistors
• Mux
  - Exactly one path is ‘ON’ at each moment
  - Simple connection of latch outputs (wired-OR mux) simplifies the design and saves performance
[Figure: transistor-level schematic (Mp1-Mp4, Mn1-Mn4 and their "b" counterparts, INVCLK, INVOUT; nodes N2-N5; input D, outputs Q, QB, clocks CLK, CLKB)]
   | tD [ps] | PI [uW] | PCLK [uW] | PD [uW] | PTOT [uW] | EDP [fJ/250MHz]
DE | 268 | 122.9 | 27.3 | 8.1 | 158.3 | 42.5
SE | 354 | 110.8 | 27.5 | 2.8 | 141.1 | 49.9
TG Flip-Flop
0.18um Fujitsu; f = 250MHz (500MHz for Single-Edge); VDD = 1.8V; Data activity 50%
• Simple design with 2 pass-transistor TG’s and buffering inverter
• Original design
  - semi-dynamic (only high level is kept)
  - Only n-MOS pass-gates
  - Degraded performance
• Two TG’s impair the driving capability
[Figure: schematics (a) and (b) with input D, outputs Q, QB, and clocks CLK, CLKB, CLKBB]
   | tD [ps] | PI [uW] | PCLK [uW] | PD [uW] | PTOT [uW] | EDP [fJ/250MHz]
DE | 374 | 104.7 | 7.8 | 13.2 | 125.7 | 47.1
SE | 292 | 110.5 | 8.7 | 9.3 | 128.5 | 37.5
Comparison with Single Edge SEs
[Figure: bar chart of EDP (0-60; fJ/500MHz for Single Edge, fJ/250MHz for Dual Edge) for TGLM/MS, C2MOS, TGFF, and TSPC]
[Figure: bar chart of Total Power [uW] (0-300) for the SE and DE versions of TGMS/LM, C2MOS, TGFF, and TSPC at Activity = 0 [0s], Activity = 0 [1s], Activity = 0.5, and Activity = 1]
Single and Double Edge Triggered SE: Power Consumption (α=50%)
[Figure: stacked bar chart of Power [uW] (0-140), split into Internal Power, Clock Power, and Data Power, for PowerPC, DET TGLM, SET C2MOS, DET C2MOS, SET TGFF, DET TGFF, SET TSPC, and DET TSPC]
DET-CSE: Power vs. Delay
[Figure: Total Power [uW] (0-300) vs. Delay [ps] (180-400) for TGLM, C2MOS, TGFF, TSPC, DETACPFF, DETCPFF, and DETDTFF, with a group labeled "Our designs", constant-EDP contours from 10 fJ/250MHz to 50 fJ/250MHz, and Fo4=2.9 marked]
Overall Comparisons
Overall Results: EDP Comparison (50% activity), Single Ended - Dual Ended Structures
[Figure: bar chart of EDP [fJ/500MHz] (0-60) for the single-ended structures (PowPC, C2MOS, SE CCFF, SE CPFF, SE ACPFF, imCCFF, DTFF, DTFF-RP, DTFF-SYM, HLFF, SDFF, TGFF) and the differential structures (SSTC, DSTC, DE CCFF, GFLFF, Diff CPFF, Diff ACPFF, StrongArm, SAbFF), with groups labeled MS Latches, Conditional, and Our designs]
Overall Results: Delay Comparison
[Figure: bar chart of Delay [ps] (0-600, with Fo4=4 marked) for the same single-ended and differential structures, with groups labeled MS Latches and Our designs]
New Structures: Clock Power Consumption Comparison
[Figure: bar chart of Clock Power [uW] (0-40) for the same single-ended and differential structures, with groups labeled MS Latch, Conditional, and Our Designs]
Dual Ended Structures: EDP Comparison, EDP vs. Data Activity
[Figure: bar chart of EDP [fJ/500MHz] (0-50) at 0% (GND), 0% (VDD), 33%, 50%, and 100% activity for DE CCFF, GFLFF, Diff CPFF, Diff ACPFF, StrongArm, and SAbFF, with groups labeled Conditional, Our designs, and Non-Conditional]
Power vs. Delay (Single-Ended)
[Figure: Power [uW] (0-350) vs. Delay [ps] (150-400) for SE CCFF, CPFF, ACPFF, imCCFF, DTFF, DTFF-RP, DTFF-SYM, HLFF, SDFF, PowPC, C2MOS, and TGCPFF, with constant-EDP contours from 10 fJ/500MHz to 50 fJ/500MHz]
Data activity points: 0% (0s), 0% (1s), 33%, 50%, 100%
Power vs. Delay (Differential)
[Figure: Power [uW] (0-300) vs. Delay [ps] (140-340) for StrongArm, SAbFF, DE CCFF, GFLFF, Diff CPFF, and Diff ACPFF, with a group labeled "Our Designs", constant-EDP contours from 5 fJ/500MHz to 40 fJ/500MHz, and Fo4=2.07 marked]
Power vs. Delay (Differential, zoomed)
[Figure: Power consumption [uW] (0-300) vs. Delay [ps] (150-180) for SAbFF, DE CCFF, GFLFF, Diff CPFF, and Diff ACPFF, with constant-EDP contours from 5 fJ/500MHz to 40 fJ/500MHz and Fo4=2.07 marked]
Dual-Edge vs. Conditional vs. Conventional: Power with input activity
[Figure: bar chart of Power [uW] (0-300) at data activity 0% (vdd), 33%, and 50% for the Dual-Edge (Latch-Mux: TGLM, C2MOS; Flip-Flop: DETCPFF, DETDTFF), Conditional (single-ended: CPFF, imCCFF; differential: GFLFF, Diff CPFF), and Conventional (single-ended: DTFF, DTFF-SYM, HLFF, SDFF; differential: StrongArm, SAbFF) structures]
Dual-Edge vs. Conditional vs. Conventional: Delay
[Figure: bar chart of Delay [ps] (0-350, with Fo4=2 and Fo4=4 marked) for the same Dual-Edge, Conditional, and Conventional structures]
Dual-Edge vs. Conditional vs. Conventional: EDP with input activity
[Figure: bar chart of EDP [fJ/250MHz, fJ/500MHz] (0-80) at data activity 0% (vdd), 33%, and 50% for the same structures]
Dual-Edge vs. Conditional vs. Conventional: Power Break-up @ 50% activity
[Figure: stacked bar chart of Power [uW] (0-250), split into Data, Clock, and Internal power, for the same structures, with the Dual-Edge group divided into Latch-Mux and Pulsed Latch]
What to do and what to expect?
• Important:
  - Incorporating logic into the CSE
  - Absorbing clock skew
  - Quiet state (battery powered applications)
• Pipeline boundaries will start to blur
• CSE will be mixed with logic
• Wave pipelining, domino style, signals used to clock
• Synchronous design only in a limited domain
• Asynchronous communication between synchronous domains