396-ps 32-bit Han-Carlson ALU in 180nm TSMC process Liang-Kai Wang VLSI CAD Lab

396-ps 32-bit Han-Carlson ALU
in 180nm TSMC process
Liang-Kai Wang
VLSI CAD Lab
University of Wisconsin, Madison
2016/5/26
Outline
- Review of Adders
- The Idea of Han-Carlson Adder
- The Implementation of Han-Carlson Adder
- Simulation Result
- Discussion
- Comparison between Ling’s and H-C Adder
- Future Work
- Reference
Review of Adders

Carry Ripple Adder
Review of Adders(cont.)

Carry Skip Adder
Review of Adders(cont.)

Carry-Select Adder
Review of Adders(cont.)

Carry-Save Adder
Review of Adders(cont.)

Carry Lookahead Adder
Review of Adders(cont.)

Ling Adder
$g_i = a_i b_i$
$p_i = a_i + b_i$
$t_i = g_i \oplus p_i$ (transfer signal)
Observation: $g_i = p_i g_i$
$s_7 = t_7 \oplus c_7$
$c_7 = G_1 + P_1 G_0$
$G_1 = G_{7:4} = g_7 + p_7 g_6 + p_7 p_6 g_5 + p_7 p_6 p_5 g_4 = p_7 (g_7 + g_6 + p_6 g_5 + p_6 p_5 g_4)$
$G_0 = G_{3:0} = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0 = p_3 (g_3 + g_2 + p_2 g_1 + p_2 p_1 g_0)$
$P_1 = P_{7:4} = p_7 p_6 p_5 p_4$
$H_1 = g_7 + g_6 + p_6 g_5 + p_6 p_5 g_4$
$H_0 = g_3 + g_2 + p_2 g_1 + p_2 p_1 g_0$
$P_{6:3} = p_6 p_5 p_4 p_3$
$h_8 = H_1 + P_{6:3} H_0$ (pseudo-carry)
$c_7 = p_7 h_8$
$\Rightarrow s_7 = t_7 \oplus (p_7 h_8)$
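The factoring on this slide can be checked by brute force. The sketch below is only an illustration: it assumes the propagate signal is the OR of the operand bits (the operators were lost in the slide text, and the observation $g_i = p_i g_i$ only holds for OR), and the function name is hypothetical.

```python
def ling_carry_check(a: int, b: int) -> bool:
    """Compare the CLA carry out of bits 7..0 with the Ling form p7 * h8."""
    g = [(a >> i) & (b >> i) & 1 for i in range(8)]      # g_i = a_i AND b_i
    p = [((a >> i) | (b >> i)) & 1 for i in range(8)]    # p_i = a_i OR b_i (assumed)

    # Conventional group generate / propagate signals.
    G1 = g[7] | (p[7] & g[6]) | (p[7] & p[6] & g[5]) | (p[7] & p[6] & p[5] & g[4])
    G0 = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    P1 = p[7] & p[6] & p[5] & p[4]
    c7_cla = G1 | (P1 & G0)

    # Ling pseudo-carry: the leading p7 / p3 factor is pulled out of each group.
    H1 = g[7] | g[6] | (p[6] & g[5]) | (p[6] & p[5] & g[4])
    H0 = g[3] | g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
    P63 = p[6] & p[5] & p[4] & p[3]
    h8 = H1 | (P63 & H0)
    c7_ling = p[7] & h8

    # Reference: carry out of a plain 8-bit addition with no carry-in.
    c_ref = ((a + b) >> 8) & 1
    return c7_cla == c7_ling == c_ref


# Exhaustive check over all pairs of 8-bit operands.
all_ok = all(ling_carry_check(a, b) for a in range(256) for b in range(256))
```

The point of the Ling form is that $H_1$ and $H_0$ each have one fewer literal per product term than $G_1$ and $G_0$, while the final AND with $p_7$ is cheap.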
Review of Adders (cont.)
- Hybrid
- (Parallel) Prefix Adder
- Brent-Kung Adder
- Kogge-Stone Adder
- Han-Carlson Adder
Review of Adders(cont.)

Brent-Kung Adder

Cost: $C(k) = C(k/2) + k - 1 = 2k - 2 - \log_2 k$ (number of adder cells)

Time: $2\log_2 k - 2$ (in terms of adder levels)
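As a quick sanity check, the recurrence and the closed form for the Brent-Kung cell count can be compared numerically. This is a minimal sketch that assumes the base case C(1) = 0 (not stated on the slide) and a power-of-two width; the function names are illustrative.

```python
def brent_kung_cells(k: int) -> int:
    """Cell count from the recurrence C(k) = C(k/2) + k - 1, assuming C(1) = 0."""
    if k == 1:
        return 0
    return brent_kung_cells(k // 2) + k - 1


def brent_kung_closed_form(k: int) -> int:
    """Closed form 2k - 2 - log2(k) for k a power of two."""
    log2k = k.bit_length() - 1
    return 2 * k - 2 - log2k


def brent_kung_levels(k: int) -> int:
    """Delay in adder levels as given on the slide: 2*log2(k) - 2."""
    return 2 * (k.bit_length() - 1) - 2


# The two cost expressions agree for every power-of-two width tried here.
widths = [2, 4, 8, 16, 32, 64]
costs_match = all(brent_kung_cells(k) == brent_kung_closed_form(k) for k in widths)
```

For the 32-bit case used in this design, both expressions give 57 cells, with 8 levels of delay.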
Review of Adders(cont.)

Kogge-Stone Adder

Cost: $k\log_2 k - (k - 1)$

Time: $\log_2 k$
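The Kogge-Stone cost can also be obtained by counting cells level by level. The sketch below assumes the usual construction in which, at distance d = 1, 2, 4, ..., every bit position i >= d merges with position i - d; the function name is illustrative.

```python
def kogge_stone_cells_and_levels(k: int) -> tuple[int, int]:
    """Count merge cells and levels of a k-bit Kogge-Stone network (k a power of two)."""
    cells = 0
    levels = 0
    d = 1
    while d < k:
        cells += k - d      # positions d .. k-1 each get one merge cell
        levels += 1
        d *= 2
    return cells, levels


def kogge_stone_closed_form(k: int) -> int:
    """Closed form k*log2(k) - (k - 1) from the slide."""
    log2k = k.bit_length() - 1
    return k * log2k - (k - 1)
```

For k = 32 the count is 31 + 30 + 28 + 24 + 16 = 129 cells in 5 levels, which matches the closed form.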
The idea of Han-Carlson Adder

Han-Carlson Adder
- B-K adder: small area, but slow
- K-S adder: large area, but fast
- Going from B-K to K-S:
  - Speed: $2\log_2 n - 2 \rightarrow \log_2 n$ (about 1/2 reduction)
  - Cost: $2k - 2 - \log_2 k \rightarrow k\log_2 k - k + 1$ (about $\log_2 k / 2$ increase)
- The area-time tradeoff results in the Han-Carlson adder
The idea of Han-Carlson Adder (cont.)

Han-Carlson Adder

Cost: $O((k/2)\log_2 k)$
Time: $O(\log_2 k + 1)$
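To see where the cost and delay figures come from, the network can be built structurally and counted. This is a sketch of the textbook Han-Carlson arrangement (one Brent-Kung style stage, Kogge-Stone stages on the odd positions, one final stage for the even positions); it is an assumption that the design in these slides uses the same arrangement, and the function name is illustrative.

```python
def han_carlson_stages(k: int) -> list[list[tuple[int, int]]]:
    """Return the merge cells of each stage as (position, partner) pairs, k a power of two."""
    stages = []

    # First stage: every odd position merges with the even position below it.
    stages.append([(i, i - 1) for i in range(1, k, 2)])

    # Middle stages: Kogge-Stone style merges among odd positions only.
    d = 2
    while d < k:
        stages.append([(i, i - d) for i in range(d + 1, k, 2)])
        d *= 2

    # Last stage: even positions (except bit 0) merge with the odd position below.
    stages.append([(i, i - 1) for i in range(2, k, 2)])
    return stages


stages_32 = han_carlson_stages(32)
levels_32 = len(stages_32)
cells_32 = sum(len(stage) for stage in stages_32)
```

For k = 32 this gives 6 stages (log2 k + 1) and 16 + 15 + 14 + 12 + 8 + 15 = 80 cells, which equals (k/2) * log2 k.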
Review of Adders(cont.)

Optimized Brent-Kung Adder

Cost: $C(k) = C(k/2) + k - 1 = 2k - 2 - \log_2 k$

Time: $\log_2 k$ (in terms of adder levels)
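The idea of Han-Carlson Adder (cont.)

| | B-K (original) | K-S | B-K (optimal) | H-C |
|---|---|---|---|---|
| Cost | $2k - 2 - \log_2 k$ | $k\log_2 k - (k - 1)$ | $2k - 2 - \log_2 k$ | $(k/2)\log_2 k$ |
| Delay | $2\log_2 k - 2$ | $\log_2 k$ | $\log_2 k$ | $\log_2 k + 1$ |
| Max. FO | 2 | 2 | 4 | 2 |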
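Plugging the 32-bit width of this design into the cost and delay expressions makes the tradeoff concrete. The sketch below only evaluates the formulas quoted in the slides; the function and dictionary keys are illustrative names.

```python
def adder_metrics(k: int) -> dict[str, tuple[int, int]]:
    """Return (cost in cells, delay in levels) per architecture for k a power of two."""
    m = k.bit_length() - 1          # log2(k)
    return {
        "B-K (original)": (2 * k - 2 - m, 2 * m - 2),
        "K-S": (k * m - (k - 1), m),
        "B-K (optimal)": (2 * k - 2 - m, m),
        "H-C": ((k // 2) * m, m + 1),
    }


metrics_32 = adder_metrics(32)
for name, (cost, delay) in metrics_32.items():
    print(f"{name:15s} cost={cost:4d} cells  delay={delay} levels")
```

At k = 32 the Han-Carlson adder needs 80 cells against 129 for Kogge-Stone, at the price of one extra level (6 instead of 5).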
The idea of Han-Carlson Adder (cont.)
- Produce the generate, propagate, and partial-sum bits in the first stage.
- Single-rail circuit, with double-rail in the last stage to perform the XOR function: Sum = Partial_Sum XOR CarryIn
- Improved: domino circuit with the odd stages dynamic and the even stages static.
The implementation of Han-Carlson Adder
- Schematic design is done in Composer and simulation in Spectre. Both are part of the Cadence design kit.
- The simulation results are from the schematic (pre-layout).
- The best speed is achieved by using the fast mode in the technology file instead of tuning the bulk voltage.
- The clock is generated by a ring oscillator with five inverters in the loop.
- Cadence tutorials for both tools, and for setting up the environment, are provided here.
The implementation of Han-Carlson Adder (cont.)

Clock generation:
- Ring oscillator: five inverters followed by many buffers
[Figure: ring oscillator schematic; labels: output, trigger, NMOS]
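For a ring oscillator with an odd number N of inverters, the standard first-order estimate of the period is 2 * N * t_d, where t_d is the average delay per inverter. The sketch below applies that estimate to the five-inverter loop and the clock period of about 396.23 ps reported in the simulation results. It assumes the loop alone sets the period (the buffers sit outside the loop), and the function names are illustrative.

```python
def ring_oscillator_period_ps(n_stages: int, stage_delay_ps: float) -> float:
    """First-order period estimate: an edge must travel the loop twice per cycle."""
    if n_stages % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of inverting stages")
    return 2 * n_stages * stage_delay_ps


def implied_stage_delay_ps(period_ps: float, n_stages: int) -> float:
    """Average per-inverter delay implied by a measured period."""
    return period_ps / (2 * n_stages)


# Five inverters and a 396.23 ps period imply roughly 39.6 ps per inverter.
stage_delay = implied_stage_delay_ps(396.23, 5)
```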
The implementation of Han-Carlson Adder (cont.)

Clock distribution
[Figure: clock distribution; blocks: PG gen., S0 to S4, Sum gen., Latch; signals: stclk2, stclk3, Ø1, Ø2; outputs: Sum, Sum#, Correct]
The implementation of Han-Carlson Adder (cont.)

The whole view
[Figure: whole view of the ALU; inputs: A, B, Carry In; blocks: PG gen., CM0 to CM4 (path for P and G bits), path for Psum bit, M1 and M2 (foot transistor added), Sum gen.; outputs: Sum, Sum#, Correct]
The implementation of Han-Carlson Adder (cont.)

ALU PG/Partial Sum Circuit.
$P_{sum} = A \oplus B = A\bar{B} + \bar{A}B = (A + B)(\bar{A} + \bar{B}) = P\,\bar{G}$
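The chain of equalities for the partial sum can be confirmed with a four-row truth table. The sketch assumes P = A OR B and G = A AND B, which is what the last form requires; the function name is illustrative.

```python
from itertools import product


def psum_forms(a: int, b: int) -> tuple[int, int, int, int]:
    """Evaluate the four equivalent expressions for the partial-sum bit."""
    na, nb = a ^ 1, b ^ 1          # complements of the single-bit inputs
    p = a | b                      # propagate (OR form, assumed)
    g = a & b                      # generate
    return (
        a ^ b,                     # A xor B
        (a & nb) | (na & b),       # A*not(B) + not(A)*B
        (a | b) & (na | nb),       # (A + B)(not(A) + not(B))
        p & (g ^ 1),               # P * not(G)
    )


rows = [psum_forms(a, b) for a, b in product((0, 1), repeat=2)]
all_forms_agree = all(len(set(row)) == 1 for row in rows)
```

Because the partial sum falls out of P and G directly, the first stage can share logic between the three outputs.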
The implementation of Han-Carlson Adder (cont.)

Dynamic and Static Carry Merge Stage:
[Figure: carry-merge circuits for the even stage and the odd stage]

For $i = 0, 2, \ldots, 30$:
$P = P_i \cdot P_{i-1}$
$G = G_i + (P_i \cdot G_{i-1})$

For $i = 1, 3, \ldots, 31$, or where the carry at that bit is already acquired:
$P = P_i$
$G = G_i$
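A bit-level model shows how the carry-merge equations compose into a full adder. This is a behavioural sketch, not the circuit in these slides: it follows the textbook Han-Carlson ordering (odd positions merge first, even positions are completed in the last stage) and folds the carry-in into bit 0 up front, so its indexing may differ from the slide, which treats the carry-in as position -1. All names are illustrative.

```python
def merge(hi, lo):
    """Carry-merge cell: G = G_hi + P_hi * G_lo, P = P_hi * P_lo."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def han_carlson_add(a, b, cin=0, n=32):
    """Add two n-bit values through a Han-Carlson prefix network (n a power of two)."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]      # generate: A AND B
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]    # propagate: A OR B
    psum = [p[i] & (g[i] ^ 1) for i in range(n)]         # partial sum: P AND NOT G

    gp = list(zip(g, p))
    gp[0] = merge(gp[0], (cin, 0))                       # fold the carry-in into the LSB

    # First stage: every odd position merges with its even neighbour.
    for i in range(1, n, 2):
        gp[i] = merge(gp[i], gp[i - 1])

    # Middle stages: Kogge-Stone style merges among odd positions only.
    d = 2
    while d < n:
        nxt = list(gp)
        for i in range(d + 1, n, 2):
            nxt[i] = merge(gp[i], gp[i - d])
        gp = nxt
        d *= 2

    # Last stage: even positions pick up the finished prefix just below them.
    for i in range(2, n, 2):
        gp[i] = merge(gp[i], gp[i - 1])

    carries = [cin] + [gp[i][0] for i in range(n - 1)]   # carry into each bit
    total = 0
    for i in range(n):
        total |= (psum[i] ^ carries[i]) << i             # Sum = Psum XOR carry
    return total, gp[n - 1][0]                           # (sum, carry-out)
```

Subtraction A - B can be modelled as `han_carlson_add(a, ~b, cin=1)`, since the function masks its inputs to n bits.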

Dynamic and Static Carry Merge Stage (cont.):
- The carry-in of the LSB should be merged in order to perform subtraction.
- The generate and propagate bits of the MSB are passed to the last stage to produce the carry-out of the ALU (for the check bit).
The implementation of Han-Carlson Adder (cont.)

Even/Odd-bits CSG Sum Generation
Complementary signal generator (CSG) logic
The implementation of Han-Carlson Adder (cont.)

Even/Odd-bits CSG Sum Generation
- Use a latch to increase noise tolerance
[Figure: CSG sum-generation circuit; labels: Carry_bar, Carry]
Simulation Result

Try the worst-case pattern to test this design:
- A=0, B=-2, Carry-In=1 gives the worst-case delay.
- Why? From the structure of the circuit, the worst-case path is 3N-2P-2N-2P-2N-2P-3N (for the propagate bit).
Simulation Result (cont.)
0th stage: Carry-In=1
1st stage: g=0, p=0, Psum=0 (P/G/Psum, 3N)
2nd stage: g#=1, p#=1 (Static, 2P)
3rd stage: g=0, p=0 (Dynamic, 2N)
4th stage: g#=1, p#=1 (Static, 2P)
5th stage: g=0, p=0 (Dynamic, 2N)
6th stage: g#=1, p#=1 (Static, 2P)
7th stage: Cin31=0 (Dynamic, 3N)
The result should be “2”, Correct = 1
Simulation Result (cont.)
Simulation Result (cont.)

The result window
Simulation Result (cont.)



- Test whether the error flag is correct.
- 1st test pattern: A = $-2^{31}$, B = -1. The answer is $2^{31}-1$ (1’b0 + 31’b1), which is the wrong answer, so the Correct bit should equal 0 (tests the lower bound).
- Also check that the clock period is about 396.23 ps.
Simulation Result (cont.)
Simulation Result (cont.)

2nd test pattern: A = $2^{31}-1$, B = 2. The answer is $-2^{31}+1$ (1’b1 + 30’b0 + 1’b1, the wrong answer), so the Correct bit should equal 0 (tests the upper bound).
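The two overflow test patterns can be reproduced with a small model of 32-bit two's-complement addition. This is a sketch under the assumption that the Correct flag is 1 exactly when no signed overflow occurs; the function name and the flag semantics are illustrative, not taken from the schematic.

```python
def add32_with_correct_flag(a: int, b: int) -> tuple[int, int]:
    """Return (wrapped 32-bit signed sum, Correct flag) for signed inputs a and b."""
    raw = (a + b) & 0xFFFFFFFF                       # keep the low 32 bits
    wrapped = raw - (1 << 32) if raw & 0x80000000 else raw
    correct = 1 if wrapped == a + b else 0           # 0 signals signed overflow
    return wrapped, correct


# Lower bound: -2^31 + (-1) wraps around to 2^31 - 1.
lower = add32_with_correct_flag(-2**31, -1)

# Upper bound: (2^31 - 1) + 2 wraps around to -2^31 + 1.
upper = add32_with_correct_flag(2**31 - 1, 2)
```

Both patterns produce a wrapped result with the flag cleared, while an in-range sum such as 1 + 1 keeps the flag set.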
Simulation Result (cont.)
Discussion: P/G/Psum Block
[Figure: P circuit, G circuit, and Psum circuit; the version labelled "Mine" implements Psum = A xor B]
Discussion (cont.)

What might be the problem?
- Longer path to ground.
- During precharge, both the propagate and generate bits are “1”.
- What do we need to consider? If p=0 and g=0, this circuit may perform well. However, what if g goes from 1 to 0 while p=1?
Discussion (Cont.)
Discussion (cont.)

If the longest path is cut, then…
Mine
Discussion (Cont.)

Mine
Comparison between H-C adder and Ling Adder

Ling Adder:
- For an n-bit Ling adder combining r groups, the critical path has “$\log_r n - 1$” levels:
  - An r-to-1 reduction results in $\log_r n$ levels.
  - The “-1” comes from using the CLA expression rather than Ling’s expression for the last group, so one additional stage is saved.
  - The worst-case delay remains the second path from the last block.
- For each block, there are r+1 transistors connected in series.
- A carry-select block is used to generate the sum bits. Only 2 additional gate delays are needed.
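Comparison between H-C adder and Ling Adder (cont.)

Lookahead network: $T_d = (\log_r n - 1)(r + 1) + 2$
E.g. r=3, n=32: taking $\lceil \log_3 32 \rceil = 4$ levels, $T_d = 3 \cdot 4 + 2 = 14$
[Figure: Ling adder structure; labels: CLA expression, Group Generation, Carry-Select structure (MUX)]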
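The Ling delay estimate is easy to evaluate for other group sizes. The sketch below assumes the level count is the ceiling of log_r n, which is what reproduces the value 14 for r = 3 and n = 32; the function names are illustrative.

```python
def reduction_levels(n: int, r: int) -> int:
    """Smallest number of r-to-1 reduction levels that covers n bits (ceil of log_r n)."""
    levels, span = 0, 1
    while span < n:
        span *= r
        levels += 1
    return levels


def ling_delay(n: int, r: int) -> int:
    """Td = (levels - 1) * (r + 1) + 2, in series-transistor units as on the slide."""
    return (reduction_levels(n, r) - 1) * (r + 1) + 2


# Delay estimates for a 32-bit adder with group sizes 2, 3 and 4.
ling_32 = {r: ling_delay(32, r) for r in (2, 3, 4)}
```

The integer loop avoids floating-point logarithms, so the level count is exact for any n and r.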
Comparison between H-C adder and Ling Adder (cont.)

H-C Adder:
- P, G generation = 3
- Carry merge in each stage (including dynamic and static) = 2
- CSG Sum = 5
- $T_d = 2\log_2 n + 3$ (P, G generation) $+ 5$ (CSG Sum)
- E.g. n=32, $T_d = 18$
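The Han-Carlson estimate is a sum of three contributions, which the following sketch keeps as separate parameters so that each can be varied. The default values are the ones quoted on this slide; the function and parameter names are illustrative.

```python
def han_carlson_delay(n: int, pg_gen: int = 3, merge_per_stage: int = 2, csg_sum: int = 5) -> int:
    """Td = merge_per_stage * log2(n) + pg_gen + csg_sum, for n a power of two."""
    log2n = n.bit_length() - 1
    return merge_per_stage * log2n + pg_gen + csg_sum


# Delay estimates in series-transistor units for a few adder widths.
hc_delays = {n: han_carlson_delay(n) for n in (16, 32, 64)}
```

Each doubling of the width adds one merge term of 2 to the estimate, while the P/G generation and CSG sum terms stay fixed.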
Comparison between H-C adder and Ling Adder (cont.)

What are the pros and cons?

Ling Adder:
- Advantage: shorter worst-case path, so it might be faster in theory.
- Disadvantages:
  - Irregular layout, which wastes area.
  - Many complex gates, which implies charge-sharing problems.
  - Many inputs per stage lead to long wires, a delay problem at high frequency.
  - Carry-select logic makes the area bigger.
Comparison between H-C adder and Ling Adder (cont.)

Han-Carlson Adder:
- Disadvantage: longer path to the output.
- Advantages:
  - Regular layout for each stage.
  - Fewer inputs on each path, which eases the interconnection problem.
  - Simpler gates, which means fewer charge-sharing problems.
Future Work





- Power reduction by inserting sleep transistors
- Speed improvement by inserting discharge transistors at the intermediate stack nodes of the dynamic stages during the precharge phase
- Area reduction in layout
- SOI model test
- Self-resetting to minimize the clock period
Reference
- A 6.5GHz 130nm Single-Ended Dynamic ALU and Instruction Scheduler Loop, ISSCC 2002
- Sub-500-ps 64-b ALUs in 0.18-um SOI/Bulk CMOS: Design and Scaling Trends, JSSC, Nov. 2001
- Fast Area-Efficient VLSI Adders, Proc. 8th Symp. Computer Arithmetic, Sept. 1987
Reference (cont.)
- Computer Arithmetic, Algorithms and Hardware Design. Behrooz Parhami, Oxford University Press.
- Advanced Computer Arithmetic Design. Michael J. Flynn, et al. John Wiley & Sons, Inc.
- 5 GHz 32b Integer-Execution Core in 130nm Dual-Vt CMOS, ISSCC 2002
- Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability, JSSC, Aug. 1999