A Timing-Driven Synthesis Approach of a Fast Four-Stage Hybrid Adder in Sum-of-Products
Sabyasachi Das
University of Colorado, Boulder
Sunil P. Khatri
Texas A&M University
1
What is a Sum-of-Products (SOP)?

An arithmetic Sum-of-Products (SOP) block consists of an arbitrary number of product terms and sum terms.
General form of SOP:
z = (a1 * b1 * c1 * d1 * e1 * ...) + (a2 * b2 * c2 * d2 * e2 * ...) + ...
[Figure: example SOP dataflow with inputs a, b, c, d, e, f: p = a*b, q = c*d, z = p + q + e + f]
2
Examples of SOP Blocks

Multiplier
{assign z = a * b}
found in Microprocessors

Addition Tree
{assign z = a + b + c + d}
found in ALU, Wireless applications

Squarer
{assign z = a * a}
found in DSP processors

Multiply-Accumulator
{assign z = (a * b) + c}
found in Cryptographic Applications

Generalized SOP
{assign z = (a * b) + (c * d)}
found in FIR filters, IIR filters
3
Synthesis of Sum-of-Products

Synthesis of Sum-of-Products blocks is done in 3 steps (in the order of dataflow):

Creation of Partial Products

Reduction of Partial Products into 2 operands

Computation of Final Sum by adding the 2 operands
[Figure: dataflow. Inputs -> Creation of Partial Products -> Reduction of Partial Products -> Computation of Final Sum -> Output]
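
The three steps can be sketched in a few lines of Python for unsigned operands. This is only an illustration of the dataflow: the function names are hypothetical, and the carry-save reduction used in step 2 is an assumption, since the slides do not say which reduction scheme is used.

```python
def partial_products(a: int, b: int, width: int) -> list[int]:
    """Step 1: one row per bit of b; row i is (a << i) if bit i of b is set, else 0."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(width)]


def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """Bitwise 3:2 compression: x + y + z == s + c, with no carry propagation."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def reduce_to_two(rows: list[int]) -> tuple[int, int]:
    """Step 2: compress the rows until only 2 operands remain."""
    rows = list(rows)
    while len(rows) > 2:
        s, c = carry_save_add(rows.pop(), rows.pop(), rows.pop())
        rows.extend([s, c])
    rows.extend([0] * (2 - len(rows)))  # pad if fewer than 2 rows were given
    return rows[0], rows[1]


def sum_of_products(terms: list[tuple[int, int]], width: int) -> int:
    """z = sum of a*b over all (a, b) in terms, for unsigned inputs below 2**width."""
    rows = [row for a, b in terms for row in partial_products(a, b, width)]
    x, y = reduce_to_two(rows)
    return x + y  # Step 3: the final adder, which is the part this talk optimizes
```

For example, `sum_of_products([(3, 5), (7, 9)], 4)` evaluates (3 * 5) + (7 * 9). The final `x + y` is the only step that needs a carry-propagating adder.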
4
Motivation and Problem Statement

SOP blocks are widely used and computationally intensive

The final adder accounts for about 30% to 40% of the delay of an SOP block. This paper focuses on the synthesis of an efficient final adder for an SOP expression

Stand-alone adder architectures do not work well in SOP blocks
5
Stand-alone Adder Architectures

Frequently used adder architectures

Ripple-Carry
Area-efficient, but slow
Timing-efficient if inputs have skewed arrival times

Parallel-Prefix architecture (Brent-Kung, Kogge-Stone)
Faster architecture
Requires more area

Carry-Select
Large area overhead (often >100%)
Better delay if the Cin signal arrives late

None of these is well suited to Sum-of-Products blocks

Why?
6
Special Arrival-time Property

The 2 operands of the final adder in an SOP exhibit a peculiar arrival-time pattern

As a result, traditional monolithic adders, which are optimized for equal arrival times, do not work well in SOP blocks

Hence, hybrid adders are required, which exploit this arrival-time pattern

Hence it is critical to synthesize an efficient hybrid adder which is designed specifically for SOP blocks

[Figure: Arrival Times of input x and Arrival Times of input y. X-axis: Bit Number. Y-axis: Arrival Time.]
7
Proposed 4-Stage Hybrid Adder
[Figure: the 4-stage hybrid adder, from MSB to LSB: SubAdder4 (Carry-Select, w4 bits), SubAdder3 (Carry-Select, w3 bits), SubAdder2 (Kogge-Stone, w2 bits), SubAdder1 (Ripple-Carry, w1 bits)]

Ripple-Carry architecture near the LSB
Fast Kogge-Stone architecture in the middle
Two Carry-Select adders (based on Brent-Kung) near the MSB

Goal: find w1, w2, w3 and w4 algorithmically


8
Notations

We use the following notations:








The bit-width of SubAdder1 (Ripple) is w1 bits
The bit-width of SubAdder2 (Kogge-Stone) is w2 bits
The bit-width of SubAdder3 (Carry-Select, Brent-Kung) is w3 bits
The bit-width of SubAdder4 (Carry-Select, Brent-Kung) is w4 bits
w1 + w2 + w3 + w4 = n (total width of the hybrid adder)
T(ai) = Time when input signal ai is available
T(Si) = Time when output signal Si (Sumi) is available
T(Ci) = Time when output signal Ci (Carryi) is available
9
SubAdder1 (Ripple-Carry)
[Figure: ripple-carry chain of full adders (FA) with inputs x0 y0, x1 y1, x2 y2, ..., xk yk and outputs z0, z1, z2, ..., zk, zk+1]
Most area-efficient architecture
Very slow
Timing-efficient if the input arrival times are skewed. We use it for a few bits near the LSB (which arrive earliest)
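
Why a slow ripple chain can still be timing-efficient is easiest to see with a small arrival-time model. The sketch below assumes a single delay per full adder (`fa_delay`) and looks only at the carry path. Both are simplifications introduced here for illustration and do not come from the slides.

```python
def ripple_carry_arrival(t_x, t_y, fa_delay, t_cin=0):
    """Arrival time of each carry-out of a ripple-carry chain (LSB first).

    Assumed model: carry-out of bit i is ready fa_delay after the latest of
    x_i, y_i and the incoming carry.
    """
    t_c = t_cin
    carry_times = []
    for tx, ty in zip(t_x, t_y):
        t_c = max(tx, ty, t_c) + fa_delay
        carry_times.append(t_c)
    return carry_times


# All 8 input bits arrive at time 0: the chain itself costs 8 delay units.
equal = ripple_carry_arrival([0] * 8, [0] * 8, fa_delay=1)

# Bit i arrives at time i (skewed): each carry is ready just as the next bit arrives.
skewed = ripple_carry_arrival(list(range(8)), list(range(8)), fa_delay=1)
```

With equal arrivals the last carry appears 8 delay units after the inputs. With the skewed profile it appears only 1 unit after the last input arrives, so the ripple chain is effectively hidden behind the input arrival times.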
10
Parallel-Prefix Adders (KS, BK)

In a Parallel-Prefix adder, Carry for each bit is computed by an
efficient tree-structure (using the Generate and Propagate
concept).

For each bit i of the adder, Generate (Gi) indicates whether a carry is generated from that bit

Gi = ai AND bi

For each bit i of the adder, Propagate (Pi) indicates whether a carry is propagated through that bit

Pi = ai XOR bi
The Generate and Propagate concept is extendable to blocks
comprising multiple bits, as we discuss next
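
As a minimal sketch of the single-bit case, the code below assumes the common textbook definitions Gi = ai AND bi and Pi = ai XOR bi, and uses the recurrence C(i+1) = Gi OR (Pi AND Ci) to obtain the carries one bit at a time. The function names are hypothetical. A parallel-prefix adder computes the same carries with a tree instead of this loop.

```python
def generate_propagate(a_bits, b_bits):
    """Per-bit Generate and Propagate signals (bit lists are LSB first)."""
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    return g, p


def carries_from_gp(g, p, cin=0):
    """c[i] is the carry into bit i; c[n] is the carry out of the adder."""
    c = [cin]
    for gi, pi in zip(g, p):
        c.append(gi | (pi & c[-1]))
    return c


def add_with_gp(a, b, n):
    """Add two n-bit unsigned integers using only G, P and the carries."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    g, p = generate_propagate(a_bits, b_bits)
    c = carries_from_gp(g, p)
    s = [p[i] ^ c[i] for i in range(n)]  # sum bit = propagate XOR incoming carry
    return sum(bit << i for i, bit in enumerate(s)) + (c[n] << n)
```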
11
Parallel-Prefix Adders (KS, BK)

If two blocks (comprising one or more bits) have the GP value pairs (Gleft, Pleft) and (Gright, Pright), then the combined block has the GP values as follows:

Gleft,right = Gleft OR (Pleft AND Gright)
Pleft,right = Pleft AND Pright

The above computation is performed by a carry-operator or "o"-operator

Once we obtain the carry for each bit, it is trivial to compute the sum output of each bit (XOR and NAND)

[Figure: the "o"-operator combines the inputs (Gleft, Pleft) and (Gright, Pright) into the output (Gleft,right, Pleft,right)]
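
The operator is small enough to write out directly. The sketch below (with the hypothetical name `o`) also checks that it is associative, which is the property that lets a prefix adder group the bits into a tree in any order.

```python
from functools import reduce
from itertools import product


def o(left, right):
    """Combine the GP pair of a more significant block with a less significant one."""
    g_left, p_left = left
    g_right, p_right = right
    return (g_left | (p_left & g_right), p_left & p_right)


# Exhaustive associativity check over all 1-bit (G, P) pairs.
pairs = list(product((0, 1), repeat=2))
is_associative = all(
    o(o(x, y), z) == o(x, o(y, z)) for x in pairs for y in pairs for z in pairs
)

# GP pairs of bits 2, 1, 0 (MSB first) for a = 0b111, b = 0b001:
# the block generates a carry because bit 0 generates and bits 1 and 2 propagate.
block = reduce(o, [(0, 1), (0, 1), (1, 0)])
```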
12
SubAdder2 (Kogge-Stone)
[Figure: Kogge-Stone prefix network for 8 bits, with inputs GP7 ... GP0 and carry outputs C8 ... C1]

Kogge-Stone Parallel prefix architecture

Delay: log2(n) levels of "o"-operator
Area: (n*log2(n))-n+1 "o"-operators

Kogge and Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations", IEEE Transactions on Computers, 1973
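
A behavioural sketch of the Kogge-Stone network makes the delay and area figures concrete. It is a software model written for illustration, not the authors' implementation. It assumes n is a power of two, a carry-in of 0, and Pi = ai XOR bi, and it counts levels and "o"-operators as it goes.

```python
def o(left, right):
    """Carry operator on (G, P) pairs; left is the more significant block."""
    g_left, p_left = left
    g_right, p_right = right
    return (g_left | (p_left & g_right), p_left & p_right)


def kogge_stone_prefix(gp):
    """gp[i] = (Gi, Pi), LSB first. Returns (prefix, levels, operators).

    prefix[i] is the combined GP pair of bits i..0, so prefix[i][0] is C(i+1).
    """
    n = len(gp)
    cur = list(gp)
    levels = operators = 0
    distance = 1
    while distance < n:
        nxt = list(cur)
        for i in range(distance, n):
            nxt[i] = o(cur[i], cur[i - distance])
            operators += 1
        cur = nxt
        levels += 1
        distance *= 2
    return cur, levels, operators


def kogge_stone_carries(a, b, n):
    """Carries C1..Cn of a + b for n-bit unsigned a and b (carry-in assumed 0)."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    prefix, levels, operators = kogge_stone_prefix(gp)
    return [g for g, _ in prefix], levels, operators
```

For n = 8 the model reports 3 levels and 17 operators, matching log2(n) and (n*log2(n))-n+1.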
13
Brent-Kung (BK)
[Figure: Brent-Kung prefix network for 8 bits, with inputs GP7 ... GP0 and carry outputs C8 ... C1]

Brent-Kung Parallel prefix architecture

Delay: (2*log2(n))-2 levels of "o"-operator
Area: (2*n)-2-log2(n) "o"-operators

Brent and Kung, "A regular layout for parallel adders", IEEE Transactions on Computers, 1982
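
To see the trade-off between the two prefix architectures, the delay and area formulas can be tabulated side by side. The sketch below simply evaluates the formulas as stated on the Kogge-Stone and Brent-Kung slides. The function names are hypothetical, and n is assumed to be a power of two.

```python
def log2_exact(n):
    """log2 of n, assuming n is a power of two."""
    return n.bit_length() - 1


def kogge_stone_cost(n):
    """(levels, operators) from the Kogge-Stone slide."""
    k = log2_exact(n)
    return k, n * k - n + 1


def brent_kung_cost(n):
    """(levels, operators) from the Brent-Kung slide."""
    k = log2_exact(n)
    return 2 * k - 2, 2 * n - 2 - k


for n in (8, 16, 32):
    print(n, "KS:", kogge_stone_cost(n), "BK:", brent_kung_cost(n))
```

At n = 16, for instance, Kogge-Stone needs 4 levels and 49 operators, while Brent-Kung needs 6 levels and 26 operators: fewer operators at the cost of more levels.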
14
SubAdder3 & SubAdder4 (Carry-Select)
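[Figure: Carry-Select adder. Adder1 adds x and y with carry-in 1'b1 to produce z1; Adder0 adds x and y with carry-in 1'b0 to produce z0; a Mux controlled by cin selects z from z1 and z0]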
Large area overhead
Used as a special case, since Cin arrives late
Speed depends on the architecture of the two adders

These adders need not be KS (we use BK instead), because the arrival times of the inputs of SubAdder3 and SubAdder4 are earlier than those for SubAdder2
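
A behavioural model shows why a late Cin matters so little here. Both candidate sums are computed without waiting for the carry, and the real carry only drives the final mux. The names and the simple timing model (`mux_delay`, a single completion time for both adders) are assumptions made for this sketch.

```python
def carry_select_add(x, y, cin, width):
    """Return (sum, carry_out) of x + y + cin for width-bit unsigned operands."""
    mask = (1 << width) - 1
    z0 = (x & mask) + (y & mask)      # Adder0: carry-in tied to 1'b0
    z1 = (x & mask) + (y & mask) + 1  # Adder1: carry-in tied to 1'b1
    z = z1 if cin else z0             # Mux: the real cin only selects
    return z & mask, z >> width


def carry_select_arrival(t_adders_done, t_cin, mux_delay):
    """Output arrival time under the assumed model: a late cin costs one mux delay."""
    return max(t_adders_done, t_cin) + mux_delay
```

For example, if both adders finish at time 5 and Cin arrives at time 9, `carry_select_arrival(5, 9, 1)` gives 10: the output is ready one mux delay after Cin.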
15
Determination of width of SubAdder1

Width of the Ripple adder (SubAdder1)
At every bit (i), compute T(Ci+1) and check if
T(Ci+1) ≤ T(ai+1)
T(Ci+1) ≤ T(bi+1)
If check passes, i = i+1
Else continue checking until 3 consecutive bits fail the check (Hill Climbing)
Return the value i as the Ripple Adder width
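
One possible reading of this procedure is sketched below. Several details are assumptions rather than statements from the slides: the carry arrival model (one `fa_delay` per full adder), a carry-in that is available at time 0, and the choice to return the number of bits up to the last passing check.

```python
def ripple_width(t_a, t_b, fa_delay, max_fails=3):
    """Estimate w1 from per-bit input arrival times (LSB first).

    Bit i passes if the carry into bit i+1 is ready no later than both
    inputs of bit i+1. The scan tolerates isolated failures and stops
    after max_fails consecutive failing bits.
    """
    n = len(t_a)
    t_c = 0    # assumed arrival time of the carry into bit 0
    width = 0  # bits 0..width-1 go into the ripple adder
    fails = 0
    for i in range(n - 1):
        t_c = max(t_a[i], t_b[i], t_c) + fa_delay  # T(C(i+1))
        if t_c <= t_a[i + 1] and t_c <= t_b[i + 1]:
            width = i + 1
            fails = 0
        else:
            fails += 1
            if fails == max_fails:
                break
    return width


# Arrival times rise steeply for the low bits, then flatten out.
arrivals = [0, 2, 4, 6, 8, 9, 9, 9, 9, 9, 9, 9]
w1 = ripple_width(arrivals, arrivals, fa_delay=2)
```

In this example the carry keeps up with the inputs for the first four bits and then falls behind for three consecutive bits, so the sketch returns a width of 4.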
16
Determination of width of SubAdder2

Width of Kogge-Stone Adder (SubAdder2)
The latest arriving signals are part of this adder
Hence keep this adder wide, while ensuring that this does not result in a very narrow Carry-Select adder for SubAdder3 and SubAdder4
We determine the width with the following equation:
w2 = n - w1, if (n-w1) ≤ 8
w2 = 2^p, where p = floor(log2(n-w1)), if (n-w1) > 8
Example: If n=32 and w1=7 then w2=16
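
The rule is short enough to check directly. The sketch below assumes that p is log2(n-w1) rounded down, which is what the slide's own example (n=32, w1=7, w2=16) implies. The function name is hypothetical.

```python
def kogge_stone_width(n, w1):
    """Width w2 of SubAdder2, given total width n and ripple width w1."""
    remaining = n - w1
    if remaining <= 8:
        return remaining              # small remainder: Kogge-Stone takes it all
    p = remaining.bit_length() - 1    # floor(log2(remaining)) for positive integers
    return 1 << p                     # largest power of two not above remaining
```

With n=32 and w1=7 the remainder is 25 bits, p is 4, and the function returns 16, leaving 9 bits to be split between SubAdder3 and SubAdder4.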
17
Delay of the Hybrid Adder
[Figure: the 4-stage hybrid adder annotated with output arrival times: T(C4) and T(S4) from SubAdder4, T(S3) from SubAdder3, T(S2) from SubAdder2]
Thybrid = max (T(C4), T(S4), T(S3), T(S2))
18
Determination of widths of
SubAdder3 and SubAdder4

Width of the two Carry-Select adders

Initial width configuration
w3 = (n-w1-w2)/2
w4 = (n-w1-w2-w3)

With this initial configuration, estimate the delay of the overall hybrid adder (based on the previous slide)
Use an iterative approach to explore in the appropriate direction (similar to Binary Search) and converge on the smallest-delay configuration
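
The slides do not spell out the iteration, so the following is only one way such a search could look. It starts from the even split, tries moving the boundary between SubAdder3 and SubAdder4 by a step, keeps any move that lowers the estimated delay, and halves the step when neither direction helps. The `delay_of` callback is a hypothetical stand-in for the hybrid-adder delay estimate, and the toy delay model in the example is invented purely for illustration.

```python
def split_carry_select(remaining, delay_of):
    """Choose (w3, w4) with w3 + w4 == remaining, where remaining = n - w1 - w2.

    delay_of(w3, w4) must return the estimated delay of the whole hybrid adder.
    Returns (w3, w4, delay). Finds a local minimum; assumes a well-behaved estimate.
    """
    w3 = remaining // 2                  # initial configuration from the slide
    best = delay_of(w3, remaining - w3)
    step = max(1, remaining // 4)
    while step >= 1:
        moved = False
        for candidate in (w3 - step, w3 + step):
            if 0 <= candidate <= remaining:
                d = delay_of(candidate, remaining - candidate)
                if d < best:
                    w3, best, moved = candidate, d, True
                    break
        if not moved:
            step //= 2                   # no improvement: refine the step
    return w3, remaining - w3, best


# Toy delay model (an assumption, not measured data): each stage has a fixed
# start time plus a per-bit cost, and the slower stage sets the overall delay.
def toy_delay(w3, w4):
    return max(5 + w3, 1 + 3 * w4)


result = split_carry_select(9, toy_delay)
```

With this toy model the search moves from the initial split (4, 5) to (6, 3), where the two stages are best balanced.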

19
Experimental Setup

To test our approach, we used:
Adders in several different types of SOP blocks (Multipliers, MAC, generalized SOP and Squarer)
Two process technologies (0.13 µm and 0.09 µm)
Two commercial library vendors
Two different arrival time constraints

We compared the results of our hybrid adder with the adder produced by a commercial datapath synthesis tool.
20
Results
[Figure: bar chart of Worst-case Delay (ps) against Name of the Adder, for Adder-75, Adder-35, Adder-68, Adder-57, Adder-47, Adder-61 and Adder-89. Series: Delay of the Adder Produced by Commercial Tool; Delay of Our Proposed Adder]

On average, our adder is 14.31% faster than the result of the commercial synthesis tool (with a 6.62% area penalty)
21
Summary






Hybrid adder consists of 4 SubAdders
SubAdder1 has Ripple-Carry architecture
SubAdder2 has Kogge-Stone architecture
SubAdder3 and SubAdder4 have Carry-Select (based on Brent-Kung) architecture
Widths of all SubAdders are computed based on a timing-driven analysis
On average, 14.31% faster (with a 6.62% area penalty)
22
Thank you
23
