ECE 465 High Level Design Strategies Lecture Notes # 9 Shantanu Dutt Electrical & Computer Engineering University of Illinois at Chicago.


ECE 465
High Level Design Strategies
Lecture Notes # 9
Shantanu Dutt
Electrical & Computer Engineering
University of Illinois at Chicago
Outline
• Circuit Design Problem
• Solution Approaches:
– Truth Table (TT) vs. Computational/Algorithmic –
Yes, hardware, just like software can implement any
algorithm!
– Flat vs. Divide-&-Conquer
– Divide-&-Conquer:
• Associative operations/functions
• General operations/functions
– Other Design Strategies for fast circuits:
• Speculative computation
• Best of both worlds (best average and best worst-case)
• Pipelining
• Summary
Circuit Design Problem
• Design an 8-bit comparator that compares two 8-bit #s available in
two registers A[7..0] and B[7..0], and that o/ps F = 1 if A > B and F =
0 if A <= B.
• Approach 1: The TT approach -- write down a 16-input TT (2^16 rows), derive a logic
expression from it, minimize it, obtain a gate-based realization, etc.!
A        B        F
00000000 00000000 0
00000000 00000001 0
...      ...      ...
00000001 00000000 1
...      ...      ...
11111111 11111111 0

– Too cumbersome and time-consuming
– Fraught with the possibility of human error
– Difficult to formally prove correctness (i.e., proof w/o exhaustive testing)
– Will generally have high hardware cost (including wiring) and delay
Circuit Design Problem (contd)
• Approach 2: Think computationally/algorithmically about
what the ckt is supposed to compute:
• Approach 2(a): Flat computational/programming
approach:
– Note: A TT can be expressed as a sequence of “if-then-else’s”
– If A = 00000000 and B = 00000000 then F = 0
else if A = 00000000 and B = 00000001 then F=0
……….
else if A = 00000001 and B = 00000000 then F=1
……….
– Essentially a re-hashing of the TT – same problems as the TT
approach
Circuit Design Problem: Strategy 1: Divide-&-Conquer
• Approach 2(b): Structured algorithmic approach:
– Be more innovative, think of the structure/properties of the problem that can be used
to solve it in a hierarchical or divide-&-conquer (D&C) manner:
[Figure: D&C tree — root problem A is broken into Subprob. A1 and Subprob. A2 (and these recursively into A1,1, A1,2, A2,1, A2,2), done recursively until the subprob size is s.t. a TT-based design is doable; the solns to A1 and A2 are stitched up to form the complete soln to A. Legend: D&C breakup arrows; data/signal flow to solve a higher-level problem; possible data flow betw. sub-problems (data dependency?).]
– D&C approach: See if the problem can be:
• “broken up” into 2 or more smaller subproblems: two types of breaks possible
 by # of operands: partition the set of n operands into 2 or more subsets of operands
 by operand size: break a constant # of n-bit operands into smaller-size operands
(this mainly applies when the # of operands is a constant, e.g., add. of 2 #s)
• whose solns can be “stitched-up” (stitch-up function) to give a soln. to the parent prob
• also, consider if there is dependency between the sub-probs (results of some required
to solve the other(s))
– Do this recursively for each large subprob until subprobs are small enough (the leaf problem)
for TT solutions
– If the subproblems are of a similar kind (but of smaller size) to the root prob. then the
breakup and stitching will also be similar, but if not, they have to be broken up differently
Circuit Design Problem: Strategy 1: Divide-&-Conquer
• Especially for D&C breakups in which: a) the subproblems are the same problem type as
the root problem, and b) there is no data dependency between subproblems, the final
circuit will be a “tree”of stitch-up functions (of either the same size or different sizes at
different levels—this depends on the problem being solved) with leaf functions at the
bottom of the tree, as shown in the figure below for a 2-way breakup of each
problem/sub-problem.
[Figure: a tree of stitch-up functions with leaf functions at the bottom of the tree.]
Shift Gears: Design of a Parity Detection Circuit—A Series of XORs
[Figure (a): a linearly-connected circuit — a chain of XORs over x(0), x(1), x(2), x(3), …, x(15) producing f.]
• No concurrency in design (a)---the actual problem has available concurrency, though, and it is not
exploited well in the above “linear” design
• Complete sequentialization leading to a delay that is linear in the # of bits n (delay = (n-1)*td), td = delay of 1 gate
• All the available concurrency is exploited in design (b)---a parity tree (see next slide).
• Question: When can we have a circuit for an operation/function on multiple operands built of “gates”
performing the same operation for fewer (generally a small number betw. 2-5) operands?
• Answer:
(1) It should be possible to break down the n-operand function into multiple operations w/ fewer
operands.
(2) When the operation is associative. An oper. “x” is said to be associative if:
a x b x c = (a x b) x c = a x (b x c).
• This implies that, for example, if we have 4 operations a x b x c x d, we can either perform this as:
– a x (b x (c x d)) [getting a linear delay of 3 units or in general n-1 units for n operands]
– or as (a x b) x (c x d) [getting a logarithmic (base 2) delay of 2 units and exploiting the available
concurrency due to the fact that “x” is associative].
• Is XOR associative?
• The parenthesisation corresp. to the above ckt is:
– (…((((x(0) xor x(1)) xor x(2)) xor x(3)) xor …) xor x(15))
Shift Gears: Design of a Parity Detection Circuit—A Series of XORs
• if we have 4 operations a x b x c x d, we can
either perform this as a x (b x (c x d)) [getting a
linear delay of 3 units] or as (a x b) x (c x d)
[getting a logarithmic (base 2) delay of 2 units
and exploiting the available concurrency due to
the fact that “x” is associative].
• We can extend this idea to n operands (and
n-1 operations) to perform as many of the
pairwise operations as possible in parallel (and
do this recursively for every level of remaining
operations), similar to design (b) for the parity
detector [xor is an associative operation!] and
thus get a (log2 n) delay.
• In fact, any parenthesisation of operands is
correct for an associative operation/function,
but the above one is fastest. Surprisingly, any
parenthesisation leads to the same h/w cost: n-1 2-i/p gates, i.e., 2(n-1) gate i/ps. Why? Analyze.
[Figure (b): 16-bit parity tree over x(15) x(14) … x(1) x(0) — internal wires labeled w(level, index), from w(3,7) … w(3,0) at the first XOR level down to w(0,0) = f at the root. Delay = (# of levels in the tree) * td = log2(n) * td.]
An example of simple designer ingenuity: a bad design would have resulted in a linear delay, an ingenious (simple enough though) & well-informed design results in a log delay, and both have the same gate-i/p cost.
Parenthesization of tree-circuit: (((x(15) xor x(14)) xor (x(13) xor x(12))) xor ((x(11) xor x(10)) xor
(x(9) xor x(8)))) xor (((x(7) xor x(6)) xor (x(5) xor x(4))) xor ((x(3) xor x(2)) xor (x(1) xor x(0))))
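The two parenthesisations can be checked with a small Python sketch (my own model, not part of the original slides): XOR's associativity means both reduce orders give the same parity, but the balanced tree has only ceil(log2 n) 2-i/p gate levels on its critical path, vs. n-1 for the chain.

```python
def xor_tree(bits):
    """Balanced-tree XOR reduce: returns (parity, depth), where depth =
    # of 2-i/p XOR levels on the critical path (= ceil(log2 n))."""
    if len(bits) == 1:
        return bits[0], 0
    mid = len(bits) // 2
    left, dl = xor_tree(bits[:mid])
    right, dr = xor_tree(bits[mid:])
    return left ^ right, 1 + max(dl, dr)

def xor_chain(bits):
    """Linearly-connected XOR chain: same parity, n-1 gate levels."""
    parity, depth = bits[0], 0
    for b in bits[1:]:
        parity, depth = parity ^ b, depth + 1
    return parity, depth

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]  # x(0)..x(15)
p_tree, d_tree = xor_tree(bits)      # tree depth: log2(16) = 4 levels
p_chain, d_chain = xor_chain(bits)   # chain depth: n - 1 = 15 levels
assert p_tree == p_chain             # any parenthesisation gives the same f
```

Both designs use the same n-1 2-i/p gates; only the depth differs.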
D&C for Associative Operations
• Let f(xn-1, ….., x0) be an associative function.
• What is the D&C principle involved in the design of an n-bit xor/parity
function? Can it also lead automatically to a tree-based ckt?
[Figure: f(x_{n-1}, …, x_0) = f(a, b), where a = f(x_{n-1}, …, x_{n/2}) and b = f(x_{n/2-1}, …, x_0); the stitch-up function is the same as the original function, on 2 inputs.]
• Using the D&C approach for an associative operation results in a breakup by #
of operands and the stitch up function being the same as the original function
(this is not the case for non-assoc. operations), but w/ a constant # of operands
(2, if the original problem is broken into 2 subproblems)
• Also, there are no dependencies between sub-problems
• If the two sub-problems of the D&C approach are balanced (of the same size or
as close to it as possible), then unfolding the D&C results in a balanced operation
tree of the type for the xor/parity function seen earlier of (log n) delay
D&C for Associative Operations (cont’d)
• Parity detector example
[Figure: 16-bit parity w(0,0) = f, broken up by operands into two 8-bit parities (w(1,1), w(1,0)), with the stitch-up function = 2-bit parity/xor, down to the leaf xors on x(15) x(14) … x(1) x(0). Delay = (# of levels in the tree) * td = log2(n) * td.]
D&C Approach for Non-Associative Opers: n-bit > Comparator
• O/P = 1 if A > B, else 0
• Is this associative? Not clear for a breakup by bits of the 2 operands. The issue of associativity mainly applies
to n operands, not to the n bits of 2 operands
• For a non-associative func, determine its properties that allow determining a break-up & a
correct stitch-up function
• Useful property: At any level, the comp. of the MS (most significant) half determines the o/p if its result is > or <;
else the comp. of the LS ½ determines the o/p
• Can thus break up problem at any level into MS ½ and LS ½ comparisons & based on their results
determine which o/p to choose for the higher-level (parent) result
• No sub-prob. dependency
[Figure: D&C tree for the 8-bit comparator (breakup by size/bits). Root A: Comp. A[7..0], B[7..0] → A1: Comp A[7..4], B[7..4] and A2: Comp A[3..0], B[3..0]; A1 → A1,1: Comp A[7..6], B[7..6] and A1,2: Comp A[5..4], B[5..4]; A1,1 → A1,1,1: Comp A[7], B[7] and A1,1,2: Comp A[6], B[6] (small enough to be designed using a TT). Stitch-up at each level (e.g., if the A1 reslt is > or <, take the A1 reslt, else take the A2 reslt) forms the complete soln to A.]
D&C Approach for Non-Associative Opers: n-bit > Comparator (cont’d)
[Figure: the same D&C tree as on the previous slide, now annotated with the leaf TT below.]
The TT may be derived directly or by first thinking of and expressing its
computation in a high-level programming language and then converting it to a TT.
The leaf (1-bit operand, 2-o/p) comparator — small enough to be designed using a TT:

If A[i] = B[i] then { f1(i) = 0; f2(i) = 1 } /* f2(i) is an i/p to the stitch logic; f2(i) = 1 means the f1( ), f2( ) o/ps of the parent should be those of the LS ½ of this subtree, i.e., selected by the stitch logic as its o/ps */
else if A[i] < B[i] then { f1(i) = 0; /* indicates < */
f2(i) = 0 } /* indicates f1(i), f2(i) o/ps should be selected by the stitch logic as its o/ps */
else if A[i] > B[i] then { f1(i) = 1; /* indicates > */
f2(i) = 0 } /* indicates f1(i), f2(i) o/ps should be selected by the stitch logic as its o/ps */

A[i] B[i] | f1(i) f2(i)
 0    0   |  0     1
 0    1   |  0     0
 1    0   |  1     0
 1    1   |  0     1
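The leaf TT and the stitch-up rule can be modeled together in Python (a sketch of the slides' logic checked against integer comparison; the function names are mine):

```python
def leaf(a_bit, b_bit):
    """1-bit comparator leaf: (f1, f2) per the TT above.
    f1 = 1 iff a_bit > b_bit; f2 = 1 iff equal (defer to the LS half)."""
    if a_bit == b_bit:
        return (0, 1)
    return (1, 0) if a_bit > b_bit else (0, 0)

def stitch(ms, ls):
    """If the MS-half result is decided (f2 = 0) take it, else take the
    LS-half result -- i.e., a 2-bit 2:1 mux selected by the MS f2."""
    _, f2_ms = ms
    return ms if f2_ms == 0 else ls

def compare(a_bits, b_bits):
    """D&C (>) comparator over equal-length MSB-first bit lists."""
    if len(a_bits) == 1:
        return leaf(a_bits[0], b_bits[0])
    mid = len(a_bits) // 2
    return stitch(compare(a_bits[:mid], b_bits[:mid]),
                  compare(a_bits[mid:], b_bits[mid:]))

def gt(a, b, n=8):
    """F = 1 iff A > B for n-bit A, B."""
    to_bits = lambda v: [(v >> i) & 1 for i in range(n - 1, -1, -1)]
    f1, f2 = compare(to_bits(a), to_bits(b))
    return f1

assert gt(185, 39) == 1 and gt(39, 185) == 0 and gt(7, 7) == 0
```

An exhaustive check of all 8-bit pairs against Python's `>` confirms the breakup and stitch-up are correct.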
Comparator Circuit Design Using D&C (contd.)
• Once the D&C tree is formulated
it is easy to get the low-level &
stitch-up designs
• Stitch-up design shown here
[Figure: the D&C tree again, with the stitch-up design shown at each internal node — e.g., if the A1,1 reslt is > or <, take the A1,1 reslt, else take the A1,2 reslt.]
Stitch-up logic details:
If f2(i) = 0 then { my_op1 = f1(i);
my_op2 = f2(i) } /* select MS ½ comp o/ps */
else /* select LS ½ comp. o/ps */
{ my_op1 = f1(i-1); my_op2 = f2(i-1) }

Compact TT:
f1(i) f2(i) f1(i-1) f2(i-1) | my_op1 my_op2
 X     0     X       X      |  f1(i)   f2(i)
 X     1     X       X      |  f1(i-1) f2(i-1)

[Figure: direct design of the stitch-up logic — a 2-bit 2:1 Mux with select f2(i); I0 = f(i) = f1(i), f2(i); I1 = f(i-1) = f1(i-1), f2(i-1); o/p my_op = my_op1, my_op2.]
Comparator Circuit Design Using D&C – Final Design
• H/W_cost(8-bit comp.) = 7(H/W_cost(2:1 Mux)) + 8(H/W_cost(2-bit comp.))
• H/W_cost(n-bit comp.) = (n-1)(H/W_cost(2:1 Mux)) + n(H/W_cost(2-bit comp.))
• Delay(8-bit comp.) = 3(delay of 2:1 Mux) + delay of 2-bit comp.
• Delay(n-bit comp.) = (log n)(delay of 2:1 Mux) + delay of 2-bit comp.
• Note parallelism at work – multiple logic blocks are processing simultaneously
[Figure: the final 8-bit comparator design — 8 1-bit comparators at the bottom (i/ps A[7] B[7] … A[0] B[0], o/ps f(7) … f(0)) feeding log n levels of 2-bit 2:1 Muxes (o/ps my(5), my(4), …, my(0)), with the final o/p F = my1(6) from a 1-bit 2:1 Mux at the root.]
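The cost and delay formulas can be checked with a quick Python sketch (slide formulas; the function names are mine):

```python
import math

def comparator_cost(n):
    """n-bit D&C comparator cost (n a power of 2): (n-1) 2-bit 2:1 muxes
    in the stitch-up tree + n leaf (1-bit) comparators."""
    return {"muxes": n - 1, "leaf_comparators": n}

def comparator_delay(n, t_mux=1, t_leaf=1):
    """Delay = log2(n) mux levels after one leaf-comparator delay."""
    return int(math.log2(n)) * t_mux + t_leaf

assert comparator_cost(8) == {"muxes": 7, "leaf_comparators": 8}
assert comparator_delay(8) == 4   # 3 mux delays + 1 leaf-comparator delay
```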
D&C: Mux Design
• Two sets of operands: data operands (2^n of them) and the control/select operand (n bits)
• Breakup by operands (data), with a simultaneous breakup by bits (select)
[Figure (a): Top-Down design (D&C) — a 2^n:1 MUX (data i/ps I0 … I_{2^n-1}, selects S_{n-1} … S0) is broken into two 2^{n-1}:1 MUXes, one for the i/ps whose msb select value is a constant 0 and one for those whose msb value is a constant 1 (all bits except the msb have different combinations within each group; the msb value differs between the 2 groups); their o/ps are stitched up by a 2:1 Mux selected by S_{n-1}.]
Opening up the 8:1 MUX's hierarchical design and a top-down view:
[Figure: an 8:1 MUX (i/ps I0 … I7, selects S2 S1 S0, o/p Z) opened up into 2:1 Muxes — e.g., the i/p pair (I2, I6) is selected when S0 = 0, S1 = 1, and the two i/ps differ in S2; i/p I6 is selected when S0 = 0, S1 = 1, S2 = 1. Two 4:1 Muxes (one per msb group) feed a final 2:1 MUX selected by S2.]
Top-Down vs Bottom-Up: Mux Design
[Figure (b): Bottom-Up ("Divide-and-Accumulate") — 2^{n-1} 2:1 MUXes, each selected by S0 over an i/p pair (I0, I1), (I2, I3), …, (I_{2^n-2}, I_{2^n-1}), feeding a 2^{n-1}:1 MUX selected by S_{n-1} … S1.]
• Generally better to try top-down (D&C) first
An 8:1 MUX example (bottom-up):
[Figure: an 8:1 MUX (i/ps I0 … I7, selects S2 S1 S0, o/p Z) built bottom-up — 4 2:1 Muxes selected by S0 over the pairs (I0, I1), (I2, I3), (I4, I5), (I6, I7) feed a 4:1 Mux selected by S2 S1. The i/ps within each pair should have different lsb (S0) values, since their selection is based on S0 (all other remaining, i.e., unselected, bit values should be the same); the even i/ps are selected when S0 = 0, the odd i/ps when S0 = 1.]
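The top-down D&C view of the mux translates directly into a recursive sketch (a Python model of mine, not from the slides): split the data operands in half on the msb select bit, recurse, and stitch up with one 2:1 mux.

```python
def mux2(i0, i1, s):
    """2:1 mux primitive."""
    return i1 if s else i0

def mux_topdown(inputs, selects):
    """2^n:1 mux by D&C; selects is MSB-first [S_{n-1}, ..., S0]."""
    if len(inputs) == 1:
        return inputs[0]
    half = len(inputs) // 2
    msb, rest = selects[0], selects[1:]
    return mux2(mux_topdown(inputs[:half], rest),   # msb = 0 group
                mux_topdown(inputs[half:], rest),   # msb = 1 group
                msb)

data = list(range(8))                      # I0 .. I7
assert mux_topdown(data, [1, 1, 0]) == 6   # S2 S1 S0 = 110 selects I6
```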
Multiplier D&C
• AXB: n-bit mult — breakup by bits (operand size) into 4 (n/2)-bit mults: AhXBh (o/p W), AhXBl (X), AlXBh (Y), AlXBl (Z)
• Stitch-up: align and add = 2^n*W + 2^{n/2}*X + 2^{n/2}*Y + Z
• Multiplication D&C idea:
A x B = (2^{n/2}*Ah + Al)(2^{n/2}*Bh + Bl), where Ah is the higher n/2 bits of A, and Al the lower n/2 bits
= 2^n*Ah*Bh + 2^{n/2}*Ah*Bl + 2^{n/2}*Al*Bh + Al*Bl = PH + PM1 + PM2 + PL
• Stitch-Up Design 1 (inefficient): add the 4 aligned partial products PH, PM1, PM2, PL with 3 2n-bit adders. Cost = 3 2n-bit adders = 6n FAs (full adders) for RCAs (ripple-carry adders).
• Example: 10111001 = 185
X 00100111 = 39
= 0001110000101111 = 7215
 D&C breakup: (10111001) X (00100111) = (2^4(1011) + 1001) X (2^4(0010) + 0111)
= 2^8(1011 X 0010) + 2^4(1011 X 0111 + 1001 X 0010) + 1001 X 0111
= 2^8(00010110) + 2^4(01001101 + 00010010) + 00111111
 = bbbbbbbb00111111 = PL
+ bbbb01001101bbbb = PM1
+ bbbb00010010bbbb = PM2
+ 00010110bbbbbbbb = PH
_____________________
 0001110000101111 = 7215
• Critical path: Delay (using RCAs) = (a) too high-level an analysis: 2*((2n)-bit adder delay) = 4n*(FA delay); (b) more exact, considering the overall critical path (bit i feeding bit i+1): 2n+1 FA delays
• What is the delay of the n-bit multiplier using such a stitch-up (# 1)?
• Delay for adding 3 numbers X, Y, Z using two RCAs?
Ans: (n+1) FA delay units or 2(n+1) 2-i/p gate delay units
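The breakup and align-and-add stitch-up can be sketched recursively in Python (my own model of the arithmetic, not of the adder structure; `+` stands in for the adders):

```python
def dc_mult(a, b, n):
    """n-bit multiply by D&C (n a power of 2): 4 (n/2)-bit multiplies,
    stitched up by align-and-add."""
    if n == 1:
        return a & b                      # 1-bit multiplier = 2-i/p AND
    h = n // 2
    ah, al = a >> h, a & ((1 << h) - 1)   # higher / lower n/2 bits of A
    bh, bl = b >> h, b & ((1 << h) - 1)
    ph  = dc_mult(ah, bh, h)              # PH
    pm1 = dc_mult(ah, bl, h)              # PM1
    pm2 = dc_mult(al, bh, h)              # PM2
    pl  = dc_mult(al, bl, h)              # PL
    # stitch-up: 2^n*PH + 2^(n/2)*(PM1 + PM2) + PL
    return (ph << n) + ((pm1 + pm2) << h) + pl

assert dc_mult(0b10111001, 0b00100111, 8) == 7215   # 185 x 39
```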
Multiplier D&C (cont'd)
• Ex: 10111001 = 185
X 00100111 = 39
= 0001110000101111 = 7215
 D&C breakup: (10111001) X (00100111) = (2^4(1011) + 1001) X (2^4(0010) + 0111)
= 2^8(1011 X 0010) + 2^4(1011 X 0111 + 1001 X 0010) + 1001 X 0111
= 2^8(00010110) + 2^4(01001101 + 00010010) + 00111111
• Stitch-Up Design 2 (efficient): add PL, PM1, PM2, PH using (n/2)-bit adders, with the Couts of lower-order adds propagating as Cins to the next higher-order adds (the lsb of the MS half is available @ del = n/2 + 2, intermediate sums @ del = 2[n/2] + 1, the last Cin @ del = 2[n/2] + 2, lsb of the final MS half @ del = 3[n/2] + 1). Cost = 5 (n/2)-bit adders = 2.5n FAs for RCAs. Critical path: Delay = 3*((n/2)-bit adder delay) = 1.5n*(FA delay) for RCAs.
• This (1.5n FA delay units) is the delay assuming PL … PH have been computed.
• What is the delay of the entire multiplier? Note that the stitch-up of a level can start when the lsb of the msb half of the product bits of each of the 4 products PL … PH is available: for the top level this is at n/4 + 2 after the previous level's such input is available.
• Using RCAs: (n-1) [the delay after the lsb of the msb half is available at the top level] + (n/2 + 2) + (n/4 + 2) + … + (2 + 2) (stopping at the 4-bit mult) + 2 [boundary-case 2-bit mult delay at bit 3] + 1/3 [the delay of 1 2-i/p gate translated into FA delay units, an FA delay being 3 2-i/p gate delays] = (n-1) + (1/2)(Σ_{i=0}^{log n} 2^i) + 2(log n + 1) – 1.17 [corrective term for taking the prev. summation only up to i = 1, 0] = n-1 + (1/2)(2n-1) + 2(log n + 1) - 1.17 ~ 2(n + log n) ~ Θ(2n) FA delays — similar to the well-known array multiplier that uses carry-save adders.
• We were able to obtain this similar-to-array-multiplier design using basic D&C guidelines, and it did not require the extensive ingenuity it might have for the designers of the array multiplier.
[Figure: SU2(n) = stitch-up design 2 for multiplication — the recursion tree has 4 SU2(n/2)'s below the root SU2(n), each with 4 SU2(n/4)'s, and so on.]
• What is its cost in terms of # of FAs (RCAs)?
• The level below the root (root = 1st level) has 4 (n/2)-bit multiplies to generate the PL … PH of the root, 16 (n/4)-bit multiplies at the next level, up to 2-bit mults. at level log n.
• Thus FAs used = 2.5[n + 4(n/2) + 16(n/4) + …] + 4^{log n - 1}*(2) + 4^{log n}*(1/7) [the last two terms are for the boundary cases of 2-bit and 1-bit multipliers that require 2 and 1/7 FAs each, resp.]
= 2.5n(Σ_{i=0}^{log n - 2} 2^i) + 2(n/2)^2 + (1/7)n^2 = 2.5[n(n/2 - 1)/(2 - 1)] + 0.64n^2 = 1.25n^2 - 2.5n + 0.64n^2 ~ 1.89n^2 = Θ(n^2).
• Why do we add 1/7 FA cost units for each 1-bit multiplier (which is a 2-i/p AND gate)?
• Using CSvA's [see later], the cost is similar (quadratic in n, i.e., Θ(n^2)).
Multiplier D&C (cont’d): Carry-Save Addition
Multiplier D&C (cont’d): Carry-Save Add. Based Stitch-Up
[Figure: Stitch-up # 3 — carry-save-adder (CSvA) based stitch-up. Each sub-product arrives as a sum # and a carry # (S(PL), C(PL), …, S(PH), C(PH)), i.e., n/2-bit (C & S) groups; separate (and thus parallel) carry-save adds are done for each of the 4 (n/2)-bit groups shown at the top level of the multiplication. Adding 6 numbers in parallel using CSvA's takes 3 delay units and 4 CSvA's; one group adds 7 numbers (7 lsb bits need to be added): 4 delay units; one group needs no CSvA.]
• Using CSvAs (carry-save adders): each sub-prod., e.g., PL, is formed of 2 #s (sum bits and carry bits), so there are 8 n-bit #s to be CSvA'ed in the final stitch-up, taking a delay of approx. 5 units if done in seq. but only 4 units if done in parallel. We then get 2 final #s (a carries # and a sums #) that are added by a carry-propagate adder like a CLA, which takes Θ(log n) time, and the overall multiplier delay is Θ(4*log n) [4 time units at each of the (log n - 2) levels (need at least 2-bit inputs for the above structure to be valid) + at most 2 time units for the bottom two levels (why?)] + Θ(log n) = Θ(log n) — similar to a Wallace-tree mult.
• We were able to obtain this fast design using D&C (and did not need the extensive ingenuity that W-T multiplier designers must have needed)!
• Hardware cost (# of FAs), ignoring the final carry-prop. adder, for the entire mult.? Exercise.
D&C Example Where a “Straightforward” Breakup Does Not Work
• Problem: n-bit Majority Function (MF): Output f = 1 when a majority of bits is 1, else f =0
[Figure: straightforward breakup — root problem A: n-bit MF [MF(n)] with o/p f, broken into Subprob. A1: MF(LS n/2 bits) with o/p f1 and Subprob. A2: MF(MS n/2 bits) with o/p f2, stitched up by St. Up (SU).]
• Need to ask (general Qs for any problem): Is the stitch-up function SU required in the
above straightforward breakup of MF(n) into two MF’s for the MS and LS n/2 bits:
 Computable?
 Efficient in both hardware and speed?
• Try all 4 combinations of f1, f2 values and check if it is possible for any function w/ i/ps
f1, f2 to determine the correct f value:
 f1 = 0, f2 = 0  # of 1’s in minority (<= n/4) in both halves, so totally # of 1’s <= n/2  f = 0
 f1 = 1, f2 = 1  # of 1’s in majority (> n/4) in both halves, so totally # of 1’s > n/2  f = 1
 f1 = 0, f2 = 1  # of 1’s <= n/4 in LS n/2 and > n/4 in MS n/2, but this does not imply if total
# of 1’s is <= n/2 or > n/2. So no function can determine the correct f value (it will need
more info, like exact count of 1’s)
 f1 = 1, f2 = 0: same situation as the f1 = 0, f2 = 1 case.
 Thus the stitch-up function is not even computable in the above breakup of MF(n).
D&C Example Where a “Straightforward” Breakup Does Not Work
(contd.)
• Try another breakup, this time of MF(n) into functions that are different from MF.
[Figure: another breakup — root problem A: n-bit MF [MF(n)] with o/p f; Subprob. A1: count the # of 1's in the n bits (a ((log n)+1)-bit o/p); Subprob. A2: a (>) compare of A1's o/p and floor(n/2), producing f.]
• Have seen (log n) delay (>) comparator for two n-bit #s using D&C
• Can we do 1-counting using D&C? How much time will this take?
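The working breakup can be sketched in Python (my own model): a D&C 1-counter whose stitch-up is an add, followed by the (>) compare against floor(n/2).

```python
def count_ones(bits):
    """D&C 1-counter: count each half, stitch up with an addition."""
    if len(bits) == 1:
        return bits[0]
    mid = len(bits) // 2
    return count_ones(bits[:mid]) + count_ones(bits[mid:])

def majority(bits):
    """MF(n): f = 1 iff more than floor(n/2) of the bits are 1."""
    return 1 if count_ones(bits) > len(bits) // 2 else 0

assert majority([1, 1, 1, 0]) == 1
assert majority([1, 0, 1, 0]) == 0   # exactly half is not a majority
```

Note that the 1-count carries exactly the extra information (the exact count) that the f1, f2 bits of the straightforward breakup could not provide.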
Dependency Resolution in D&C:
(1) The Wait Strategy
• So far we have seen D&C breakups in which there is no data
dependency between the two (or more) subproblems of the breakup
• Data dependency leads to increased delays
• We now look at various ways of speeding up designs that have
subproblem dependencies in their D&C breakups
[Figure: Root problem A broken into Subprob. A1 and Subprob. A2, with data flow from A1 to A2.]
• Strategy 1: Wait for required o/p of A1 and then perform A2, e.g.,
as in a ripple-carry adder: A = n-bit addition, A1 = (n/2)-bit addition
of the L.S. n/2 bits, A2 = (n/2)-bit addition of the M.S. n/2 bits
• No concurrency between A1 and A2:
t(A) = t(A1) + t(A2) + t(stitch-up)
= 2*t(A1) + t(stitch-up) if A1 and A2 are the same problems of
the same size (w/ different i/ps)
Adder Design using D&C
• Example: Ripple-Carry Adder (RCA)
– Stitching up: Carry from LS n/2 bits is
input to carry-in of MS n/2 bits at
each level of the D&C tree.
– Leaf subproblem: Full Adder (FA)
[Figure (a): D&C for Ripple-Carry Adder — Add n-bit #s X, Y → Add MS n/2 bits of X, Y and Add LS n/2 bits of X, Y, recursively down to FAs at the leaves.]
Example of the Wait Strategy in Adder Design
• Note: Gate delay is proportional to the # of inputs (since, generally, there is a series connection of transistors in either the pull-up or pull-down network = # of inputs  the R's of the transistors in series add up, prop. to the # of inputs  delay ~ RC (C is the capacitive load) is prop. to the # of inputs)
• The 5-i/p gate delay stated above for a FA is correct if we have 2-3 i/p gates available
(why?), otherwise, if only 2-i/p gates are available, then the delay will be 6-i/p gate delays
(why?).
• Assume each gate i/p contributes 2 ps of delay
• For a 16-bit adder the delay will be 160 ps
• For a 64 bit adder the delay will be 640 ps
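Under the slide's assumptions (an FA delay of 5 gate-i/p delays, 2 ps per gate i/p), the Wait-strategy (ripple) delay is just n FA delays in series, which a short Python check confirms:

```python
def rca_delay_ps(n_bits, fa_gate_inputs=5, ps_per_gate_input=2):
    """Ripple-carry (Wait strategy) delay: n FA delays in series,
    each FA delay = 5 gate-i/p delays (per the slide's assumptions)."""
    return n_bits * fa_gate_inputs * ps_per_gate_input

assert rca_delay_ps(16) == 160   # 16-bit adder: 160 ps
assert rca_delay_ps(64) == 640   # 64-bit adder: 640 ps
```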
Adder Design using D&C—Lookahead Wait (not in syllabus)
• Example: Carry-Lookahead Adder (CLA)
– Division: 4 subproblems per level: Add n-bit #s X, Y → add ms n/4 bits, add 3rd n/4 bits, add 2nd n/4 bits, add ls n/4 bits, each producing local P, G
– Stitching up: a more complex stitching-up process (generation of global or "super" P, G's to connect up the subproblems)
– Leaf subproblem: 4-bit basic CLA with small p, g bits
[Figure (a): D&C for Carry-Lookahead Adder w/ a Linear Global P, G Ckt — a linear connection of the local P, G's from each unit determines the global or super P, G for each unit. But linear delay, so not much better than an RCA.]
• More intricate techniques (like P, G generation in a CLA) for complex stitching up for fast designs may need to be devised that are not directly suggested by D&C. But D&C is a good starting point.
• However, the global P for each unit is an associative function, so it can be done in max log n time (for the last unit; less time for earlier units) using a tree connection of the local P, G's to do a prefix computation.
[Figure (b): D&C for Carry-Lookahead Adder w/ a Tree-like Global P, G Ckt — the same breakup, but a tree connection of the local P, G's from each unit determines the global P, G for each unit (P is associative).]
Dependency Resolution in D&C:
(2) The “Design-for-all-cases-&-select (DAC)” Strategy
[Figure: Root problem A → Subprob. A1 and 4 copies of Subprob. A2, with hardwired i/ps 00, 01, 10, 11 (I/p00, I/p01, I/p10, I/p11), feeding a 4-to-1 Mux whose select i/p is A1's o/p.]
• Strategy 2: DAC: For a k-bit i/p from A1 to A2,
design 2k copies of A2 each with a different
hardwired k-bit i/p to replace the one from A1.
• Select the correct o/p from all the copies of A2
via a (2k)-to-1 Mux that is selected by the k-bit
o/p from A1 when it becomes available (e.g.,
carry-select adder)
• t(A) = max(t(A1), t(A2)) + t(Mux) + t(stitch-up)
= t(A1) + t(Mux) + t(stitch-up) if A1 and A2 are
the same problems
(2) The “Design-for-all-cases-&-select (DAC)” Strategy (cont’d)
[Figure: a D&C tree with a mix of DAC and Wait strategies for dependency resolution between subproblems — DAC used at the upper levels, the Wait strategy at all lower levels after the 1st Wait level; SUP = stitch-up.]
• The DAC strategy has a MUX delay involved, and at small subproblems, the delay of a subproblem
may be smaller than a MUX delay.
• Thus a mix of DAC and Wait strategies, as shown in the above figure, may be faster, w/ DAC used at
higher levels and Wait at lower levels.
Example of the DAC Strategy in Adder Design
[Figure: a carry-select adder stage built of 4-bit blocks, with a simplified Mux for Cout.]
• For a 16-bit adder, the delay is (9*4 – 4)*2 = 64 ps (2 ps is the delay for a single
i/p); a 60% improvement ((160-64)*100/160) over RCA
• For a 64-bit adder, the delay is (9*8 – 4)*2 = 136 ps; a 79% improvement over RCA
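The slide's delay figures and improvement percentages check out numerically (the (9*blocks - 4) expressions are taken from the slide as given; the `improvement` helper is mine):

```python
def improvement(rca_ps, dac_ps):
    """Percent speedup of the carry-select (DAC) adder over the RCA."""
    return round((rca_ps - dac_ps) * 100 / rca_ps)

d16 = (9 * 4 - 4) * 2            # 16-bit carry-select delay, 2 ps/gate-i/p
d64 = (9 * 8 - 4) * 2            # 64-bit version
assert (d16, d64) == (64, 136)
assert improvement(160, d16) == 60   # vs. the 160 ps 16-bit RCA
assert improvement(640, d64) == 79   # vs. the 640 ps 64-bit RCA
```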
Dependency Resolution in D&C:
(3) Speculative Strategy
[Figure: Root problem A → Subprob. A1 and Subprob. A2 performed concurrently; a 2-to-1 Mux (I0 = an estimate (guess) of A2's i/p, based on analysis or stats; I1 = o/p(A1→A2)) drives A2's i/p. FSM Controller: if o/p(A1→A2) = guess(A2), then generate a completion signal after some delay corresponding to the stitch-up; else set the i/p to A2 = o/p(A1→A2) and generate the completion signal after the delay of A2 + stitch-up.]
• Speculative Strategy: Have a single copy of A2 but choose a highly likely value of the k-bit i/p
and perform A1, A2 concurrently. If, once the k-bit i/p from A1 is available, the selection turns out to be incorrect,
re-do A2 w/ the correct available value.
• t(A) = p(correct-choice)*max(t(A1), t(A2)) + (1 - p(correct-choice))*[t(A1) + t(A2)] + t(stitch-up), where p(correct-choice) is the probability that our choice of the k-bit i/p for A2 is correct.
• For t(A1) = t(A2), this becomes: t(A) = p(correct-choice)*t(A1) + (1 - p(correct-choice))*2t(A1) + t(stitch-up) = t(A1) + (1 - p(correct-choice))*t(A1) + t(stitch-up)
• Need a completion signal to indicate when the final o/p is available for A; assuming worst-case time (when the choice is incorrect) is meaningless in such designs
• Need an FSM controller for determining if the guess is correct and, if not, redoing A2
(allowing more time for generating the completion signal).
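The expected-delay formula can be evaluated directly (a sketch; the numbers below are illustrative, not from the slides):

```python
def speculative_delay(t_a1, t_a2, t_stitch, p_correct):
    """t(A) = p*max(t(A1), t(A2)) + (1-p)*(t(A1) + t(A2)) + t(stitch-up),
    where p = p(correct-choice)."""
    return (p_correct * max(t_a1, t_a2)
            + (1 - p_correct) * (t_a1 + t_a2)
            + t_stitch)

# t(A1) = t(A2) = 10, t(stitch-up) = 1, 90% correct guesses:
# t(A) = t(A1) + (1 - p)*t(A1) + t(stitch-up) = 10 + 1 + 1 = 12
t = speculative_delay(10, 10, 1, 0.9)
assert abs(t - 12) < 1e-9
```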
Dependency Resolution in D&C:
(4) The “Independent Pre-Computation” Strategy
[Figure: Concept — Root problem A → Subprob. A1 and Subprob. A2, with data flow from A1 to A2; A2 is restructured into A2_indep followed by A2_dep. Example of unstructured logic for A2 (i/ps u, v, w, x, y, z and a1 from A1): critical path after a1 is available = 8-unit delay; after restructuring, critical path after a1 is available = 4-unit delay.]
• Strategy 4: Reconfigure the design of A2 so that it can do as much processing as possible that is
independent of the i/p from A1 (A2_indep). This is the “independent” computation that prepares for the final
computation of A2 (A2_dep) that can start once A2_indep and A1 are done.
• t(A) = max(t(A1), t(A2_indep)) + t(A2_dep) + t(stitch-up)
• E.g., let a1 be the i/p from A1 to A2, and suppose A2 has the logic a2 = v'x' + uvx + w'xy + wz'a1 + u'xa1. If this were
implemented using 2-i/p AND/OR gates, the delay would be 8 delay units (1 unit = delay for 1 i/p) after a1 is
available. If the logic is re-structured as a2 = (v'x' + uvx + w'xy) + (wz' + u'x)a1, and if the logic in the 2
brackets is performed before a1 is available (these constitute A2_indep), then the delay is only 4 delay
units after a1 is available.
• Such a strategy requires factoring of the external i/p a1 in the logic for a2, and grouping & implementing all
the non-a1 logic, and then adding logic to “connect” up the non-a1 logic to a1 as the last stage.
D&C Summary
• For complex digital design, we need to think of the “computation”
underlying the design in a structured manner---are there properties of
this computation that can be exploited for faster, less expensive,
modular design; is it amenable to the D&C approach? Think of:
– Breakup into >= 2 subprobs via breakup of (# of operands) or (operand sizes [bits])
– Stitch-up (is it computable?)
– Leaf functions
– Dependencies between sub-problems and how to resolve them
• The design is then developed in a structured manner & the
corresponding circuit may be synthesized by hand or described
compactly using a HDL (e.g., structural VHDL)
• For an operation/func x on n operands (an-1 x an-2 x …… x a0), if x is
associative, the D&C approach gives an "easy" stitch-up function,
which is x on 2 operands (the o/ps of applying x on each half). This results
in a tree-structured circuit with Θ(log n) delay instead of a linearly-connected circuit with Θ(n) delay.
• If x is non-associative, more ingenuity and determination of properties
of x is needed to determine the breakup and the stitch-up function. The
resulting design may or may not be tree-structured
• If there is dependency between the 2 subproblems, then we saw
strategies for addressing these dependencies:
– Wait (slowest, least hardware cost)
– Design-for-all-cases (high speed, high hardware cost)
– Speculative (medium speed, medium hardware cost)
– Independent pre-computation (medium-slow speed, low hardware cost)
Strategy 2: A general view of DAC
computations (w/ or w/o D&C)
• If there is a data dependency between two or more portions of a computation (which may be obtained w/ or w/o using D&C), don't wait for the "previous" computation to finish before starting the next one
• Assume all possible input values for the next computation/stage B (e.g., if it has 2 inputs from the prev. stage there will be 4 possible input value combinations) and perform it using a copy of the design for each possible input value
• All the different o/ps of the diff. copies of B are Mux'ed using prev. stage A's o/p
• E.g. design: Carry-Select Adder (each stage performs two additions, one for a carry-in of 0 and another for a carry-in of 1 from the previous stage)
[Figure (a): original design — stage A (o/ps y, z) feeds stage B (o/p x); Time = T(A) + T(B).]
[Figure (b): speculative computation — 4 copies B(0,0), B(0,1), B(1,0), B(1,1) with hardwired i/ps; their o/ps are Mux'ed (4:1) using A's o/ps y, z. Time = max(T(A), T(B)) + T(Mux). Works well when T(A) approx = T(B) and T(A) >> T(Mux).]
Strategy 3: Get the Best of Both Worlds
(Average and Worst Case Delays)!
Approximate analysis: avg. dividend value = 2^{n-1}. For divisor values in the "lower half range" [1, 2^{n-1}], the average quotient value is given by the Harmonic series (1 + 1/2 + 1/3 + … + 1/2^{n-1}) ~ ln(2^{n-1}) ~ (n-1)/1.4 (integration of 1/k from 1 to 2^{n-1}). The quotient for divisors in the upper half range [2^{n-1} + 1, 2^n] is 0  the overall avg. quotient = (n-1)/2.8  avg. subtractions needed = 1 + (n-1)/2.8 = Θ(n/2.8)
[Figure: the two division ckts fed from common input registers — a Unary Division Ckt (good ave case: Θ(n/2.8) subs; bad worst case: Θ(2^n) subs; completion signal done1) and a Non-Restoring Div. Ckt (bad ave case: Θ(n) subs; good worst case: Θ(n) subs; completion signal done2); an external FSM (start i/p, done1/done2 i/ps) drives the select of a Mux that steers the first available output into the output register.]
• Use 2 circuits with different worst-case and average-case behaviors
• Use the first available output
• Get the best of both (ave-case, worst-case) worlds
• In the above schematic, we get the good ave-case performance of unary division (assuming uniformly distributed inputs) w/o the disadvantage of its bad worst-case performance: ave. case = Θ(n/2.8) subs, worst case = Θ(n) subs
Strategy 4: Pipeline It!
[Figure: conversion of an original ckt or datapath into a simple level-partitioned pipeline — clocked registers between Stage 1, Stage 2, …, Stage k. (A level partition may not always be possible, but other pipelineable partitions may be.)]
• Throughput is defined as # of outputs / sec
• Non-pipelined throughput = (1 / D), where D = delay of original ckt’s datapath
• Pipeline throughput = 1/ (max stage delay + register delay)
• Special case: If original ckt’s datapath is divided into n stages, each of equal delay, and
dr is the delay of a register, then pipeline throughput = 1/((D/n)+dr).
• If dr is negligible compared to D/n, then pipeline throughput = n/D, n times that of the
original ckt
• FSM controller may be needed for non-uniform stage delays; not needed otherwise
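The throughput formulas above can be checked numerically (a sketch with illustrative numbers, not from the slides):

```python
def throughput(stage_delays, reg_delay):
    """Pipeline throughput = 1 / (max stage delay + register delay)."""
    return 1.0 / (max(stage_delays) + reg_delay)

D = 100.0                         # delay of the original ckt's datapath
stages = [D / 4] * 4              # n = 4 equal stages of D/n each
# with negligible register delay, throughput -> n/D, n times the
# unpipelined throughput of 1/D
assert abs(throughput(stages, 0.0) - 4 / D) < 1e-12
assert throughput(stages, 0.0) > 1 / D
```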
Strategy 4: Pipeline It! (contd.)
• Legend: pipeline registers, inserted at every level of the comparator tree
[Figure: the final 8-bit comparator design of the earlier slide (8 1-bit comparators on A[7] B[7] … A[0] B[0] feeding log n levels of 2-bit 2:1 Muxes, with F = my1(6) at the root), with a register inserted after every level.]
• Comparator o/p produced every 1 unit of time, instead of every (log n + 1) units of time, where 1 time unit here = delay of a mux or a 1-bit comparator (both will have the same or similar delay)
• We can reduce reg. cost by inserting registers at every 2 levels; the throughput decreases to 1 o/p per every 2 units of time
Strategy 4: Pipeline It! (contd.)
Legend: input registers; intermediate & output registers
Pipelined Ripple Carry Adder
Problem: I/P and O/P data direction is not the same as the computation direction.
They are perpendicular!
[Figure: pipelined ripple-carry adder — each FA stage is registered; sum o/ps emerge in staggered pairs: S1, S0 o/ps for i/ps recvd 4 cc back, then S3, S2; S5, S4; S7, S6 (for i/ps recvd 4 cc back), with the next S1, S0 o/ps following.]
Adder o/p produced every 2 units of FA delay instead of every n units of FA delay in an n-bit RCA