CMOS VLSI Design CMOS VLSI Design 4th Ed.


Lecture 18: Datapath Functional Units

Outline

■ Multi-input Adders
■ Comparators
■ Shifters
■ Multipliers
■ More complex operations

18: Datapath Functional Units CMOS VLSI Design 4th Ed.

Carry-Save Adders (CSA)

■ A carry-save adder adds three $n$-bit operands $A_0$, $A_1$, and $A_2$ without performing any carry propagation:

$$2C + S = A_0 + A_1 + A_2$$

$$2c_{i+1} + s_i = \sum_{j=0}^{2} a_{j,i}; \quad i = 0, 1, \dots, n-1$$

■ It can also add one $n$-bit operand to an $n$-digit carry-save operand.
■ The result is in carry-save format.

Carry-Save Adders

■ Parallel arrangement of full adders => constant delay: $A = 7n$, $T = 4$ (unit-gate model).
■ Multi-operand carry-save adders are also possible ($m > 3$)
  – Array or tree arrangement.

Multi-Operand Adders

■ Add three or more ($m > 2$) $n$-bit operands.
■ Array adders
  – Linear arrangement of CPAs
  – Linear arrangement of CSAs and a final CPA
    • The final CPA has to be fast. If it is an RCA, the two alternatives perform equally.

4-Operand CPA Array

4-Operand CSA Array

Multi-input Adders

■ Suppose we want to add k N-bit words
  – Ex: 0001 + 0111 + 1101 + 0010 = 10111
■ Straightforward solution: k-1 N-input CPAs
  – Large and slow

     0001 + 0111 =  1000
     1000 + 1101 = 10101
    10101 + 0010 = 10111

Carry Save Addition

■ A full adder sums 3 inputs and produces 2 outputs
  – Carry output has twice the weight of the sum output
■ N full adders in parallel are called a carry save adder
  – Produce N sums and N carry outs

[Figure: four full adders with inputs X4…X1, Y4…Y1, Z4…Z1 and outputs C4…C1, S4…S1; block symbol of an n-bit CSA with inputs X, Y, Z (N...1) and outputs C, S (N...1)]

CSA Application

■ Use k-2 stages of CSAs
  – Keep result in carry-save redundant form
■ Final CPA computes actual result

Example: 0001 + 0111 + 1101 + 0010

    4-bit CSA:  X = 0001,   Y = 0111,  Z = 1101   =>  S = 1011,   C = 0101_
    5-bit CSA:  X = 0101_,  Y = 1011,  Z = 0010   =>  S = 00011,  C = 01010_
    CPA:        A = 01010_, B = 00011             =>  S = 10111
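The carry-save scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming non-negative integers as operands; the function names `carry_save_add` and `multi_operand_add` are hypothetical and do not come from the slides.

```python
from typing import List, Tuple


def carry_save_add(x: int, y: int, z: int) -> Tuple[int, int]:
    """One CSA stage: independent full adders per bit, no carry propagation."""
    s = x ^ y ^ z                                # sum output of each full adder
    c = ((x & y) | (x & z) | (y & z)) << 1       # carry output has twice the weight
    return s, c


def multi_operand_add(words: List[int]) -> int:
    """Add k words with k-2 CSA stages followed by one carry-propagate add."""
    if not words:
        return 0
    if len(words) == 1:
        return words[0]
    s, c = words[0], words[1]
    for w in words[2:]:
        s, c = carry_save_add(s, c, w)           # result stays in carry-save form
    return s + c                                 # the final CPA


if __name__ == "__main__":
    print(bin(multi_operand_add([0b0001, 0b0111, 0b1101, 0b0010])))  # 0b10111
```

The loop performs k-2 CSA stages for k operands, and only the last line pays for carry propagation, which is the point of keeping the intermediate result redundant.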

(m,2)-Compressors

$$2\left(c + \sum_{l=0}^{m-4} c_{out}^{l}\right) + s = \sum_{i=0}^{m-1} a_i + \sum_{l=0}^{m-4} c_{in}^{l}$$

■ 1-bit adders, similar to (m,k)-counters.
■ Compress m bits down to 2 by forwarding (m-3) intermediate carries to the next higher bit position.
■ No horizontal carry propagation.
■ Built from full adders ((3,2) compressors) or (4,2) compressors arranged in linear or tree structures.

(m,2)-Compressors

■ Example: 4-operand adder using (4,2) compressors.

$$A = 7(m-2), \quad T_{LIN} = 4(m-2), \quad T_{TREE} = O(\log m)$$

(m,2)-Compressors

■ Structure of a (4,2) compressor

(m,2)-Compressors

■ Advantages of (4,2)-compressors over FAs for realizing (m,2)-compressors:
  – Higher compression rate.
  – Less deep and more regular trees.
■ Example: (8,2) compressor.
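One common way to build a (4,2) compressor is from two chained full adders. The sketch below assumes that construction (the slide's own figure is not available in this transcript), and the names `full_adder` and `compressor_4_2` are hypothetical.

```python
from itertools import product
from typing import Tuple


def full_adder(a: int, b: int, c: int) -> Tuple[int, int]:
    """(3,2) compressor: returns (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(a0: int, a1: int, a2: int, a3: int, c_in: int) -> Tuple[int, int, int]:
    """(4,2) compressor from two full adders: returns (s, c, c_out)."""
    s1, c_out = full_adder(a0, a1, a2)   # c_out goes to the next higher bit position
    s, c = full_adder(s1, a3, c_in)      # c_in comes from the next lower position
    return s, c, c_out


if __name__ == "__main__":
    for bits in product((0, 1), repeat=5):
        s, c, c_out = compressor_4_2(*bits)
        assert sum(bits) == s + 2 * (c + c_out)
    print("2(c + c_out) + s equals the sum of the five inputs in all 32 cases")
```

Because `c_out` depends only on `a0`, `a1`, and `a2`, an intermediate carry never ripples through more than one neighboring slice, which is what "no horizontal carry propagation" means here.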

Tree adders (Wallace Tree)

■ Adder tree: n-bit m-operand carry-save adder composed of n tree-structured (m,2) compressors.
■ The fastest multi-operand adders use an adder tree and a fast final CPA.

$$A = n\,A_{(m,2)} + A_{CPA}$$

$$T = T_{(m,2)} + T_{CPA} = O(\log m + \log n)$$

Adder Arrays and Trees

■ Some FAs can often be replaced by HAs or eliminated altogether.
■ The number of FAs does not depend on the adder structure, but the number of HAs does.
■ An m-operand adder accommodates (m - 1) carry inputs.
■ Adder trees (T = O(log n)) are faster than adder arrays (T = O(n)) for the same number of gates (A = O(mn)).
■ Adder trees are less regular and have more complex routing than adder arrays => larger area, difficult layout => limited use in layout generators.

Sequential Adders

■ Bit-serial adder: $A = A_{FA} + A_{FF}$, $T = T_{FA} + T_{FF}$
■ Accumulators
  – With CPA: $A = A_{CPA} + A_{REG}$, $T = T_{CPA} + T_{REG}$
  – With CSA and final CPA: area terms $A_{CSA}$, $A_{CPA}$, and $A_{REG}$; cycle time $T = T_{CSA} + T_{REG}$
    • Allows higher clock rates
    • Final CPA too slow
      – Pipelining or multiple cycles

Complement and Subtraction

■ 2’s complement: $-A = \bar{A} + 1$
■ 2’s complement subtracter: $A - B = A + \bar{B} + 1$
■ 2’s complement adder/subtracter: $A \pm B = A + (B \oplus sub) + sub$
■ 1’s complement adder: $(A + B) \bmod (2^n - 1) = A + B + c_{out}$ (end-around carry)
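A small Python sketch of the 2's complement adder/subtracter may help here. It assumes n-bit operands given as non-negative integers, with the control bit XORed into every bit of B and also used as the carry-in; the name `add_sub` is hypothetical.

```python
from typing import Tuple


def add_sub(a: int, b: int, sub: int, n: int) -> Tuple[int, int]:
    """Return (result, carry_out) of A + B (sub=0) or A - B (sub=1) on n bits."""
    mask = (1 << n) - 1
    b_eff = (b ^ (mask if sub else 0)) & mask    # B xor sub, applied bitwise
    total = (a & mask) + b_eff + (1 if sub else 0)
    return total & mask, total >> n


if __name__ == "__main__":
    print(add_sub(5, 3, 0, 4))   # (8, 0)
    print(add_sub(5, 3, 1, 4))   # (2, 1)
    print(add_sub(3, 5, 1, 4))   # (14, 0), i.e. -2 in 4-bit two's complement
```

For subtraction of unsigned operands, a carry-out of 1 indicates that no borrow occurred (A >= B).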



Subtraction

Increment/Decrement

■ Adds a single bit $c_{in}$ to an $n$-bit operand $A$:

$$(c_{out}, Z) = A + c_{in}$$

$$z_i = a_i \oplus c_i, \quad c_{i+1} = a_i c_i; \quad i = 0, \dots, n-1$$

$$c_0 = c_{in}, \quad c_{out} = c_n \quad \text{(r.m.a.)}$$

■ Corresponds to addition with B=0 (FA -> HA): $A = 3n$, $T = n + 1$ in the unit-gate model.

Increment/Decrement

■ Or, use incrementer slices.
■ Prefix problem: $C_{i:k} = C_{i:j+1}\, C_{j:k}$ => AND-prefix structure

$$A = \tfrac{1}{2}\, n \log n + 2n, \quad T = \log n + 2, \quad AT \approx \tfrac{1}{2}\, n \log^2 n$$
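The AND-prefix view of incrementing can be illustrated in Python. The sketch below uses a Kogge-Stone style doubling scheme for the prefix ANDs, which is an assumption made for brevity and not necessarily the prefix structure drawn on the slides; `increment_prefix` is a hypothetical name.

```python
from typing import Tuple


def increment_prefix(a: int, n: int, c_in: int = 1) -> Tuple[int, int]:
    """Return (z, c_out) with z = (a + c_in) mod 2**n, using prefix ANDs."""
    bits = [(a >> i) & 1 for i in range(n)]
    # p[i] becomes a_0 & a_1 & ... & a_i after about log2(n) doubling steps.
    p = bits[:]
    d = 1
    while d < n:
        p = [p[i] & p[i - d] if i >= d else p[i] for i in range(n)]
        d *= 2
    # Carry into position i+1 is c_in AND all operand bits up to position i.
    c = [c_in] + [c_in & p[i] for i in range(n)]
    z = sum((bits[i] ^ c[i]) << i for i in range(n))
    return z, c[n]


if __name__ == "__main__":
    print(increment_prefix(0b0111, 4))   # (8, 0)
    print(increment_prefix(0b1111, 4))   # (0, 1)
```

Each pass of the `while` loop corresponds to one level of AND gates, so the carry depth grows with log n instead of n.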

Increment/Decrement

■ Decrementer: $(c_{out}, Z) = A - c_{in}$
■ Incrementer-decrementer: $(c_{out}, Z) = A \pm c_{in}$

Fast Incrementers

■ 4-bit incrementer using multi-input gates
■ 8-bit parallel-prefix incrementer (Sklansky-AND prefix structure)

Gray Incrementer

■ Increments in Gray number system

$$c_0 = a_{n-1} \oplus a_{n-2} \oplus \cdots \oplus a_0 \quad \text{(parity)}$$

$$c_{i+1} = \bar{a}_i\, c_i; \quad i = 0, \dots, n-3 \quad \text{(r.m.a.)}$$

$$z_0 = a_0 \oplus \bar{c}_0$$

$$z_i = a_i \oplus a_{i-1}\, c_{i-1}; \quad i = 1, \dots, n-2$$

$$z_{n-1} = a_{n-1} \oplus c_{n-2}$$

■ Prefix problem => AND-prefix structure
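A Gray incrementer can be checked against ordinary binary counting. The Python sketch below assumes the usual rule (flip bit 0 when the parity is even, otherwise flip the bit to the left of the rightmost 1, wrapping at the MSB) expressed as a parity bit plus an AND chain; the helper names `to_gray` and `gray_increment` are hypothetical.

```python
def to_gray(b: int) -> int:
    """Binary to reflected Gray code."""
    return b ^ (b >> 1)


def gray_increment(a: int, n: int) -> int:
    """Successor of the n-bit Gray code word a (n >= 2), wrapping around."""
    if n < 2:
        raise ValueError("this sketch needs at least 2 bits")
    bit = [(a >> i) & 1 for i in range(n)]
    c = [0] * (n - 1)
    c[0] = sum(bit) & 1                          # parity of the word
    for i in range(n - 2):
        c[i + 1] = (1 - bit[i]) & c[i]           # odd parity and bits 0..i all zero
    z = [0] * n
    z[0] = bit[0] ^ (1 - c[0])                   # flip the LSB when parity is even
    for i in range(1, n - 1):
        z[i] = bit[i] ^ (bit[i - 1] & c[i - 1])  # flip left of the rightmost 1
    z[n - 1] = bit[n - 1] ^ c[n - 2]
    return sum(zi << i for i, zi in enumerate(z))


if __name__ == "__main__":
    word = 0
    for _ in range(8):
        print(format(word, "03b"))
        word = gray_increment(word, 3)
```

The chain `c[i + 1] = (not a_i) and c[i]` is an AND-prefix computation, so the same speed-up structures as for the binary incrementer apply.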

Counting

■ Count clock cycles => counter
■ Divide clock frequency => frequency divider ($c_{out}$)
■ Binary counter
  – Sequential incrementer/decrementer
  – Incrementer speed-up techniques applicable
  – Down- and up-down counters using incrementers or incrementer-decrementers

Example

■ Ripple-carry up-counter using counter slices (HA+FF); $c_{in}$ is count enable.

Synchronous Counters

Asynchronous Counters

■ Uses toggle flip-flops.
  – Lower toggle rate => lower power

Gray Counter

■ Counter using Gray incrementers

Fast Counters

Ring Counters

■ Shift register connected to ring
■ State is not encoded => n FF for counting n states.
■ Must be initialized correctly.
■ Applications:
  – Fast dividers (no logic between FF)
  – State counter for one-hot coded FSMs

Johnson Counter

■ Inverted feedback
■ n FF for counting 2n states.

3-bit LFSR

Cycle   Q0   Q1   Q2/Y
0       1    1    1
1       0    1    1
2       0    0    1
3       1    0    0
4       0    1    0
5       1    0    1
6       1    1    0
7       1    1    1
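The state sequence of the 3-bit LFSR can be reproduced with a short simulation. The feedback used below (next Q0 = Q1 XOR Q2, with Q1 and Q2 shifting along) is an assumption inferred from the tabulated sequence, since the circuit figure is not part of this transcript; `lfsr3_states` is a hypothetical name.

```python
from typing import List, Tuple


def lfsr3_states(seed: Tuple[int, int, int] = (1, 1, 1), cycles: int = 8) -> List[Tuple[int, int, int]]:
    """Return the (Q0, Q1, Q2) state for each of the first `cycles` clock cycles."""
    q0, q1, q2 = seed
    states = []
    for _ in range(cycles):
        states.append((q0, q1, q2))
        # Assumed feedback: XOR of Q1 and Q2 feeds Q0; the other bits shift.
        q0, q1, q2 = q1 ^ q2, q0, q1
    return states


if __name__ == "__main__":
    for cycle, state in enumerate(lfsr3_states()):
        print(cycle, *state)
```

With the all-ones seed, the register visits all 7 nonzero states before repeating, and the all-zero state is never reached.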

8-bit LFSR

Comparators

■ Comparison operations

$$EQ = (A = B) \quad \text{(equal)}$$
$$NE = (A \ne B) = \overline{EQ} \quad \text{(not equal)}$$
$$GE = (A \ge B) \quad \text{(greater or equal)}$$
$$LT = (A < B) = \overline{GE} \quad \text{(less than)}$$
$$GT = (A > B) = GE \cdot \overline{EQ} \quad \text{(greater than)}$$
$$LE = (A \le B) = \overline{GE} + EQ \quad \text{(less or equal)}$$

Comparators

■ 0’s detector: A = 00…000
■ 1’s detector: A = 11…111
■ Equality comparator: A = B
■ Magnitude comparator: A < B

1’s & 0’s Detectors

■ 1’s detector: N-input AND gate
■ 0’s detector: NOTs + 1’s detector (N-input NOR)

[Figure: gate trees over inputs A7…A0 and A3…A0 with outputs allones and allzeros]

Equality Comparison

$$EQ = (A = B)$$

$$eq_{i+1} = (a_i = b_i)\, eq_i = \overline{(a_i \oplus b_i)}\, eq_i; \quad i = 0, \dots, n-1$$

$$eq_0 = 1, \quad EQ = eq_n \quad \text{(r.s.a.)}$$

Equality Comparator

■ Check if each bit is equal (XNOR, aka equality gate)
■ 1’s detect on bitwise equality

[Figure: XNOR gates on B[3] A[3] … B[0] A[0] feeding a 1’s detector that outputs A = B]



Magnitude Comparison

$$GE = (A \ge B)$$

$$ge_{i+1} = (a_i > b_i) + (a_i = b_i)\, ge_i = a_i \bar{b}_i + \overline{(a_i \oplus b_i)}\, ge_i; \quad i = 0, \dots, n-1$$

$$ge_0 = 1, \quad GE = ge_n \quad \text{(r.s.a.)}$$
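The ripple comparison can be sketched bit-serially in Python, scanning from the LSB upward so that higher bits override the result of lower ones. The recurrences used here (start with "equal" and "greater or equal" both true, then update per bit) are the standard ones and are written out as an assumption, since the slide formulas are damaged in this transcript; `ripple_compare` is a hypothetical name.

```python
from typing import Tuple


def ripple_compare(a: int, b: int, n: int) -> Tuple[int, int]:
    """Return (EQ, GE) for unsigned n-bit operands using per-bit comparator slices."""
    eq, ge = 1, 1                                 # eq_0 = 1, ge_0 = 1
    for i in range(n):                            # LSB first, MSB last
        ai, bi = (a >> i) & 1, (b >> i) & 1
        same = 1 - (ai ^ bi)                      # XNOR of the two bits
        ge = (ai & (1 - bi)) | (same & ge)        # a_i > b_i, or equal and GE so far
        eq = same & eq
    return eq, ge


if __name__ == "__main__":
    print(ripple_compare(0b1010, 0b0110, 4))   # (0, 1)
    print(ripple_compare(0b0011, 0b0011, 4))   # (1, 1)
```

The remaining relations follow from these two signals: NE and LT are the complements of EQ and GE, GT is GE and not EQ, and LE is the complement of GT.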

Magnitude Comparator

■ Compute B – A and look at sign
■ B – A = B + ~A + 1
■ For unsigned numbers, carry out is sign bit

[Figure: 4-bit subtracter over B3 A3 … B0 A0 with outputs C, N, and Z (A = B)]

Comparators

■ Subtractor (A − B): $GE = c_{out}$, $EQ = P_{n-1:0}$
  – RCA: $A_{RCA} = 7n$, $T_{RCA} = 2n$
  – PPA: $A_{PPA} = O(n \log n)$, $T_{PPA} \approx 2 \log n$
■ Optimized comparator
  – Removing redundancies in subtractor (unused $s_i$)
  – Single-tree structure => speed up at no cost
  – $A = 6n$, $T_{LIN} = 2n$, $T_{TREE} = 2 \log n$

Comparators

■ Example: ripple comparator using comparator slices

Signed vs. Unsigned

■ For signed numbers, comparison is harder
  – C: carry out
  – Z: zero (all bits of A – B are 0)
  – N: negative (MSB of result)
  – V: overflow (inputs had different signs, output sign ≠ B)
  – S: N xor V (sign of result)
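The flags can be tried out in Python. The sketch below assumes the comparator computes B – A as B + ~A + 1 on n bits (as on the magnitude comparator slide) and takes operands as n-bit patterns in the range 0 to 2**n - 1; the helper names `compare_flags` and `as_signed` are hypothetical.

```python
from typing import Dict


def compare_flags(a: int, b: int, n: int) -> Dict[str, int]:
    """Flags of the n-bit subtraction B - A = B + ~A + 1."""
    mask = (1 << n) - 1
    total = b + ((~a) & mask) + 1
    r = total & mask                              # n-bit result
    c = total >> n                                # carry out
    z = int(r == 0)                               # all result bits are zero
    neg = r >> (n - 1)                            # MSB of the result
    sign_a, sign_b = a >> (n - 1), b >> (n - 1)
    v = int(sign_a != sign_b and neg != sign_b)   # overflow
    return {"C": c, "Z": z, "N": neg, "V": v, "S": neg ^ v}


def as_signed(x: int, n: int) -> int:
    """Interpret an n-bit pattern as a two's complement number."""
    return x - (1 << n) if x >> (n - 1) else x


if __name__ == "__main__":
    print(compare_flags(0b0111, 0b1000, 4))   # A = 7, B = -8 when read as signed
```

Under these assumptions, C = 1 means A <= B for unsigned operands, while S = 1 means B – A is negative, that is A > B, for signed operands.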

Decoder

■ Decodes binary number $A_{n-1:0}$ to vector $Z_{m-1:0}$ ($m = 2^n$)

$$z_i = \begin{cases} 1 & \text{if } A = i \\ 0 & \text{else} \end{cases}; \quad i = 0, \dots, m-1$$

■ Delay grows as $O(\log n)$.

Encoder

■ Encodes vector $A_{m-1:0}$ to binary number $Z_{n-1:0}$ ($m = 2^n$)

$$Z = i \ \text{ if } a_i = 1; \quad i = 0, \dots, m-1$$

■ Condition: if $a_i = 1$ then $a_k = 0$ for all $k \ne i$ (one-hot input), so that $Z = \log_2 A$.

Detection Operations

■ All-zeroes detection: $z = \overline{a_{n-1} + a_{n-2} + \cdots + a_0}$
■ All-ones detection: $z = a_{n-1}\, a_{n-2} \cdots a_0$
■ Delay: $T = O(\log n)$

Shift, Extension, Saturation

■ Applications
  – Adaptation of magnitude or word length of operands.
  – Multiplication/division by powers of 2
  – Logic bit/byte operations
  – Scaling of numbers for word length reduction
  – Reducing error after under-/overflow
■ Implementation of shift/extension/rotation by
  – Constant values: hard-wired
  – Variable values: multiplexers
  – n possible values: n-by-n barrel-shifter/rotator

Shifters

■ Logical Shift:
  – Shifts number left or right and fills with 0’s
    • 1011 LSR1 = 0101, 1011 LSL1 = 0110
■ Arithmetic Shift:
  – Shifts number left or right. Right shift sign extends
    • 1011 ASR1 = 1101, 1011 ASL1 = 0110
■ Rotate:
  – Shifts number left or right and fills with lost bits
    • 1011 ROR1 = 1101, 1011 ROL1 = 0111
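The shift examples on this slide can be reproduced with a few Python helpers. This is a behavioral sketch on n-bit patterns only, not a model of any particular shifter circuit; the function names are hypothetical.

```python
def lsr(x: int, k: int, n: int) -> int:
    """Logical shift right: vacated bits are filled with 0."""
    return (x & ((1 << n) - 1)) >> k


def lsl(x: int, k: int, n: int) -> int:
    """Logical (and arithmetic) shift left: vacated bits are filled with 0."""
    return (x << k) & ((1 << n) - 1)


def asr(x: int, k: int, n: int) -> int:
    """Arithmetic shift right: vacated bits are copies of the sign bit."""
    mask = (1 << n) - 1
    x &= mask
    result = x >> k
    if (x >> (n - 1)) & 1:
        result |= mask & ~(mask >> k)             # set the k upper bits
    return result


def ror(x: int, k: int, n: int) -> int:
    """Rotate right: bits shifted out on the right re-enter on the left."""
    mask = (1 << n) - 1
    x &= mask
    k %= n
    return ((x >> k) | (x << (n - k))) & mask


def rol(x: int, k: int, n: int) -> int:
    """Rotate left, expressed as a right rotation by n - k."""
    return ror(x, n - (k % n), n)


if __name__ == "__main__":
    for name, fn in (("LSR", lsr), ("LSL", lsl), ("ASR", asr), ("ROR", ror), ("ROL", rol)):
        print(name, format(fn(0b1011, 1, 4), "04b"))
```

Writing `rol` in terms of `ror` mirrors the way a rotator that only wraps in one direction can still provide both rotation directions.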

Funnel Shifter

■ A funnel shifter can do all six types of shifts
■ Selects N-bit field Y from 2N–1-bit input
  – Shift by k bits (0 ≤ k < N)
  – Logically involves N N:1 multiplexers

Funnel Source Generator

[Table: funnel shifter source for each shift type: Rotate Right, Logical Right, Arithmetic Right, Rotate Left, Logical/Arithmetic Left]

Array Funnel Shifter

■ N N-input multiplexers
  – Use 1-of-N hot select signals for shift amount
  – nMOS pass transistor design (V_t drops!)

[Figure: array funnel shifter with shift amount k[1:0] and left control feeding inverters & decoder, select lines s3…s0, inputs Z6…Z0, outputs Y3…Y0]

Logarithmic Funnel Shifter

■ Log N stages of 2-input muxes
  – No select decoding needed

32-bit Logarithmic Funnel

■ Wider multiplexers reduce delay and power
■ Operands > 32 bits introduce datapath irregularity

Barrel Shifter

■ Barrel shifters perform right rotations using wrap-around wires.
■ Left rotations are right rotations by $N - k = \bar{k} + 1$ bits, where $\bar{k}$ is the bitwise complement of the shift amount k.
■ Shifts are rotations with the end bits masked off.

4-by-4 Barrel Rotator

Logarithmic Barrel Shifter

[Figure panels: Right shift only; Right/Left shift; Right/Left Shift & Rotate]

32-bit Logarithmic Barrel

■ Datapath never wider than 32 bits
■ First stage preshifts by 1 to handle left shifts

Binary Shifter

Barrel Shifter

■ Area dominated by wiring

4x4 barrel shifter

Logarithmic Shifter

0-7 bit Logarithmic Shifter

Addition Flags

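Adder with Flags

■ Carry flag $C = c_n$: for free.
■ Negative flag $N = s_{n-1}$: for free.
■ Overflow flag $V = c_n \oplus c_{n-1}$: fast, since $c_n$ and $c_{n-1}$ are computed by the PPA => very cheap.
■ Zero flag $Z$:
  – $c_{in} = 1$ (subtract.): $Z = (A = B) = P_{n-1:0}$ of PPA
  – $c_{in} = 0/1$: $Z = \overline{s_{n-1} + s_{n-2} + \cdots + s_0}$ (r.s.a.), with $T_Z = T_{CPA} + \lceil \log n \rceil$
  – Faster without final sum, with $A = A_{CPA} + 3n$ and $T_Z = 4 + \lceil \log n \rceil$:

$$z_0 = \overline{(a_0 \oplus b_0) \oplus c_{in}}$$

$$z_i = \overline{(a_i \oplus b_i) \oplus (a_{i-1} + b_{i-1})}; \quad i = 1, \dots, n-1$$

$$Z = z_{n-1}\, z_{n-2} \cdots z_0 \quad \text{(r.s.a.)}$$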
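The "faster without final sum" idea can be checked exhaustively in Python. The sketch assumes the standard formulation: the sum is zero exactly when, at every bit position, the propagate bit a_i XOR b_i equals the carry that would arrive if all lower sum bits were zero, which is a_(i-1) OR b_(i-1) (or c_in at position 0). The name `zero_flag_fast` is hypothetical.

```python
def zero_flag_fast(a: int, b: int, c_in: int, n: int) -> int:
    """Return 1 if (a + b + c_in) mod 2**n == 0, without computing the sum."""
    z = 1
    for i in range(n):
        p = ((a >> i) ^ (b >> i)) & 1              # propagate bit a_i xor b_i
        if i == 0:
            expected_carry = c_in
        else:
            expected_carry = ((a >> (i - 1)) | (b >> (i - 1))) & 1
        z &= 1 - (p ^ expected_carry)              # z_i = XNOR(p_i, expected carry)
    return z


if __name__ == "__main__":
    print(zero_flag_fast(0b0011, 0b1101, 0, 4))    # 3 + 13 = 16, which is 0 mod 16
    print(zero_flag_fast(0b0011, 0b1100, 0, 4))    # 3 + 12 = 15, nonzero
```

Every z_i depends only on two adjacent bit positions, so all of them can be formed in parallel and combined with an n-input AND, independently of the carry-propagate adder.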

Condition Flags

■ Signed and unsigned addition/subtraction differ only with respect to condition flags

ALU

■ Arithmetic Logic Unit (ALU)

ALU Operations

Multiplication

■ Example:

        1100 : 12 (base 10)   multiplicand
        0101 :  5 (base 10)   multiplier
        ----
        1100                  partial products
       0000
      1100
     0000
    --------
    00111100 : 60 (base 10)   product

■ M x N-bit multiplication
  – Produce N M-bit partial products
  – Sum these to produce M+N-bit product
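The partial-product view of multiplication translates directly into Python. This is a minimal sketch for unsigned operands; `partial_products` and `multiply` are hypothetical names.

```python
from typing import List


def partial_products(y: int, x: int, n_bits: int) -> List[int]:
    """One partial product per multiplier bit: the multiplicand y shifted by i, or 0."""
    return [(y << i) if (x >> i) & 1 else 0 for i in range(n_bits)]


def multiply(y: int, x: int, n_bits: int) -> int:
    """Sum the partial products of multiplicand y and n_bits-wide multiplier x."""
    return sum(partial_products(y, x, n_bits))


if __name__ == "__main__":
    pps = partial_products(0b1100, 0b0101, 4)
    print([format(p, "08b") for p in pps])
    print(format(multiply(0b1100, 0b0101, 4), "08b"))   # 00111100
```

How the partial products are summed (sequentially, in an array, or in a tree) is what distinguishes the multiplier architectures discussed in the rest of the lecture.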

General Form

■ Multiplicand: $Y = (y_{M-1}, y_{M-2}, \dots, y_1, y_0)$
■ Multiplier: $X = (x_{N-1}, x_{N-2}, \dots, x_1, x_0)$
■ Product:

$$P = \left(\sum_{j=0}^{M-1} y_j 2^j\right)\left(\sum_{i=0}^{N-1} x_i 2^i\right) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} x_i y_j 2^{i+j}$$

[Figure: 6 x 6 partial-product array with multiplicand y5…y0, multiplier x5…x0, partial products x_i y_j, and product p11…p0]

Binary Multiplication

Dot Diagram

■ Each dot represents a bit

[Figure: dot diagram of the partial products for multiplier bits x0…x15]

Array Multiplier

Carry Save Array Multiplier

Array Multiplier

[Figure: 4 x 4 array multiplier with inputs y3…y0 and x0…x3, a CSA array followed by a CPA, outputs p7…p0, and the critical path marked]

Rectangular Array

■ Squash array to fit rectangular floorplan

[Figure: rectangular array with inputs y3…y0 and x0…x3 and outputs p0…p7]

Multiplier Floorplan

Sequential Multipliers

■ Partial products generated and added sequentially using an accumulator.

$$A = O(n), \quad T = O(n \log n)$$

Array Multipliers

■ Partial products generated and added simultaneously in linear array using array adder.

$$A = O(n^2), \quad T = O(n)$$

Multiplication Algorithm

■ Generation of partial products
■ Adding up partial products
  – Sequentially (sequential shift and add)
  – Serially (combinational shift and add)
  – In parallel
■ Speed-up techniques
  – Reduce the number of partial products
  – Accelerate addition of partial products

Parallel Multipliers

■ Partial products generated in parallel and added subsequently in multi-operand adder (using tree adder)

$$A = O(n^2), \quad T = O(\log n)$$

Signed Multipliers

■ What about signed multiplication?
  – Complement operands before and result after multiplication => unsigned multiplication
  – Direct implementation (dedicated signed multipliers).
■ An unsigned array multiplier using CSAs and a final CPA is sometimes called a Braun multiplier.
■ For a CPA of type RCA, the unit-gate model yields $A = 8n^2 - 11n$ and $T = 6n - 9$.

Modified Braun Multiplier

■ For multiplying two’s complement numbers
■ Sometimes called Pezaris multiplier
■ Subtract bits with negative weight => special FAs
  – 1 neg. bit: $a + b - c_{in} = 2c_{out} - s$
  – 2 neg. bits: $a - b - c_{in} = -2c_{out} + s$
■ Otherwise, exactly the same structure and complexity as the Braun multiplier => efficient and flexible

$$A = -a_7 2^7 + \sum_{i=0}^{6} a_i 2^i, \quad B = -b_7 2^7 + \sum_{i=0}^{6} b_i 2^i$$

Modified Braun Multiplier

■ A Type 1 adder has one negative input; its sum output is negative.
■ A Type 2 adder has two negative inputs; its carry output is negative.
■ An adder with three negative inputs and two negative outputs (Type 3 adder) can also be designed, but it is never used.
■ Type 0 and Type 3 adders are identical.
■ Type 1 and Type 2 adders are identical. With $c_{in}$ as the negatively weighted input of a Type 1 adder:

$$s = a \oplus b \oplus c_{in}, \quad c_{out} = ab + a\,\bar{c}_{in} + b\,\bar{c}_{in}$$



Baugh-Wooley Multiplier

$$A = -a_4 2^4 + \sum_{i=0}^{3} a_i 2^i, \quad B = -b_4 2^4 + \sum_{i=0}^{3} b_i 2^i$$

$$P = AB = a_4 b_4 2^8 + \underbrace{\sum_{i=0}^{3}\sum_{j=0}^{3} a_i b_j 2^{i+j}}_{\text{ordinary multiplication}} \underbrace{-\, 2^4 \sum_{j=0}^{3} a_4 b_j 2^j - 2^4 \sum_{i=0}^{3} a_i b_4 2^i}_{\text{extra terms}}$$

The negatively weighted extra terms are rewritten with complemented partial products, using $-\sum_{j=0}^{3} x_j 2^j = \sum_{j=0}^{3} \bar{x}_j 2^j - 2^4 + 1$:

$$P = a_4 b_4 2^8 + \sum_{i=0}^{3}\sum_{j=0}^{3} a_i b_j 2^{i+j} + 2^4 \sum_{j=0}^{3} \overline{a_4 b_j}\, 2^j + 2^4 \sum_{i=0}^{3} \overline{a_i b_4}\, 2^i - 2^9 + 2^5$$

Baugh-Wooley Multiplier

[Figure: partial-product array over bit positions $2^9 \dots 2^0$]

Fewer Partial Products

■ Array multiplier requires N partial products
■ If we looked at groups of r bits, we could form N/r partial products.
  – Faster and smaller?
  – Called radix-$2^r$ encoding
■ Ex: r = 2: look at pairs of bits
  – Form partial products of 0, Y, 2Y, 3Y
  – First three are easy, but 3Y requires adder

Booth Encoding

■ Let us first try out base 4 encoding:

$$B = \sum_{i=0}^{n-1} b_i 2^i = \sum_{i=0}^{n/2-1} c_i 4^i$$

■ The $c_i$ have to be 0, 1, 2, or 3. However, 3 is problematic.
■ Split the sum into even and odd bit positions:

$$B = \underbrace{\sum_{i=0}^{n/2-1} b_{2i}\, 2^{2i}}_{\text{for even bits}} + \underbrace{\sum_{i=0}^{n/2-1} b_{2i+1}\, 2^{2i+1}}_{\text{for odd bits}}$$

Booth Encoding

■ Numerical example, rewriting each odd-position weight as $2^{2i+1} = 2^{2i+2} - 2^{2i+1}$:

$$(18)_{10} = (10010)_2 = 1 \cdot 2^4 + 1 \cdot 2^1 = 1 \cdot 2^4 + 1 \cdot (2^2 - 2^1)$$

■ Reordering terms,

$$B = \sum_{i} c_i 4^i, \quad c_i \in \{-2, -1, 0, 1, 2\}$$

Booth Encoding

■ The $c_i$ can be written as $c_i = b_{2i-1} + b_{2i} - 2b_{2i+1}$:

$$c_0 = b_{-1} + b_0 - 2b_1, \quad c_1 = b_1 + b_2 - 2b_3, \quad c_2 = b_3 + b_4 - 2b_5$$

$$c_3 = b_5 + b_6 - 2b_7, \quad c_4 = b_7 + b_8 - 2b_9$$

■ Take $b_{-1} = 0$ and $b_9 = 0$. For an 8-bit unsigned number, take $b_8 = 0$ as well.

Booth Encoding

■ Take 18 as a numerical example again: $(18)_{10} = (010010)_2$

$$c_0 = 0 + 0 - 2 \cdot 1 = -2, \quad c_1 = 1 + 0 - 2 \cdot 0 = 1, \quad c_2 = 0 + 1 - 2 \cdot 0 = 1$$

$$(18)_{10} = -2 \cdot 4^0 + 1 \cdot 4^1 + 1 \cdot 4^2$$

■ For two’s complement signed numbers, extension to the left side should not be used.

$$(-18)_{10} = 101101 + 1 = 101110 \;\Rightarrow\; -2 \cdot 4^0 + 0 \cdot 4^1 - 1 \cdot 4^2 = -18$$

Booth Encoding

■ Note that Booth notation is redundant, e.g. $2 = 2 \cdot 4^0 = 1 \cdot 4^1 - 2 \cdot 4^0$.
■ However, the method shown above always yields the same representation for the same binary number.
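Radix-4 Booth recoding is easy to experiment with in Python. The sketch below assumes the common digit rule c_i = b_(2i-1) + b_(2i) - 2 b_(2i+1) with b_(-1) = 0, applied to a two's complement bit pattern of even width; the names `booth_radix4_digits` and `booth_value` are hypothetical.

```python
from typing import List


def booth_radix4_digits(b: int, n_bits: int) -> List[int]:
    """Recode an n_bits-wide two's complement pattern into digits in {-2,...,2}."""
    if n_bits % 2:
        raise ValueError("this sketch expects an even number of bits")

    def bit(k: int) -> int:
        return 0 if k < 0 else (b >> k) & 1       # b_{-1} is taken as 0

    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(n_bits // 2)]


def booth_value(digits: List[int]) -> int:
    """Value represented by radix-4 digits, least significant digit first."""
    return sum(c * 4 ** i for i, c in enumerate(digits))


if __name__ == "__main__":
    print(booth_radix4_digits(0b010010, 6))   # [-2, 1, 1]   (+18)
    print(booth_radix4_digits(0b101110, 6))   # [-2, 0, -1]  (-18)
```

Each digit selects one partial product from 0, ±Y, and ±2Y, so an n-bit multiplier needs only n/2 partial products and none of them requires computing 3Y.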

Booth Encoding

■ Instead of 3Y, try –Y, then increment next partial product to add 4Y
■ Similarly, for 2Y, try –2Y + 4Y in next partial product

Booth Hardware

■ Booth encoder generates control lines for each PP
  – Booth selectors choose PP bits

Booth Multipliers

■ Applicable to sequential, array, and parallel multipliers.
■ Additional recoding logic and more complex partial product generation (+8n in terms of area, plus added delay)
■ Adder array/tree cut in half.
  • Considerably smaller (array and tree)
  • Twice as fast for adder arrays
  • Slightly faster for adder trees.

Booth Multipliers

■ Negative partial products require sign extension.
■ Suited for signed multiplication.
■ Radix 8 (3-bit recoding) possible.
  – Reduces the number of partial products by a factor of 3.
  – Pre-computing 3B, … is difficult.
  – Sometimes used.

Sign Extension

■ Partial products can be negative
  – Require sign extension, which is cumbersome
  – High fanout on most significant bit

[Figure: partial products PP0…PP8 for multiplier bits x-1…x17, each extended to the left with copies of its sign bit s]

Simplified Sign Ext.

■ Sign bits are either all 0’s or all 1’s
  – Note that all 0’s is all 1’s + 1 in proper column
  – Use this to reduce loading on MSB

[Figure: partial products PP0…PP8 with each sign-extension run replaced by 1’s and a single s bit]

Even Simpler Sign Ext.

■ No need to add all the 1’s in hardware
  – Precompute the answer!

[Figure: partial products PP0…PP8 with precomputed constant bits and a few s bits per row]

Advanced Multiplication

■ Signed vs. unsigned inputs
■ Higher radix Booth encoding
■ Array vs. tree CSA networks

Tree Addition

■ Wallace Trees.
■ Very irregular tree.
  – Irregular wiring and/or layout
  – Non-uniform bit-arrival times at the final adder.

Wallace Tree Multiplier

Dot Diagram for Array Multiplier

Dot Diagram for Tree Multiplier

4:2 Tree Multiplier

4:2 Compressor

Carry-Save Adder

4:2 Compressors

16x16 Booth Encoded Multipliers

TDM Multipliers

Vertical Compressor Slice

CPA Prefix Network

■ Nonuniform arrival times

Multiplier Implementations

■ Sequential Multipliers
  – Low performance, small area, resource sharing
■ Braun or Baugh-Wooley Multiplier (array multiplier)
  – Medium performance, high area, high regularity
  – Layout generators => data paths and macro cells
  – Simple pipelining, faster CPA => higher speed
■ Booth-Wallace Multiplier
  – High performance, high area, low regularity
  – Custom multipliers, netlist generators
  – Often pipelined (between CSA and CPA)

Composition from Smaller Multipliers

■ A (2n x 2n)-bit multiplier can be composed from 4 (n x n)-bit multipliers (can be repeated recursively).

$$A \cdot B = (A_H 2^n + A_L)(B_H 2^n + B_L) = A_H B_H 2^{2n} + (A_H B_L + A_L B_H)\, 2^n + A_L B_L$$

■ This requires 4 (n x n)-bit multipliers plus a CSA and a CPA.
■ Less efficient in terms of area and speed.
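The composition of a wide multiplier from four narrower ones can be sketched as follows. The sketch assumes unsigned operands split into high and low n-bit halves, and uses Python's built-in multiplication as a stand-in for each (n x n)-bit multiplier; `compose_multiply` is a hypothetical name.

```python
def compose_multiply(a: int, b: int, n: int) -> int:
    """Multiply two unsigned 2n-bit numbers using four n x n-bit products."""
    mask = (1 << n) - 1
    a_h, a_l = a >> n, a & mask
    b_h, b_l = b >> n, b & mask
    # Each product below stands in for one (n x n)-bit hardware multiplier.
    hh = a_h * b_h
    hl = a_h * b_l
    lh = a_l * b_h
    ll = a_l * b_l
    # Weighted sum of the four partial results.
    return (hh << (2 * n)) + ((hl + lh) << n) + ll


if __name__ == "__main__":
    print(compose_multiply(0xAB, 0xCD, 4) == 0xAB * 0xCD)   # True
```

In hardware, the weighted sum on the last line is where the CSA and the final CPA mentioned on the slide would be used.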

Squaring

■ Squaring is actually multiplication: $P = A \cdot A = A^2$
■ Multiplier optimizations possible, since $a_i a_i = a_i$ and $a_i a_j 2^{i+j} + a_j a_i 2^{i+j} = a_i a_j 2^{i+j+1}$:

$$A^2 = \sum_{i=0}^{n-1} a_i 2^{2i} + \sum_{i=1}^{n-1} \sum_{j=0}^{i-1} a_i a_j 2^{i+j+1}$$

■ Fewer partial products => optimized squarer better

Division

■ Division basics

$$\frac{A}{B} = Q + \frac{R}{B}; \quad A = Q \cdot B + R$$

■ Conditions on values: $A \in [0, 2^{2n} - 1]$, $B \in [0, 2^n - 1]$, $B \ne 0$
■ Algorithms
  – Subtract and shift
  – Sequential, recursive, non-associative

Division

■ Basic algorithm
  – Compare and conditionally subtract
  – Expensive comparison and CPA
■ Restoring division
  – Subtract and conditionally restore
  – Expensive CPA and restoring
■ Non-restoring division
  – Detect sign, subtract/add, and correct by next steps.
  – Expensive CPA
■ SRT division
  – Estimate range, subtract/add (CSA), correct by next steps.
  – Inexpensive CSA

Restoring Division

■ Put a in register A and b in register B, put 0 in register P, and perform n divide steps (n is the quotient wordlength).
■ Each step consists of
  – Shift the register pair (P, A) one bit left.
  – Subtract the contents of B from P, put the result back in P.
  – If the result is negative, set the low-order bit of A to 0, otherwise to 1.
  – If the result is negative, restore the old value of P by adding the contents of B back into P.
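The divide steps can be traced with a short Python sketch. The register names P (partial remainder), A (dividend, later the quotient), and B (divisor) are assumptions of this sketch, as is the function name `restoring_divide`; operands are unsigned with a below 2**n.

```python
from typing import Tuple


def restoring_divide(a: int, b: int, n: int) -> Tuple[int, int]:
    """Return (quotient, remainder) of the n-bit unsigned division a / b."""
    if b == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    mask = (1 << n) - 1
    p, reg_a = 0, a & mask
    for _ in range(n):
        # Shift the register pair (P, A) one bit left.
        p = (p << 1) | ((reg_a >> (n - 1)) & 1)
        reg_a = (reg_a << 1) & mask
        # Trial subtraction of the divisor.
        p -= b
        if p < 0:
            p += b            # restore; the new quotient bit stays 0
        else:
            reg_a |= 1        # subtraction succeeded; quotient bit is 1
    return reg_a, p


if __name__ == "__main__":
    print(restoring_divide(14, 3, 4))   # (4, 2)
```

Each iteration yields one quotient bit, and the restore step costs an extra addition whenever the trial subtraction goes negative.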


Non-restoring Division

■ A variant that skips the restoring step and instead works with negative residuals.
■ If P is negative,
  – Shift the register pair (P, A) one bit left.
  – Add the contents of register B to P.
■ Otherwise,
  – Shift the register pair (P, A) one bit left.
  – Subtract the contents of register B from P.
■ If P is negative, set the low-order bit of A to 0, otherwise set it to 1.
■ After n cycles,
  – The quotient is in A.
  – If P is positive, it is the remainder; otherwise it has to be restored (add B to it).
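A Python sketch of the non-restoring scheme follows. It assumes the usual formulation in which a negative partial remainder is corrected by adding the divisor in the next step instead of being restored immediately; the register names P, A, B and the function name `nonrestoring_divide` are assumptions of this sketch.

```python
from typing import Tuple


def nonrestoring_divide(a: int, b: int, n: int) -> Tuple[int, int]:
    """Return (quotient, remainder) of the n-bit unsigned division a / b."""
    if b == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    mask = (1 << n) - 1
    p, reg_a = 0, a & mask
    for _ in range(n):
        msb = (reg_a >> (n - 1)) & 1
        reg_a = (reg_a << 1) & mask
        if p < 0:
            p = 2 * p + msb + b      # shift (P, A) left, then add B
        else:
            p = 2 * p + msb - b      # shift (P, A) left, then subtract B
        if p >= 0:
            reg_a |= 1               # quotient bit 1; otherwise it stays 0
    if p < 0:
        p += b                       # single final correction of the remainder
    return reg_a, p


if __name__ == "__main__":
    print(nonrestoring_divide(14, 3, 4))   # (4, 2)
```

Compared with restoring division, every step performs exactly one addition or subtraction, at the cost of one possible correction at the very end.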

Non-restoring Division

$$A = (n+1)\, A_{CPA} = O(n^2) \text{ or } O(n^2 \log n)$$

$$T = (n+1)\, T_{CPA} = O(n^2) \text{ or } O(n \log n)$$

Signed Division

■ Example: signed non-restoring array divider (B > 0, final correction step omitted).

$$A = 9n^2, \quad T = 2n^2 + 4n$$

SRT Division

■ Sweeney, Robertson, Tocher

$$q_i = \begin{cases} 1 & \text{if } B \le R_{i+1} \\ 0 & \text{if } -B \le R_{i+1} < B \\ -1 & \text{if } R_{i+1} < -B \end{cases}$$

■ If B is normalized, i.e. $2^{n-1} \le B < 2^n$, the comparison with B can be replaced by a comparison with constants:

$$q'_i = \begin{cases} 1 & \text{if } 2^{n-1} \le R_{i+1} \\ 0 & \text{if } -2^{n-1} \le R_{i+1} < 2^{n-1} \\ -1 & \text{if } R_{i+1} < -2^{n-1} \end{cases}$$

SRT Division

■ Only 3 MSB are compared
  – $q'_i$ are estimated
  – CSA instead of CPA can be used
■ Correction in the following steps + final correction step.
■ Redundant representation of $q'_i$ (SD representation), final conversion necessary (CPA).
■ Highly regular and fast O(n) SRT array dividers
  – Only slightly slower/larger than array multipliers

SRT Division

$$A = O(n\, A_{CSA} + A_{CPA}), \quad T = O(n\, T_{CSA} + T_{CPA})$$

SRT Division

■ Pre-normalization of divisor d and dividend x: ½ ≤ d ≤ 1.

SRT Division

■ The quotient digit set plays a crucial role in the complexity of implementation.
■ Restoring algorithm: $0 \le q_i \le r - 1$
■ Non-restoring algorithm: $q_i \in \{-1, 1\}$
■ SRT: quotient digit selection function

$$q_{i+1} = \begin{cases} 1 & \text{if } \tfrac{1}{2} \le 2w_i \\ 0 & \text{if } -\tfrac{1}{2} \le 2w_i < \tfrac{1}{2} \\ -1 & \text{if } 2w_i < -\tfrac{1}{2} \end{cases}$$

■ SRT division is very fast in the case of consecutive zeros in q.

High Radix Division

■ Radix $b = 2^m$, $q_i \in \{-(b-1), \dots, -1, 0, 1, \dots, b-1\}$
■ m quotient bits per step => fewer, but more complex steps.
■ Suitable for SRT algorithm => faster
■ Complex comparisons and decisions
■ Table look-up (Pentium bug)

Pentium Bug

    March 1993: Intel introduces the Pentium June 1994: Prof. Thomas Nicely, Lynchburg College, reports errors in calculating twin primes reciprocals.

October 1994: After considerable background discussion, word starts circulating on the Internet. Others confirm error and find more instances.

November 1994: Tim Coe, of Vitess Semiconductor, proposes a [substantially correct] software model explain the cause.An Intel internal report analyzes a flaw in the Pentium FDIV instruction.


Pentium Bug

Intel CEO Andy Grove responds (Nov. 24, 1994):

– Minor bug known at Intel since mid-94.

– All micros have bugs.

– “Average user” will never see the problem (MTBE: 27,000 years).

– Most applications do fewer than 1,000 divisions a day (?!).

– FDIV error rate is about $1.5 \times 10^{-9}$.

– Error conditions guarantee small errors.


Pentium Bug

Response (continued)

– Many applications (e.g. graphics) can tolerate occasional small errors.

– Offers replacement for justified need.

Popular press generally accepts Intel’s claims about “obscure error.”

Intel confirms 2 million defective chips have been shipped.


Pentium Bug

December 1994: IBM disagrees (MTBE: 24 days); stops shipment of Pentium-based PCs.

– Even casual spreadsheet users may do about $4.2 \times 10^6$ divides per day.

– The error distribution is not uniform.

– Under some reasonable conditions the FDIV error rate can approach $10^{-2}$.

Some question IBM’s motives.

A flurry of Internet communication condemns Intel’s attitude and questions its evaluation of the problem.


Pentium Bug

Intel revises replacement policy. Hard to interpret policy but easy to accomplish in practice.

2% of home users and 10% of businesses eventually get replacements.

Intel (Andy Grove) admits mishandling the problem, but stands by its evaluation.

Public perception is that Intel was responsive ⇒ positive publicity.

March 1995: Coe, et al. article appears in IEEE Journ. Computational Sci. and Eng.

May 1995: Lamport article appears at TAPSOFT.


Pentium Bug

Kahan posts should-have-known SRT test article.

1996: Intel establishes the world’s largest verification division, dominating industrial research through 20??.

The Pentium affair reportedly cost $450 million; $15/$16 billion market in 1996.

Intel Marketing Rep: “. . . wrote it off to advertising.”

1997–2000: All major microprocessor manufacturers adopt formal verification.

Surge in CAD industry tool offerings.


Pentium Bug

Significant research results appear in floating point verification.

2000–2002: Articles, conference panel sessions on verification “culture.”

IC technology roadmap: looming “design crisis.”

Nice discussion of SRT and the bug in http://www.eng.utah.edu/~cs5830/handouts/lec-SRT.pdf


Pentium Humor

Q: How many Pentium designers does it take to screw in a light bulb?

– A: 1.99904274017, but that’s close enough for non-technical people.

Q: What’s another name for the “Intel Inside” sticker they put on Pentiums?

– A: The warning label.

Q: What do you call a series of FDIV instructions on a Pentium?

– A: Successive approximations.


Pentium Humor

Q: Why didn’t Intel call the Pentium the 586?

– A: Because they added 486 and 100 on the first Pentium and got 585.999983605.

Q: According to Intel, the Pentium conforms to the IEEE standards 754 and 854 for floating point arithmetic. If you fly in aircraft designed using a Pentium, what is the correct pronunciation of “IEEE”?

– A: Aaaaaaaiiiiiiiiieeeeeeeeeeeee!


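Division by Multiplication

Division by convergence:

$$Q = \frac{A}{B} = \frac{A \cdot R_0 R_1 \cdots R_{m-1}}{B \cdot R_0 R_1 \cdots R_{m-1}}, \qquad B \cdot R_0 R_1 \cdots R_{m-1} \rightarrow 1 \;\Rightarrow\; A \cdot R_0 R_1 \cdots R_{m-1} \rightarrow Q$$

With $B_i = 1 - y$, choose $R_i = 1 + y = 2 - B_i$, so that $B_{i+1} = B_i \cdot R_i = 1 - y^2$. The iteration ends once $1 - B_i < 2^{-n}$, i.e. $B_i \approx 1$.

Algorithm:

$$B_{i+1} = B_i \cdot R_i, \quad A_{i+1} = A_i \cdot R_i, \quad R_i = 2 - B_i; \quad i = 0, \ldots, m-1$$

$$A_0 = A, \quad B_0 = B, \quad Q = A_m$$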


Division by Multiplication

Quadratic convergence: $L = \lceil \log n \rceil$

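Division by convergence can be sketched in a few lines. The example below is an illustration only: it uses floating point rather than n-bit fixed-point words, it assumes a divisor normalized to ½ ≤ B < 1, and the function name `convergence_divide` is hypothetical. Since $1 - B_i$ is squared in every step, six steps take an initial $y \le \tfrac{1}{2}$ below $2^{-64}$.

```python
def convergence_divide(a, b, iterations=6):
    """Division by convergence: multiply A and B by R_i = 2 - B_i until B -> 1.

    Returns (A_m, B_m); A_m approximates a / b and B_m should be close to 1.
    """
    if not 0.5 <= b < 1.0:
        raise ValueError("sketch assumes a normalized divisor, 0.5 <= b < 1")
    for _ in range(iterations):
        r = 2.0 - b  # R_i = 2 - B_i
        a *= r       # A_{i+1} = A_i * R_i
        b *= r       # B_{i+1} = B_i * R_i = 1 - y**2
    return a, b


if __name__ == "__main__":
    q, b_final = convergence_divide(0.6, 0.75)
    print(q, b_final)
```

The two multiplications per step are independent of each other, so they can share a pipelined multiplier.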

Division by Reciprocation

Use the reciprocal:

$$Q = \frac{A}{B} = A \cdot \frac{1}{B}$$

How to find the reciprocal? Find $\frac{1}{B}$ by recursion:

$$X_{i+1} = X_i - \frac{f(X_i)}{f'(X_i)}, \qquad f(X) = \frac{1}{X} - B, \quad f'(X) = -\frac{1}{X^2}, \quad f\!\left(\frac{1}{B}\right) = 0$$


Division by Reciprocation

Algorithm:

$$X_{i+1} = X_i \cdot (2 - B \cdot X_i); \quad i = 0, \ldots, m-1$$

$$X_0 = B, \quad Q = A \cdot X_m$$

Quadratic convergence: $L = O(\log n)$

Speed-up: first approximation of $X_0$ from table.

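The reciprocation recurrence is the Newton-Raphson step applied to $f(X) = 1/X - B$, and it can be sketched as follows. This is a hedged illustration in floating point, assuming a normalized divisor ½ ≤ B < 1 and the starting value $X_0 = B$; the function name `reciprocal_divide` and the iteration count are assumptions for the example.

```python
def reciprocal_divide(a, b, iterations=10):
    """Compute a / b as a * (1 / b), with 1 / b found by X <- X * (2 - B * X)."""
    if not 0.5 <= b < 1.0:
        raise ValueError("sketch assumes a normalized divisor, 0.5 <= b < 1")
    x = b  # X_0 = B; a table look-up would give a better start and fewer steps
    for _ in range(iterations):
        x = x * (2.0 - b * x)  # the error 1 - B * X is squared in each step
    return a * x


if __name__ == "__main__":
    print(reciprocal_divide(0.6, 0.75))
```

With $X_0 = B = 0.5$ the initial error $1 - B X_0$ is 0.75, which is why this sketch uses more steps than a table-seeded implementation would need.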

Divider Implementations

Iterative dividers (through multiplication):
– Resource sharing of existing components (multiplier)
– Medium performance, medium area
– High efficiency if components are shared

Sequential dividers (restoring, non-restoring, SRT):
– Resource sharing of existing components (e.g. adder)
– Low performance, low area


Divider Implementations

Array dividers:
– Dedicated hardware component
– High performance, high area
– High regularity -> layout generators, pipelining
– Square root extraction possible by minor changes
– Combination with multiplication and/or square root

No parallel dividers exist as compared to parallel multipliers.
– Sequential nature of division.


Square Root

Algorithm:

$$\sqrt{A - R} = Q, \qquad A = Q^2 + R$$

$$A \in [0, 2^{2n} - 1], \qquad Q = \sum_{i=0}^{n-1} q_i 2^i \in [0, 2^n - 1], \qquad Q_i = (q_{n-1}, \ldots, q_i, 0, \ldots, 0)$$

$$Q_i^2 = (Q_{i+1} + q_i 2^i)^2 = Q_{i+1}^2 + q_i 2^i (2 Q_{i+1} + q_i 2^i)$$

$$R_i = R_{i+1} - q_i 2^i (2 Q_{i+1} + q_i 2^i), \qquad Q_i = Q_{i+1} + q_i 2^i; \qquad i = n-1, \ldots, 0$$

$$R_n = A, \quad Q_n = 0, \quad R = R_0, \quad Q = Q_0$$

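The square root recurrence can be sketched for integers as follows. The example assumes the restoring variant with binary digits $q_i \in \{0, 1\}$, where $q_i = 1$ is accepted only if the remainder stays non-negative; the function name `restoring_isqrt` is hypothetical.

```python
def restoring_isqrt(a, n):
    """Digit-by-digit square root of a 2n-bit integer a.

    Returns (Q, R) with a = Q**2 + R and 0 <= R <= 2*Q.
    """
    if not 0 <= a < (1 << (2 * n)):
        raise ValueError("a must fit in 2n bits")
    r, q = a, 0  # R_n = A, Q_n = 0
    for i in range(n - 1, -1, -1):
        # Amount to subtract if q_i = 1: 2**i * (2*Q_{i+1} + 2**i)
        trial = (1 << i) * (2 * q + (1 << i))
        if trial <= r:  # q_i = 1 keeps the remainder non-negative
            r -= trial
            q += 1 << i
        # otherwise q_i = 0 and R_i = R_{i+1} (nothing to restore here)
    return q, r


if __name__ == "__main__":
    print(restoring_isqrt(200, 4))  # 200 = 14**2 + 4
```

Each step is one trial subtraction followed by a digit decision, which is the structural similarity to restoring division.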

Square Root

Implementation:
– Similar to division -> same algorithms applicable
– Restoring, non-restoring, SRT, high radix

Combination with division in same component possible

Only triangular array required: $A = A_{DIV}/2$, $T = T_{DIV}$


Elementary Functions

Exponential function: $e^x$, exp($x$)

Logarithm function: ln $x$, log $x$

Trigonometric functions: sin $x$, cos $x$, tan $x$

Inverse trig. functions: arcsin $x$, arccos $x$, arctan $x$

Hyperbolic functions: sinh $x$, cosh $x$, tanh $x$


Algorithms

Table lookup
– Inefficient for large word lengths

Taylor series expansion
– Complex implementation

Polynomial and rational approximations

Shift and add algorithms

Convergence algorithms
– Similar to division by convergence
– Two (or more) recursive formulas: one formula converges to a constant, the other to the result.


Algorithms

Coordinate rotation (CORDIC)
– 3 equations for x-, y-coordinate, and angle
– Computes all elementary functions by proper input settings and choice of modes and outputs
– Simple, universal hardware, small look-up table.

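The three CORDIC equations can be sketched for one particular setting: rotation mode in circular coordinates, which yields sine and cosine. This is an illustration under assumptions: it uses floating point where hardware would use shifts and adds on fixed-point words, the input angle must lie within the convergence range of roughly ±1.74 rad, and the name `cordic_sin_cos` is hypothetical.

```python
import math


def cordic_sin_cos(theta, iterations=32):
    """Rotation-mode CORDIC: returns (sin(theta), cos(theta)) for |theta| < ~1.74."""
    # Small look-up table of elementary rotation angles atan(2**-i).
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; pre-scale x by 1 / gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0  # rotate so the residual angle z goes to 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x


if __name__ == "__main__":
    print(cordic_sin_cos(0.5), (math.sin(0.5), math.cos(0.5)))
```

Other functions follow from different start values, a different mode (driving y instead of z to zero), or hyperbolic instead of circular coordinates.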

Design Levels

Transistor level design
– Circuit and layout designed by hand (full custom)
– Low design efficiency
– High circuit performance
– High flexibility: choice of architecture and logic style
– Transistor level circuit optimizations
• Logic style (static/dynamic logic, complementary CMOS/pass-transistor logic)
• Special arithmetic circuits better than with gates.


Design Levels

Gate level design
– Cell based design techniques: standard cells, gate-array/sea-of-gates, field programmable gate array (FPGA)
– Circuit implemented by hand or synthesis (library)
– Layout implemented by automated place and route
– Medium to high design efficiency
– Medium to low circuit performance
– Medium to low flexibility: full choice of architecture.


Design Levels

Block level design
– Layout blocks and netlists from parameterized automatic generators or compilers
– High design efficiency
– Medium to high circuit performance
– Low flexibility (limited choice of architectures)


Design Levels

Block level design
– Implementations:
• Data-path: bit-sliced, bus oriented layout, implementation of entire data paths, medium performance, medium diversity
• Macro-cells: tiled layout, fixed/single operation components, high performance, small diversity
• Portable netlists: gate level design


Synthesis

High-level synthesis
– Synthesis from abstract, behavioral hardware description (e.g., data dependency graphs) using e.g. VHDL
– Involves architectural synthesis and arithmetic transformations
– High-level synthesis still not fully mature


Synthesis

Low-level synthesis
– Layout and netlist generators
– Included in libraries and synthesis tools
– Low-level synthesis is state of the art
– Basis for efficient ASIC design
– Limited diversity and flexibility of library components


Synthesis

Circuit optimization
– Efficient optimization of random logic is state of the art.
– Optimization of entire arithmetic circuits is not feasible
• Only local optimizations possible
– Logic optimization cannot replace the synthesis of efficient arithmetic circuit structures using generators.


Low Power

High glitching activity due to high bit dependencies and large logic depth

Reduce the switched capacitance by choosing an area efficient circuit architecture

Allow for lower supply voltage by speeding up the circuitry


Low Power

Reduce the transition activity
– Apply stable inputs when circuit not in use
• Disable circuits
– Reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difficult to realize)
– Reduce glitching transitions by reducing logic depth
– Take advantage of correlated data streams
– Choose appropriate number representations (e.g. Gray codes for counters)

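The effect of the number representation on transition activity can be checked with a short count. The sketch below compares the total number of output bit flips of an n-bit counter over one full cycle (including the wrap-around) for plain binary and for reflected Gray code; the helper names are hypothetical and glitches inside the counter logic are not modeled.

```python
def bit_toggles(codes):
    """Total number of bit flips over one cyclic pass through the code sequence."""
    successors = codes[1:] + codes[:1]
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, successors))


def compare_counter_codes(n):
    """Return (binary_toggles, gray_toggles) for an n-bit counter cycle."""
    binary = list(range(1 << n))
    gray = [k ^ (k >> 1) for k in binary]  # reflected binary Gray code
    return bit_toggles(binary), bit_toggles(gray)


if __name__ == "__main__":
    print(compare_counter_codes(4))
```

For n = 4 the binary counter flips 30 output bits per cycle while the Gray counter flips 16, one per count.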

Testability

Testability goal: high fault coverage with few test vectors that are easy to generate/apply.

Random test vectors: easy to generate and apply/propagate; few vectors give high (but not perfect) fault coverage for most arithmetic circuits.

Special test vectors: sometimes hard to generate and apply; required for coverage of hard-to-detect faults, which are inherent in most arithmetic circuits.


Testability

Hard-to-detect faults are found in:
– Circuits of arithmetic operations with inherent special cases (arithmetic exceptions): detectors, comparators, incrementers, and counters (MSBs), adder flags.
– Circuits using redundant number representations (≠ redundant hardware): dividers (Pentium bug!)

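Why the MSBs of incrementers and counters are hard to test with random vectors can be illustrated by counting. In an n-bit incrementer the carry into the MSB is 1 only when all lower bits are 1, so a fault that holds this carry at 0 is excited by a fraction of only $2^{-(n-1)}$ of all input vectors. The sketch below counts those vectors exhaustively; the function name is hypothetical and the single stuck-at-0 fault is an assumed example.

```python
def msb_carry_vectors(n):
    """Count n-bit inputs for which an incrementer's carry into the MSB is 1."""
    low_mask = (1 << (n - 1)) - 1  # all bits below the MSB
    return sum(1 for x in range(1 << n) if (x & low_mask) == low_mask)


if __name__ == "__main__":
    n = 8
    print(msb_carry_vectors(n), "of", 1 << n, "vectors excite the fault")
```

For wider words the fraction shrinks exponentially, which is where special test vectors become necessary.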