Chapter 4. Datapath Design

4.1 Fixed-point arithmetic
: two's-complement code for fixed-point binary addition / subtraction
Basic adders
2-level combinational circuit : fastest, but too complex
Multi-level combinational circuit or sequential circuit
: trade-off between operating speed and circuit complexity
Full-adder : inputs xi, yi, ci-1 ; outputs zi, ci
zi = xi ⊕ yi ⊕ ci-1
ci = xiyi + xici-1 + yici-1
Half-adder : inputs xi, yi ; outputs zi, ci
zi = xi ⊕ yi
ci = xiyi
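The two adder cells above can be modeled directly (an illustrative Python sketch; the function names are ours, and bits are represented as 0/1 integers):

```python
def half_adder(x, y):
    """Half adder: sum = x XOR y, carry = x AND y."""
    return x ^ y, x & y

def full_adder(x, y, c_in):
    """Full adder: z = x XOR y XOR c_in, carry = majority(x, y, c_in)."""
    z = x ^ y ^ c_in
    c = (x & y) | (x & c_in) | (y & c_in)
    return z, c
```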
• Ripple-carry adder : n full adders cascaded; stage i receives xi, yi
and the carry ci-1 from the previous stage, and produces zi and ci.
The carry may propagate from c0 all the way to cn
→ long propagation delay, and hardware complexity grows with n.
• Serial adder : a single full adder plus a clocked D flip-flop that
feeds the carry ci back as ci-1 on the next clock cycle; the operands
enter one bit pair per cycle.
Slow, but small hardware ( independent of n ).
• n-bit two's-complement adder/subtractor : Z = X ± Y
An n-bit parallel adder whose x inputs pass through XOR gates driven by
a subtract control S, with S also applied as the carry-in:
If S = 0, Z = Y + X
If S = 1, Z = Y − X = Y + X̄ + 1
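A minimal software model of the S-controlled adder/subtractor, assuming an 8-bit word (the function name and bit-loop structure are ours; the loop plays the role of the ripple-carry chain, and S both complements x and supplies the carry-in):

```python
def add_sub(y, x, s, n=8):
    """n-bit two's-complement adder/subtractor.
    S = 0: Z = Y + X.  S = 1: Z = Y - X = Y + ~X + 1."""
    c = s                    # subtract control doubles as carry-in
    z = 0
    for i in range(n):
        xi = ((x >> i) & 1) ^ s   # XOR gate complements x when subtracting
        yi = (y >> i) & 1
        z |= (xi ^ yi ^ c) << i
        c = (xi & yi) | (xi & c) | (yi & c)
    return z, c              # n-bit result and carry-out
```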
High-speed adder
carry-lookahead adder(CLA) : to reduce the time associated with carry
propagation.
ci = gi + pici-1
carry generate : gi = xi · yi
carry propagate : pi = xi + yi
ci = gi + pi(gi-1 + pi-1ci-2)
   = gi + pi(gi-1 + pi-1(gi-2 + pi-2ci-3))
   = gi + pi(··· cin)
A 4-bit CLA
[Figure: a 4-bit CLA; each bit position i forms gi and pi from xi and
yi, the lookahead logic produces all carries at once, and the sum bits
z0 .. z3 and cout emerge in parallel from cin, x0y0 .. x3y3.]
g0 = x0 · y0, p0 = x0 + y0 ( and similarly for g1..g3, p1..p3 )
c1 = g0 + p0cin
c2 = g1 + p1c1 = g1 + g0p1 + p0p1cin
c3 = g2 + g1p2 + g0p1p2 + p0p1p2cin
c4 = g3 + g2p3 + g1p2p3 + g0p1p2p3 + p0p1p2p3cin
Due to the complexity of the carry-lookahead circuit, lookahead is
practical only up to about 8 bits. A 16-bit addition is performed by
cascading four 4-bit CLAs through the carries c4, c8, c12 and c16,
producing z0 .. z15 (z3z2z1z0 from the first block, up through
z15z14z13z12 from the last).
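The lookahead equations above can be sketched in Python (illustrative; real hardware evaluates the expanded two-level sums for c1..c4 in parallel, while this sketch uses the equivalent recurrence ci+1 = gi + pi·ci):

```python
def cla4(x, y, c_in):
    """4-bit carry-lookahead addition: carries from g, p and c_in only."""
    g = [(x >> i & 1) & (y >> i & 1) for i in range(4)]   # generate gi
    p = [(x >> i & 1) | (y >> i & 1) for i in range(4)]   # propagate pi
    c = [c_in]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))   # ci+1 = gi + pi*ci
    z = sum((((x >> i & 1) ^ (y >> i & 1) ^ c[i]) << i) for i in range(4))
    return z, c[4]                       # 4-bit sum and carry-out c4
```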
4.1.2 Multiplication
⃘ multiplication as shifting and addition : simplest and slowest
P = X · Y
ex) 1010 × 1101 → shift and add
      1010
    × 1101
    ------
      1010
     0000
    1010
   1010
  --------
  10000010
pi+1 = pi + xi2^iY
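The recurrence pi+1 = pi + xi·2^i·Y for unsigned operands can be sketched as (illustrative Python; the function name and word length are ours):

```python
def shift_add_multiply(x, y, n=4):
    """Unsigned shift-and-add multiplication: p_{i+1} = p_i + x_i * 2^i * Y."""
    p = 0
    for i in range(n):
        if (x >> i) & 1:
            p += y << i   # add the shifted multiplicand when bit x_i is 1
    return p
```

For the example above, shift_add_multiply(0b1010, 0b1101) gives 0b10000010.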
Two's-complement multiplier
multiplication of 2C numbers
: difficult for negative operands, which require introducing leading 1s
rather than leading 0s on each right shift
One simple approach
: negate all negative operands, perform unsigned
multiplication on the resulting positive numbers, and then
negate the result if necessary
→ up to four extra clock cycles are needed
Robertson's twos-complement multiplier
−X = 2^n − X = ( 111···11 − xn-1xn-2···x1x0 ) + 00···01
If X is positive ( xn-1 = 0 ), then
  X = Σ(i=0 to n−2) 2^i xi
If X is negative ( xn-1 = 1 ), then
  −X = 111···11 − ( 0xn-2xn-3···x1x0 + 100···00 ) + 000···01
     = 2^(n−1) − xn-2xn-3···x1x0
so
  X = −2^(n−1) + xn-2xn-3···x1x0 = −2^(n−1) + Σ(i=0 to n−2) 2^i xi
Combining both cases:
  X = −2^(n−1)xn-1 + Σ(i=0 to n−2) 2^i xi
For n = 6 and X = 101101
  X = −2^5×1 + 2^4×0 + 2^3×1 + 2^2×1 + 2^1×0 + 2^0×1 = −19
If X is a 2C fraction instead of an integer,
  X = −2^0 xn-1 + Σ(i=0 to n−2) 2^(i−n+1) xi
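The combined formula X = −2^(n−1)·xn-1 + Σ 2^i·xi can be checked directly (illustrative Python; the bit string is given most significant bit first):

```python
def twos_complement_value(bits):
    """Evaluate X = -2^(n-1)*x_{n-1} + sum over i of 2^i * x_i."""
    n = len(bits)
    x = [int(b) for b in bits]           # x[0] is the sign bit x_{n-1}
    return -2**(n - 1) * x[0] + sum(2**(n - 2 - i) * x[i + 1]
                                    for i in range(n - 1))
```

For the worked example, twos_complement_value("101101") evaluates to −19.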
Robertson's multiplier for 2C fractions :
P = X · Y, X = x7x6···x0, Y = y7y6···y0
Case 1 : ( x7 = y7 = 0 ) both X and Y are positive
  pi := pi + xiY; pi+1 := 2^(−1)pi ( leading 0s are introduced into A
  during the right shift )
Case 2 : ( x7 = 0, y7 = 1 )
  Leading 0s should be shifted into A until the first 1 in xi is
  encountered. Multiplying Y by this 1 and adding the result to A makes
  pi negative, from which point on leading 1s must be shifted into A.
  Use a flag F whose value determines A(0); F is initially set to 0 and
  updated by F ← (y7 AND xi) OR F.
Case 3 : ( x7 = 1, y7 = 0 )
  For the first seven add-and-shift steps, the same as case 1:
  p7 = Σ(i=0 to 6) 2^(i−7) xiY
  For the final step,
  P ← p7 − Y = Σ(i=0 to 6) 2^(i−7) xiY − Y = ( −x7 + Σ(i=0 to 6) 2^(i−7) xi ) Y
Case 4 : ( x7 = 1, y7 = 1 )
  For the first 7 steps, the same as case 2. For the final step, the
  same as case 3.
Booth's algorithm
→ treats negative and positive operands uniformly
→ possibly uses fewer than n additions or subtractions, thus possibly
faster multiplication
P = X · Y
Two adjacent bits xixi-1 are examined in each step.
If xixi-1 = 01, Y is added to the accumulated partial product pi.
If xixi-1 = 10, Y is subtracted from pi.
If xixi-1 = 00 or 11, neither addition nor subtraction; only the
subsequent right shift of pi.
X = a positive number containing a subsequence
  X* = xixi-1···xi-kxi-k-1 = 011···10
In Robertson's method this run of 1s contributes Σ(j=i−k to i−1) 2^jY.
In Booth's algorithm
  xixi-1 = 01 → +2^iY
  xi-kxi-k-1 = 10 → −2^(i−k)Y
  2^iY − 2^(i−k)Y = 2^(i−k)Y( 2^k − 1 ) = 2^(i−k)Y Σ(m=0 to k−1) 2^m
                  = Σ(m=0 to k−1) 2^(m+i−k)Y
                  = Σ(j=i−k to i−1) 2^jY ( setting j = m+i−k )
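The bit-pair rule above can be sketched as follows (illustrative Python; the function name and word length are ours, and the sketch accumulates the full-width product rather than modeling the AC.MQ right shifts):

```python
def booth_multiply(x, y, n=8):
    """Booth's algorithm on n-bit two's-complement operands.
    Scan pairs x_i x_{i-1}: 01 -> add 2^i*Y, 10 -> subtract 2^i*Y."""
    mask = (1 << n) - 1
    # interpret the multiplicand Y as an n-bit two's-complement value
    ys = (y & mask) - (1 << n) if y & (1 << (n - 1)) else y & mask
    p, prev = 0, 0            # x_{-1} = 0 by convention
    for i in range(n):
        xi = (x >> i) & 1
        if (xi, prev) == (0, 1):
            p += ys << i      # end of a run of 1s: add 2^i * Y
        elif (xi, prev) == (1, 0):
            p -= ys << i      # start of a run of 1s: subtract 2^i * Y
        prev = xi
    return p
```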
Combinational array multiplier
: possible with the advances in VLSI
P = X · Y = Σ(i=0 to n−1) 2^i xiY
          = Σ(i=0 to n−1) 2^i ( Σ(j=0 to n−1) 2^j xiyj )
          = Σ(i=0 to n−1) Σ(j=0 to n−1) 2^i2^j ( xiyj )
[Figure: full-adder array for a 3-bit combinational multiplier; the
partial-product bits xiyj feed rows of full adders that produce the
product bits z5z4z3z2z1z0.]
Division
D = Q · V + R, 0 ≤ | R | < | V |
( D : dividend, Q : quotient, V : divisor, R : remainder )
In multiplication : the shifted multiplicand is added.
In division : the shifted divisor is subtracted.
Division by repeated multiplication
: efficient for systems containing a high-speed multiplier
In each iteration a factor Fi is generated, and
  Q = ( D × F0 × F1 × F2 ··· ) / ( V × F0 × F1 × F2 ··· )
Fi is chosen so that V × F0 × F1 × F2 ··· converges rapidly toward one.
Suppose D and V are positive normalized fractions, V = 1 − x for x < 1.
First, set F0 = 1 + x :
  V × F0 = ( 1 − x )( 1 + x ) = 1 − x^2 : closer to one than V
Next, F1 = 1 + x^2 :
  V × F0 × F1 = ( 1 − x^2 )( 1 + x^2 ) = 1 − x^4
The process terminates when Vi = 0.11···11, the number closest to one
for the given word size.
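The convergence scheme can be sketched numerically (illustrative Python, with floating point standing in for the hardware fractions; each factor is formed as Fi = 2 − Vi, which equals 1 + x^(2^i) when Vi = 1 − x^(2^i), and a fixed iteration count replaces the 0.11···11 termination test):

```python
def divide_by_repeated_multiplication(d, v, iterations=6):
    """Approximate D/V by scaling both toward V*F0*F1*... -> 1."""
    assert 0.5 <= v < 1.0, "V must be a positive normalized fraction"
    f = 2.0 - v              # F0 = 1 + x for V = 1 - x
    for _ in range(iterations):
        d, v = d * f, v * f  # multiply numerator and denominator by Fi
        f = 2.0 - v          # next factor: 1 + x^(2^(i+1))
    return d                 # v is now ~1.0, so d ~ D/V
```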
4.2 ALU Design : to execute data processing instructions
4.2.1 Combinational ALU
The simplest ALU : a 2C adder/subtractor and a word-based logic function
circuit.
f(xi, yi) = xiyis3 + xiy̅is2 + x̅iyis1 + x̅iy̅is0
with a 4-bit select S
→ universal function generation : any of the 16 two-variable logic
functions can be selected.
The complete 4-bit ALU : more than 100 gates and depth 9.
An efficient 4-bit ALU IC chip : 60 gates and depth 6, by sharing the g,
p and sum circuits with the logic unit.
Fi = IPi ⊕ IGi ⊕ ( ICi-1 + M )
IPi = Ai + BiS0 + B̅iS1
IGi = AiB̅iS2 + AiBiS3
For the logic operations ( M = 1 ):
Fi is a select-controlled function of Ai and Bi of the form
  S0, S1, S2, S3 choosing among the minterms of Ai, Bi
For addition ( M = 0 ):
Fi = IPi ⊕ IGi ⊕ ICi-1
with S = 1001 : F = A plus B plus Cin ( carry-lookahead adder )
with S = 0110 : F = A minus B minus Cin ( 2C subtraction )
4.2.2 Sequential ALU
Combinational multipliers and dividers are more complex than
adder/subtractors: the number of gates in the multiply-divide logic is
greater than that of the adder-subtractor by a factor of about n.
→ low-cost sequential circuits, where add/subtract takes one clock cycle
and multiplication/division takes multiple cycles.
Operation        Register transfer
Addition         AC := AC + DR
Subtraction      AC := AC − DR
Multiplication   AC.MQ := DR × MQ
Division         AC.MQ := MQ / DR
AND              AC := AC and DR
OR               AC := AC or DR
EXCLUSIVE-OR     AC := AC xor DR
NOT              AC := not(AC)
ALU expansion
1. Spatial expansion ( bit-sliced ALU ) :
form a single km-bit ALU by connecting k copies of an m-bit ALU IC.
Each component ALU concurrently processes a separate "slice" of m bits.
2. Temporal expansion : use an m-bit ALU chip to perform an operation on
km-bit operands in k consecutive steps. In each step, the ALU processes
a separate m-bit slice of each operand.
→ multicycle processing
4.3.1 Floating-point arithmetic
( XM , XE ) → X = 2^XE · XM
addition:       X + Y = 2^XE·XM + 2^YE·YM = 2^YE( 2^(XE−YE)·XM + YM )
subtraction:    X − Y = 2^YE( 2^(XE−YE)·XM − YM )
multiplication: X × Y = XMYM · 2^(XE+YE)
division:       X / Y = XM/YM · 2^(XE−YE)
addition(subtraction)
1. Compute YE − XE : fixed-point subtraction.
2. Shift XM right by ( YE − XE ) places to form XM · 2^(XE−YE).
3. Compute XM · 2^(XE−YE) ± YM.
4. Normalize by left-shifting(right-shifting) the mantissa and
decreasing(increasing) the exponent by 1.
guard bit : to preserve accuracy by rounding rather than truncating.
ex) X = 0.1011 × 2^−5
    Y = 0.1101 × 2^−6
X + Y = 2^−5( 0.1011 + 0.1101 × 2^−1 ) = 2^−5( 0.1011 + 0.01101 )
If we have only a 4-bit mantissa, the fifth bit is truncated → so we
provide one more bit ( the guard bit ). When a mantissa is right-shifted
during the alignment step of addition/subtraction, the bit shifted out
of the right end is retained as the guard bit.
Rounding is accomplished by adding 1 at position xn+1 and truncating to
n bits:
      0.10110
    + 0.01101
    ---------
      1.00011
    +       1   ( rounding )
    ---------
      1.00100
If xn+1 = 0 ⇒ no effect.
If xn+1 = 1 ⇒ rounding up.
When multiplying two numbers X·Y :
XM = 0.1, YM = 0.1 ⇒ XM · YM = 0.01
If XM·YM is now truncated or rounded to n bits, the precision is only
( n − 1 ) bits ⇒ we need two guard bits.
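The alignment-plus-guard-bit procedure can be sketched for the example above (illustrative Python only; mantissas are passed as integers, e.g. 0.1011 → 0b1011, the function name is ours, and the final renormalization step is left out so the result matches the worked sum 1.00100 × 2^−5):

```python
def fp_add_with_guard(mx, ex, my, ey, n=4):
    """Add 0.mx * 2^ex and 0.my * 2^ey keeping one guard bit,
    then round by adding 1 at the guard position and truncating."""
    if ex < ey:                       # make X the operand with larger exponent
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    shift = ex - ey
    gx = mx << 1                      # scale by 2 to make room for the guard bit
    gy = (my << 1) >> shift if shift <= n + 1 else 0   # align smaller operand
    s = gx + gy                       # sum, guard bit in the lowest position
    s = (s + 1) >> 1                  # round: +1 at guard position, truncate
    return s, ex                      # mantissa (possibly >= 1.0) and exponent
```

For the example, fp_add_with_guard(0b1011, -5, 0b1101, -6) yields (0b10010, -5), i.e. 1.0010 × 2^−5 before normalization.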
Exponent biasing : X·Y = XM2^XE · YM2^YE = XMYM2^(XE+YE)
Example) If exponents are added using ordinary integer arithmetic, the
resulting exponent is doubly biased and must be corrected by subtracting
the bias. If the bias is 2^(4−1) = 8 :
    XE = 1111   ( true exponent 15 − 8 = 7 )
  + YE = 0101   ( true exponent  5 − 8 = −3 )
  ----------
       10100    ( 20 )
true sum : 7 + (−3) = 4 ; correctly biased : 4 + 8 = 12 ( 1100 )
20 ≠ 12 → subtract the bias once : 20 − 8 = 12
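The bias correction can be stated in one line (illustrative Python; the function name is ours, and the bias of 8 matches the example above):

```python
def add_biased_exponents(xe, ye, bias=8):
    """Adding two biased exponents gives a doubly biased sum;
    subtract the bias once to restore a singly biased result."""
    return xe + ye - bias
```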
Arithmetic Processors
The CPU handles control and non-arithmetic functions
→ no room for complex arithmetic operations such as trigonometric functions.
→ rely on much slower software routines to provide the missing arithmetic operations.
→ an auxiliary special-purpose arithmetic processor executes a class of
instructions not executable by the CPU itself: it speeds execution by
replacing software routines with hardware ( example : a vector processor ).
Two approaches for arithmetic processors
① treat it as a peripheral: a peripheral processor. The CPU assigns addresses for the
peripheral processor in the CPU memory or I/O address space. The CPU and
peripheral processor are independent; the CPU may proceed to other tasks while
the peripheral processor is busy. Main disadvantage: slow communication links.
② coprocessor: the peripheral processor is closely coupled to the CPU, so that its
instructions and register set are extensions of those of the CPU. → The CPU's instruction
set contains a special subset of codes reserved for the coprocessor. The
communication between the CPU and coprocessor is implemented in hardware.
A coprocessor is tailored to a particular CPU. Even if no coprocessor is
present, the special coprocessor instructions can be included in a CPU
program → they are then executed by a software routine.
The coprocessor approach makes it possible to provide either hardware or
software support for certain instructions without changing the program.
4.3.2 Pipeline processing : to increase processor throughput without large extra hardware.
A pipeline processor consists of a sequence of m data-processing stages, which
collectively perform a single operation on a stream of data operands passing
through them.
When the pipeline is full, m separate operations are being executed concurrently,
each in a different stage, producing a new result every clock cycle.
T : each stage's operation time
mT : delay or latency of the pipeline, the time to complete a single operation
1/T : throughput, the max. number of operations per second
The performance of the pipeline is determined by the delay(latency) T of a single stage,
not by the latency mT of the entire pipeline.
→ a speedup factor of m compared to a nonpipelined implementation.
Any operation that can be decomposed into a sequence of suboperations of the same
complexity → pipeline processing
For a 4-stage example:
The time required for one operation using a nonpipelined processor : 4T
N consecutive additions, pipelined : ( N + 3 )T
Speedup = 4N / ( N + 3 )
Pipeline design
1. Find a suitable multistage sequential algorithm.
→ Each stage should have roughly the same execution time.
2. Place a fast buffer register between stages.
Whether a particular function F should be implemented by a pipelined or nonpipelined
processor:
For an m-stage pipelined implementation, F is divided into F1, F2, ···, Fm.
Let Fi be realizable by a circuit Ci with propagation delay(execution time) Ti.
Let TR be the delay of each stage Si due to its buffer register Ri and associated control logic.
min. clock period TC = max{Ti} + TR
Throughput : 1/TC
For a nonpipelined implementation, the execution time T ≈ Σ(i=1 to m) Ti
If TC < Σ(i=1 to m) Ti, then pipelining increases performance.
Pipelined multipliers
X, Y : n-bit fixed-point numbers
Pi = Pi-1 + xi2^iY
main disadvantage : slow speed of the carry-propagation logic, extra hardware
complexity ( M cells needed : n^2, buffer registers : 3n^2 )
→ rarely used
• A carry-save ( Wallace tree ) multiplier : well suited to pipelining
Systolic arrays : built by interconnecting a set of identical processing cells in a
uniform manner. The name "systolic" derives from the rhythmic contraction of the heart.
X, Y : n × n matrices, Z = X · Y
Zij = Σ(k=1 to n) xi,k × yk,j
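The function a systolic array evaluates is the ordinary matrix product zij = Σk xik·ykj; as a plain reference model (illustrative Python, not a simulation of the cell-level data flow through the array):

```python
def matmul(x, y):
    """Z = X * Y for square matrices: one multiply-accumulate sum per z_ij,
    the computation each systolic cell performs incrementally."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```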
The major characteristics of a systolic array
1. It provides a high degree of parallelism by processing many sets of
operands concurrently.
2. Partially processed data sets flow synchronously through the array in
pipeline fashion, but possibly in several directions at once, with complete
results eventually appearing at the array boundary.
3. The use of uniform cells and interconnections simplifies implementation, for
example, when using single-chip VLSI technology.
4. The control of the array is simple, since all cells perform the same
operations; however, care must be taken to supply the data in the correct
sequence for the operation being implemented.
5. If the X and Y matrices are generated in real time, it is unnecessary to store
them before computing X × Y, as with most sequential or parallel
processing techniques. Thus the use of systolic arrays reduces overall
memory requirements.
6. The amount of hardware needed to implement a systolic array like that of
Figure 4.59 is relatively large, even taking maximum advantage of VLSI.