Division by Convergence
Instructor: 王立洋
Prepared by: 蔡鐘葳 (student ID M9535204)
Outline
▓ Speedup of Convergence Division
▓ Hardware Implementation
▓ Analysis of Lookup Table Size
▓ Reference
16.4. Speedup of Convergence Division
Introduction
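$$q = \frac{z}{d} = \frac{z\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}{d\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}$$
Alternatively, compute y = 1/d and then do the multiplication yz
For k-bit operands, m = ⌈log2 k⌉ factors are needed, so division can be performed via 2⌈log2 k⌉ – 1 multiplications
This is not yet very impressive:
64-bit numbers, 5-ns multiplier ⇒ 11 multiplications ⇒ 55-ns division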
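The multiplication count can be checked with a minimal sketch of division by repeated multiplications. The function name `convergence_divide` and the use of exact `Fraction` arithmetic are assumptions made for illustration; a hardware divider would use fixed-width fractions.

```python
from fractions import Fraction
from math import ceil, log2

def convergence_divide(z, d, k):
    """Approximate q = z/d for 1/2 <= d < 1, aiming at k bits of convergence.

    Returns the quotient estimate and the number of multiplications used.
    """
    m = ceil(log2(k))            # number of factors x(0) .. x(m-1)
    mults = 0
    for i in range(m):
        x = 2 - d                # x(i) = 2 - d(i)
        z = z * x                # z(i+1) = z(i) x(i)
        mults += 1
        if i < m - 1:            # the final d(i) x(i) is never used
            d = d * x            # d(i+1) = d(i) x(i)
            mults += 1
    return z, mults

q, mults = convergence_divide(Fraction(3, 8), Fraction(11, 16), 64)
print(mults, float(q))
```

For k = 64 the loop runs m = 6 times and performs 2m – 1 = 11 multiplications, which at 5 ns each gives the 55-ns figure.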
Three Types of Speedup
Three types of speedup are possible:
Reducing the number of multiplications (reduce m)
Using narrower multiplications (reduce the width of some x(i)s)
Performing the multiplications faster
Initial Approximation
Convergence is slow in the beginning:
It takes 6 multiplications to get 8 bits of convergence and another 5 to go from 8 bits to 64 bits
Since x(0) x(1) x(2) is essentially an approximation to 1/d, four of these six initial multiplications can be replaced by a table-lookup step that directly supplies x(0+); only d x(0+) and z x(0+) remain
Initial Approximation via Table Lookup
d x(0) x(1) x(2) = (0.1111 1111 . . . )two, so x(0) x(1) x(2) is an approximation to 1/d
A better approximation, x(0+), is read directly from a table, thereby reducing 6 multiplications to 2
A 2^w × w lookup table is necessary and sufficient for w bits of convergence after the first pair of multiplications
Example with 4-bit lookup
Example with 4-bit lookup: d = (0.1011 xxxx . . .)two
11/16 ≤ d < 12/16
Inverses of the two extremes are 16/11 ≈ 1.0111 and 16/12 ≈ 1.0101
So, 1.0110 is a good estimate for 1/d
1.0110 × 0.1011 = (11/8) × (11/16) = 121/128 = 0.1111001
1.0110 × 0.1100 = (11/8) × (3/4) = 33/32 = 1.000010
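The arithmetic in this example can be reproduced exactly with a short sketch. The variable names are chosen here for illustration only.

```python
from fractions import Fraction

x0 = Fraction(0b10110, 16)   # table output (1.0110)two = 11/8
d_lo = Fraction(0b1011, 16)  # smallest d in the interval, (0.1011)two = 11/16
d_hi = Fraction(0b1100, 16)  # upper end of the interval, (0.1100)two = 12/16

prod_lo = x0 * d_lo          # 121/128, slightly below 1
prod_hi = x0 * d_hi          # 33/32, slightly above 1

# Both products lie within 2^-4 of 1, i.e. 4 bits of convergence.
print(prod_lo, prod_hi, abs(1 - prod_lo) <= Fraction(1, 16), abs(prod_hi - 1) <= Fraction(1, 16))
```

The two products bracket 1, so the estimate 1.0110 gives 4 bits of convergence over the whole interval of d.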
[Figure: Fig. 16.3 Convergence in division by repeated multiplications with initial table lookup. The curves for d and z rise toward 1 – ulp and q – ε over the iterations; the first point is reached after the table lookup and first pair of multiplications (replacing several iterations), the next after the second pair of multiplications.]
Fig. 16.3
In division by repeated multiplications, we saw that convergence to 1 and q occurred from below
If at some point in our iterations d(i) overshoots 1 (becomes 1 + ε),
the next multiplicative factor 2 – d(i) = 1 – ε leads to d(i+1) = (1 + ε)(1 – ε) = 1 – ε^2,
which is smaller than 1 but still closer to 1 than d(i) was
Analysis of Truncated Multiplicative Factors (1/2)
We begin by noting that
d x(0) x(1) … x(i) = 1 – y(i)
x(i+1) = 2 – (1 – y(i)) = 1 + y(i)
Assume that we truncate 1 – y(i) to an a-bit fraction,
thus obtaining (1 – y(i))T with an error of α, where 0 ≤ α < 2^–a
Analysis of Truncated Multiplicative Factors (2/2)
With this truncated multiplicative factor, we get
(x(i+1))T = 2 – (1 – y(i))T = 1 + y(i) + α
where 0 ≤ (x(i+1))T – x(i+1) < 2^–a
Thus
d x(0) x(1) … x(i) (x(i+1))T = (1 – y(i))(1 + y(i) + α)
= 1 – (y(i))^2 + α(1 – y(i)) = d x(0) x(1) … x(i) x(i+1) + α(1 – y(i))
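The effect of the truncation error α can be illustrated numerically. This is a sketch under stated assumptions: the helper names `truncate` and `one_step` and the sample value 299/300 are hypothetical, and truncation is modeled as chopping to a bits after the radix point.

```python
from fractions import Fraction
from math import floor

def truncate(value, a):
    """Chop a nonnegative value to a bits after the radix point."""
    return Fraction(floor(value * 2**a), 2**a)

def one_step(p, a):
    """Given p = d x(0) ... x(i) = 1 - y(i), return the next product
    computed with the precise factor and with the truncated factor."""
    x_exact = 2 - p                 # x(i+1) = 1 + y(i)
    x_trunc = 2 - truncate(p, a)    # (x(i+1))T = 1 + y(i) + alpha
    return p * x_exact, p * x_trunc

p = Fraction(299, 300)              # about 8 bits of convergence (y < 2^-8)
precise, approx = one_step(p, 16)   # truncate the factor to a = 16 bits
alpha = p - truncate(p, 16)
print(float(1 - precise), float(approx - precise), approx - precise == alpha * p)
```

The approximate product exceeds the precise one by exactly α(1 – y(i)), which is smaller than 2^–a, so both results stay within 2^–16 of 1.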
[Figure: Fig. 16.4 Convergence in division by repeated multiplications with initial table lookup and the use of truncated multiplicative factors. The curves for d and z approach 1 ± ulp and q ± ε over the iterations.]
Fig. 16.4
The first pair of multiplications following the table lookup involves a narrow multiplier
and may thus be faster than a full-width multiplication
The same holds for later multiplications if the multiplier is suitably truncated
The result is that convergence occurs from above or below
[Figure: Fig. 16.5 One step in convergence division with truncated multiplicative factors. From d x(0) x(1) ... x(i) at iteration i, the precise iteration reaches A = d x(0) x(1) ... x(i) x(i+1) below 1, while the approximate iteration reaches B = d x(0) x(1) ... x(i) (x(i+1))T, less than 2^–a above A.]
Fig. 16.5
If we aim to go from l bits to 2l bits of convergence,
we can truncate the next multiplicative factor to 2l bits
Consider Fig. 16.5:
A, the result of the precise iteration, is no more than 2^–2l below 1
With a = 2l, B, arrived at by the approximate iteration, will be no more than 2^–2l above 1
Example
64-bit multiplication
Initial step: Table of size 256 × 8 = 2K bits
Middle steps: Multiplication pairs, with 9-, 17-, and 33-bit multipliers
Final step: Full 64 × 64 multiplication
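The widths in this example follow a simple doubling pattern, sketched below. The reading assumed here is that a factor carrying c bits after the radix point, plus its leading 1, needs a (c + 1)-bit multiplier; the function names are hypothetical.

```python
def table_bits(w):
    """Size in bits of a 2^w-entry table with w-bit entries."""
    return 2**w * w

def multiplier_widths(k, w):
    """Multiplier widths for the narrow multiplication pairs that take the
    convergence from w bits up to, but not including, the full k bits."""
    widths = []
    bits = w
    while bits < k:
        widths.append(bits + 1)   # c fraction bits plus the leading 1
        bits *= 2                 # each pair doubles the bits of convergence
    return widths

print(table_bits(8), multiplier_widths(64, 8))
```

With k = 64 and w = 8 this gives a 2048-bit table and pairs using 9-, 17-, and 33-bit multipliers, after which one full-width multiplication finishes the division.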
16.5. Hardware Implementation
[Figure: Fig. 16.6 Two multiplications fully overlapped in a 2-stage pipelined multiplier. Inputs d(i), x(i), z(i); a 2's-complement unit forms x(i+1) from d(i+1); the products d(i) x(i), z(i) x(i), d(i+1) x(i+1), z(i+1) x(i+1) follow one another through the two stages.]
Fig. 16.6
As the computation of z(i) x(i) moves from the top to the bottom pipeline stage,
the next iteration begins with the computation of d(i+1) x(i+1) in the top stage
Implementing Division with Reciprocation
Reciprocation: Multiplication pairs are data-dependent, so they cannot be pipelined or performed in parallel
Since in the recurrence x(i+1) = x(i) (2 – x(i) d)
the second multiplication by x(i) needs the result of the first one
The most promising speedup method relies on deriving a better starting approximation to 1/d
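The data dependence in the recurrence is easy to see in a minimal sketch. The function name `reciprocal` and the default starting value 3/2 are assumptions made for illustration, not values taken from the slides.

```python
from fractions import Fraction

def reciprocal(d, iterations, x0=None):
    """Approximate 1/d for 1/2 <= d < 1 with x(i+1) = x(i) (2 - x(i) d)."""
    x = Fraction(3, 2) if x0 is None else x0   # assumed starting guess
    for _ in range(iterations):
        t = x * d          # first multiplication
        x = x * (2 - t)    # second multiplication must wait for t
    return x

d = Fraction(11, 16)
rough = reciprocal(d, 6)                        # crude start, six iterations
quick = reciprocal(d, 2, x0=Fraction(11, 8))    # 4-bit table value as start
print(float(abs(rough * d - 1)), float(abs(quick * d - 1)))
```

Each iteration squares the error 1 – x(i) d, so starting from the table value 11/8 (error 7/128) two iterations already bring the error below 2^–16.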
The Required Lookup Table
The Required Lookup Table can be made smaller, or
totally eliminated, by a variety of methods
Store the reciprocal values for fewer points
Use linear or higher-order interpolation to compute the
starting approximation
Formulate the starting approximation as a multi-operand
addition problem
Use or pass through the multiplier’s CSA tree, suitably
augmented, to compute it
16.6. Analysis of Lookup Table Size
Theorem for Table Size
Theorem 16.1: To get w ≥ 5 bits of convergence after the first iteration of division by repeated multiplications, w bits of d (beyond the mandatory 1) must be inspected. The factor x(0+) read out from the table is of the form (1.xxx . . . xxx)two, with w bits after the radix point
Based on the theorem, the required table size is 2^w × w
The cases w < 5 are practically uninteresting (they allow smaller tables), so we can ignore them
Analysis of Lookup Table Size (1/4)
Recall that our objective is to have
$$1 - 2^{-w} \le d\,x^{(0+)} \le 1 + 2^{-w}$$
Let $d = (0.1\,d_{-2} d_{-3} \cdots d_{-(w+1)} d_{-(w+2)} \cdots d_{-l})_{\text{two}}$, where $d_{-2} d_{-3} \cdots d_{-(w+1)}$ are the w bits to be inspected
Theorem 16.1 postulates the existence of $x^{(0+)} = (1.\,x^{+}_{-1} x^{+}_{-2} \cdots x^{+}_{-w})_{\text{two}}$ satisfying the objective inequality
Analysis of Lookup Table Size (2/4)
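Let $u = (1\,d_{-2} d_{-3} \cdots d_{-(w+1)})_{\text{two}}$, an integer satisfying $2^{w} \le u < 2^{w+1}$
We have $2^{-(w+1)} u \le d < 2^{-(w+1)} (u+1)$
Similarly, let $v = (1\,x^{+}_{-1} x^{+}_{-2} \cdots x^{+}_{-w})_{\text{two}}$, so that $x^{(0+)} = 2^{-w} v$
The objective inequality can be rewritten as
$$2^{w} - 1 \le d\,v \le 2^{w} + 1$$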
Analysis of Lookup Table Size (3/4)
We derive the following sufficient conditions
$$2^{w} - 1 \le 2^{-(w+1)}\,u\,v$$
$$2^{-(w+1)}\,(u+1)\,v \le 2^{w} + 1$$
The conditions lead to the following restrictions on v
$$\frac{2^{w+1}(2^{w} - 1)}{u} \le v \le \frac{2^{w+1}(2^{w} + 1)}{u + 1}$$
Analysis of Lookup Table Size (4/4)
Because v is an integer, a suitable v exists if and only if
$$\left\lceil \frac{2^{w+1}(2^{w} - 1)}{u} \right\rceil \le \left\lfloor \frac{2^{w+1}(2^{w} + 1)}{u + 1} \right\rfloor$$
Showing that the last inequality always holds is left as an exercise
This completes the “sufficiency” part of the proof
“Necessity” part: at least w bits of d must be inspected, and x(0+) must have at least w bits after the radix point
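The sufficiency condition can be checked exhaustively for small w. The sketch below is a numerical check rather than a proof; the function names are hypothetical, and u and v are the integers (1 d-2 … d-(w+1))two and (1 x+-1 … x+-w)two used in the analysis.

```python
def valid_table_entries(u, w):
    """Integers v between the ceiling of 2^(w+1) (2^w - 1) / u and the
    floor of 2^(w+1) (2^w + 1) / (u + 1), inclusive."""
    lo = -(-(2**(w + 1) * (2**w - 1)) // u)     # ceiling via negated floor division
    hi = (2**(w + 1) * (2**w + 1)) // (u + 1)   # floor
    return range(lo, hi + 1)

def table_exists(w):
    """True if every address u in [2^w, 2^(w+1)) has at least one valid
    entry v of the form (1.xxx...x)two with w bits after the radix point."""
    for u in range(2**w, 2**(w + 1)):
        entries = valid_table_entries(u, w)
        if len(entries) == 0 or not (2**w <= entries[0] < 2**(w + 1)):
            return False
    return True

print([table_exists(w) for w in range(5, 11)])
print(list(valid_table_entries(311, 8)))
```

For w = 8 and u = 311 the valid values of v are 420 and 421, that is, 256 + 164 and 256 + 165.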
Example
Table 16.2 Sample entries in the lookup table replacing the first four multiplications in division by repeated multiplications
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
Address    d = 0.1 xxxx xxxx    x(0+) = 1. xxxx xxxx
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
55         0011 0111            1010 0101
64         0100 0000            1001 1001
–––––––––––––––––––––––––––––––––––––––––––––––––––––––
Example: Table entry at address 55 (311/512 ≤ d < 312/512)
For 8 bits of convergence, the table entry f must satisfy
(311/512)(1 + .f) ≥ 1 – 2^–8
(312/512)(1 + .f) ≤ 1 + 2^–8
that is, 199/311 ≤ .f ≤ 101/156, or 163.81 ≤ 256 × .f ≤ 165.74
Two choices: 164 = (1010 0100)two or 165 = (1010 0101)two
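The same calculation can be repeated for any table address. This is a minimal sketch; the function name `entry_choices` is hypothetical, and it simply solves the two inequalities for the w-bit entry f at a given address.

```python
from fractions import Fraction
from math import ceil, floor

def entry_choices(address, w=8):
    """Integer values of 2^w * .f that give w bits of convergence for every
    d in [(2^w + address) / 2^(w+1), (2^w + address + 1) / 2^(w+1))."""
    d_lo = Fraction(2**w + address, 2**(w + 1))
    d_hi = Fraction(2**w + address + 1, 2**(w + 1))
    eps = Fraction(1, 2**w)
    f_min = (1 - eps) / d_lo - 1    # from d_lo (1 + .f) >= 1 - 2^-w
    f_max = (1 + eps) / d_hi - 1    # from d_hi (1 + .f) <= 1 + 2^-w
    return list(range(ceil(f_min * 2**w), floor(f_max * 2**w) + 1))

for address in (55, 64):
    print(address, [format(f, "08b") for f in entry_choices(address)])
```

Address 55 yields 164 and 165, and address 64 yields 152 and 153; the entries listed in Table 16.2 (1010 0101 and 1001 1001) are the larger choice in each case.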
Reference
[1] Behrooz Parhami, “Computer Arithmetic: Algorithms and Hardware Designs,” Oxford University Press, 2000.