Division by Convergence
Instructor: 王立洋
Prepared by (student): M9535204 蔡鐘葳
1/30

Outline
▓ Speedup of Convergence Division
▓ Hardware Implementation
▓ Analysis of Lookup Table Size
▓ Reference
2/30

16.4. Speedup of Convergence Division
3/30

Introduction
$q = \frac{z}{d} = \frac{z\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}{d\,x^{(0)} x^{(1)} \cdots x^{(m-1)}}$
▓ Compute $y = 1/d$, then do the multiplication $yz$
▓ Division can be performed via $2\lceil \log_2 k \rceil - 1$ multiplications
▓ This is not yet very impressive: with 64-bit numbers and a 5-ns multiplier, division takes $(2 \times 6 - 1) \times 5 = 55$ ns
4/30

Three Types of Speedup
Three types of speedup are possible:
▓ Reducing the number of multiplications (reduce $m$)
▓ Using narrower multiplications (reduce the width of some $x^{(i)}$)
▓ Performing the multiplications faster
5/30

Initial Approximation
▓ Convergence is slow in the beginning: it takes 6 multiplications to get 8 bits of convergence and another 5 to go from 8 bits to 64 bits
▓ Since $x^{(0)} x^{(1)} x^{(2)}$ is essentially an approximation to $1/d$, four of these initial multiplications can be replaced by a table-lookup step that directly supplies $x^{(0+)}$
6/30

Initial Approximation via Table Lookup
$d\,x^{(0)} x^{(1)} x^{(2)} = (0.1111\ 1111 \ldots)_{\text{two}}$
(slide annotations: "Approx to 1/d" for the product $x^{(0)} x^{(1)} x^{(2)}$; "Better approx")
▓ Read this value, $x^{(0+)}$, directly from a table, thereby reducing 6 multiplications to 2
▓ A $2^w \times w$ lookup table is necessary and sufficient for $w$ bits of convergence after the first pair of multiplications
7/30

Example with 4-bit lookup
$d = (0.1011\ xxxx \ldots)_{\text{two}}$, so $11/16 \le d < 12/16$
Inverses of the two extremes are $16/11 \approx (1.0111)_{\text{two}}$ and $16/12 \approx (1.0101)_{\text{two}}$
So $(1.0110)_{\text{two}}$ is a good estimate for $1/d$:
$1.0110 \times 0.1011 = (11/8) \times (11/16) = 121/128 = (0.1111001)_{\text{two}}$
$1.0110 \times 0.1100 = (11/8) \times (3/4) = 33/32 = (1.000010)_{\text{two}}$
8/30

Fig. 16.3
[Figure: $d$ and $z$ plotted against iterations; $d$ converges to $1 - \text{ulp}$ and $z$ to $q - \varepsilon$. Annotations: "After table lookup and first pair of multiplications, replacing several iterations"; "After the second pair of multiplications".]
Fig. 16.3 Convergence in division by repeated multiplications with initial table lookup.
9/30

Fig. 16.3
▓ For division by repeated multiplications, we saw that convergence to 1 and $q$ occurred from below
▓ If at some point in our iterations $d^{(i)}$ overshoots 1 (becomes $1 + \varepsilon$), the next multiplicative factor $2 - d^{(i)} = 1 - \varepsilon$ will lead to a value $d^{(i+1)} = (1 + \varepsilon)(1 - \varepsilon) = 1 - \varepsilon^2$, which is smaller than 1 but still closer to 1
10/30

Analysis of Truncated Multiplicative Factors (1/2)
We begin by noting that
$d\,x^{(0)} x^{(1)} \cdots x^{(i)} = 1 - y^{(i)}$
$x^{(i+1)} = 2 - (1 - y^{(i)}) = 1 + y^{(i)}$
Assume that we truncate $1 - y^{(i)}$ to an $a$-bit fraction, thus obtaining $(1 - y^{(i)})_T$ with an error of $0 \le \alpha < 2^{-a}$
11/30

Analysis of Truncated Multiplicative Factors (2/2)
With this truncated multiplicative factor, we get
$(x^{(i+1)})_T = 2 - (1 - y^{(i)})_T = 1 + y^{(i)} + \alpha$
where $0 \le (x^{(i+1)})_T - x^{(i+1)} < 2^{-a}$
Thus
$d\,x^{(0)} x^{(1)} \cdots x^{(i)} (x^{(i+1)})_T = (1 - y^{(i)})(1 + y^{(i)} + \alpha)$
$= 1 - (y^{(i)})^2 + \alpha(1 - y^{(i)})$
$= d\,x^{(0)} x^{(1)} \cdots x^{(i)} x^{(i+1)} + \alpha(1 - y^{(i)})$
12/30

Fig. 16.4
[Figure: $d$ and $z$ plotted against iterations; $d$ converges to $1 \pm \text{ulp}$ and $z$ to $q \pm \varepsilon$.]
Fig. 16.4 Convergence in division by repeated multiplications with initial table lookup and the use of truncated multiplicative factors.
13/30

Fig. 16.4
▓ The first pair of multiplications following the table lookup involves a narrow multiplier, so it may be faster than a full-width multiplication
▓ The same applies to later multiplications if the multiplier is suitably truncated
▓ The result is that convergence occurs from above or below
14/30

Fig. 16.5
[Figure: one step from iteration $i$ to iteration $i+1$, starting at $d\,x^{(0)} x^{(1)} \cdots x^{(i)}$. The precise iteration reaches point A, $d\,x^{(0)} x^{(1)} \cdots x^{(i)} x^{(i+1)}$, below 1. The approximate iteration reaches point B, $d\,x^{(0)} x^{(1)} \cdots x^{(i)} (x^{(i+1)})_T$. The gap between A and B is less than $2^{-a}$.]
Fig. 16.5 One step in convergence division with truncated multiplicative factors.
15/30

Fig. 16.5
▓ If we aim to go from $l$ bits to $2l$ bits of convergence, we can truncate the next multiplicative factor to $2l$ bits
▓ Consider Fig. 16.5: point A, the result of the precise iteration, is no more than $2^{-2l}$ below 1
▓ With $a = 2l$, point B, arrived at by the approximate iteration, will be no more than $2^{-2l}$ above 1
16/30

Example: 64-bit multiplication
▓ Initial step: table of size $256 \times 8 = 2$K bits
▓ Middle steps: multiplication pairs with 9-, 17-, and 33-bit multipliers
▓ Final step: full $64 \times 64$ multiplication
17/30

16.5. Hardware Implementation
18/30

Hardware Implementation
[Figure: pipeline diagram. Recoverable labels: $d^{(i)} x^{(i)}$, $z^{(i)} x^{(i)}$, $d^{(i+1)} x^{(i+1)}$, $z^{(i+1)} x^{(i+1)}$, $d^{(i+2)}$, and a "2's Compl" block.]
Fig. 16.6 Two multiplications fully overlapped in a 2-stage pipelined multiplier.
19/30

Fig. 16.6
As the computation of $z^{(i)} x^{(i)}$ moves from the top to the bottom pipeline stage, the next iteration begins by computing the top stage of $d^{(i+1)} x^{(i+1)}$
20/30

Implementing Division with Reciprocation
▓ Reciprocation: multiplication pairs are data-dependent, so they cannot be pipelined or performed in parallel
▓ In the recurrence $x^{(i+1)} = x^{(i)}(2 - x^{(i)} d)$, the second multiplication by $x^{(i)}$ needs the result of the first one
▓ The most promising speedup method relies on deriving a better starting approximation to $1/d$
21/30

The Required Lookup Table
The required lookup table can be made smaller, or totally eliminated, by a variety of methods:
▓ Store the reciprocal values for fewer points and use linear or higher-order interpolation to compute the starting approximation
▓ Formulate the starting approximation as a multi-operand addition problem and use one pass through the multiplier's CSA tree, suitably augmented, to compute it
22/30

16.6. Analysis of Lookup Table Size
23/30

Theorem for Table Size
Theorem 16.1: To get $w \ge 5$ bits of convergence after the first iteration of division by repeated multiplications, $w$ bits of $d$ (beyond the mandatory 1) must be inspected. The factor $x^{(0+)}$ read out from the table is of the form $(1.xxx \ldots xxx)_{\text{two}}$, with $w$ bits after the radix point.
▓ Based on the theorem, the required table size is $2^w \times w$
▓ The cases $w < 5$ are practically uninteresting (they allow a smaller table), so we can ignore them
24/30

Analysis of Lookup Table Size (1/4)
Recall that our objective is to have
$1 - 2^{-w} \le d\,x^{(0+)} \le 1 + 2^{-w}$
Let $d = (0.1\, d_{-2} d_{-3} \cdots d_{-(w+1)} d_{-(w+2)} \cdots d_{-l})_{\text{two}}$, where $d_{-2} d_{-3} \cdots d_{-(w+1)}$ are the $w$ bits to be inspected
Theorem 16.1 postulates the existence of $x^{(0+)} = (1.\, x^{+}_{-1} x^{+}_{-2} \cdots x^{+}_{-w})_{\text{two}}$ satisfying the objective inequality
25/30

Analysis of Lookup Table Size (2/4)
Let $u = (1\, d_{-2} d_{-3} \cdots d_{-(w+1)})_{\text{two}}$, satisfying $2^w \le u < 2^{w+1}$
We have $2^{-(w+1)} u \le d < 2^{-(w+1)} (u + 1)$
Similarly, let $v = (1\, x^{+}_{-1} x^{+}_{-2} \cdots x^{+}_{-w})_{\text{two}}$
The objective inequality can be rewritten as
$2^w - 1 \le dv \le 2^w + 1$
26/30

Analysis of Lookup Table Size (3/4)
We derive the following sufficient conditions:
$2^w - 1 \le 2^{-(w+1)} u v$
$2^{-(w+1)} (u + 1) v \le 2^w + 1$
The conditions lead to the following restrictions on $v$:
$\frac{2^{w+1}(2^w - 1)}{u} \le v \le \frac{2^{w+1}(2^w + 1)}{u + 1}$
27/30

Analysis of Lookup Table Size (4/4)
Since $v$ is an integer, the latter condition is equivalent to
$\left\lceil \frac{2^{w+1}(2^w - 1)}{u} \right\rceil \le \left\lfloor \frac{2^{w+1}(2^w + 1)}{u + 1} \right\rfloor$
▓ Proving that the last inequality always holds is left as an exercise
▓ This completes the "sufficiency" part of the proof
▓ The "necessity" part: at least $w$ bits of $d$ must be inspected, and $x^{(0+)}$ must have at least $w$ bits after the radix point
28/30

Example
Table 16.2 Sample entries in the lookup table replacing the first four multiplications in division by repeated multiplications
-------------------------------------------------------
Address   $d = 0.1\ xxxx\ xxxx$   $x^{(0+)} = 1.\ xxxx\ xxxx$
-------------------------------------------------------
55        0011 0111             1010 0101
64        0100 0000             1001 1001
-------------------------------------------------------
Example: table entry at address 55 ($311/512 \le d < 312/512$)
For 8 bits of convergence, the table entry $f$ must satisfy
$(311/512)(1 + .f) \ge 1 - 2^{-8}$
$(312/512)(1 + .f) \le 1 + 2^{-8}$
That is, $199/311 \le .f \le 101/156$, or $163.81 \le 256 \times .f \le 165.74$
Two choices: $164 = (1010\ 0100)_{\text{two}}$ or $165 = (1010\ 0101)_{\text{two}}$
29/30

Reference
[1] Behrooz Parhami, "Computer Arithmetic: Algorithms and Hardware Designs," Oxford University Press, 2000.
30/30
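
Supplementary sketch (not part of the original slides): the table-size analysis and the lookup-initialized iteration can be tried out numerically. The Python below is a minimal illustration under stated assumptions. The function names `table_entry_range`, `build_table`, and `divide` are hypothetical. Exact rational arithmetic stands in for hardware multipliers, so truncated multiplicative factors and pipelining are not modeled. Picking the largest admissible entry is an assumption; it happens to reproduce the two sample entries of Table 16.2.

```python
from fractions import Fraction
from math import ceil, floor


def table_entry_range(address, w):
    """Integer bounds (lo, hi) on v = 2^w * x(0+) for one table address.

    u = 2^w + address is the inspected prefix (1 d-2 ... d-(w+1))two, and
    2^(w+1)(2^w - 1)/u <= v <= 2^(w+1)(2^w + 1)/(u + 1).
    """
    u = (1 << w) + address
    lo = ceil(Fraction((1 << (w + 1)) * ((1 << w) - 1), u))
    hi = floor(Fraction((1 << (w + 1)) * ((1 << w) + 1), u + 1))
    return lo, hi


def build_table(w):
    """Build a 2^w-entry table of x(0+) values with w fractional bits."""
    table = []
    for address in range(1 << w):
        lo, hi = table_entry_range(address, w)
        # Assumption: take the largest admissible v that still has the
        # form (1.xxx...x)two, i.e. v < 2^(w+1).
        v = min(hi, (1 << (w + 1)) - 1)
        assert lo <= v, "no admissible table entry for this address"
        table.append(Fraction(v, 1 << w))
    return table


def divide(z, d, w=8, k=64):
    """Approximate z/d for 1/2 <= d < 1 by repeated multiplications.

    Returns (quotient estimate, final d value, number of multiplication
    pairs). For simplicity the last d multiplication is also performed,
    although hardware would skip it.
    """
    table = build_table(w)
    address = floor(d * (1 << (w + 1))) - (1 << w)
    x = table[address]              # x(0+), read from the table
    d_i, z_i = d * x, z * x         # first pair of multiplications
    bits, pairs = w, 1
    while bits < k:                 # each pair doubles the convergence
        x = 2 - d_i                 # two's-complement step
        d_i, z_i = d_i * x, z_i * x
        bits *= 2
        pairs += 1
    return z_i, d_i, pairs


if __name__ == "__main__":
    print(table_entry_range(55, 8))     # bounds on v for address 55
    q, d_final, pairs = divide(Fraction(3, 8), Fraction(11, 16))
    print(pairs, float(q), float(1 - d_final))
```

With `w = 8` and `k = 64`, the sketch uses one lookup followed by four multiplication pairs, matching the 8, 16, 32, 64-bit progression described in the 64-bit example.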