CS 267:
Applications of Parallel Computers
Floating Point Arithmetic
and its Impact on Algorithm Design
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr09
03/04/2009
CS267 Lecture 12
Outline
° A little history
° IEEE floating point formats
° Error analysis
° Exception handling
• Using exception handling to go faster
° Uses and costs of extra precision
° Dangers of Parallel and Heterogeneous Computing
° Some class projects
A little history
° Von Neumann and Goldstine - 1947
• “Can’t expect to solve most big [n>15] linear systems without carrying many
decimal digits [d>8], otherwise the computed answer would be completely
inaccurate.” - WRONG!
° Turing - 1949
• “Carrying d digits is equivalent to changing the input data in the d-th place and
then solving Ax=b. So if A is only known to d digits, the answer is as accurate
as the data deserves.”
• Backward Error Analysis
° Rediscovered in 1961 by Wilkinson and publicized (Turing Award 1970)
° Starting in the 1960s, many backward error analyses of various algorithms
° Many years where each machine did FP arithmetic slightly differently
• Both rounding and exception handling differed
• Hard to write portable and reliable software
• Motivated search for industry-wide standard, beginning late 1970s
• First implementation: Intel 8087
° Turing Award 1989 to W. Kahan for design of the IEEE Floating Point
Standards 754 (binary) and 854 (decimal)
• Nearly universally implemented in general purpose machines (but…)
° Revision of IEEE standard in 2008
Defining Floating Point Arithmetic
° Representable numbers
• Scientific notation: ±d.d…d x r^exp
• sign bit ±
• radix r (2 or 10; 16 has been used in the past)
• significand d.d…d (how many base-r digits d?)
• exponent exp (range?)
• others?
° Operations:
• arithmetic: +, -, x, /, ...
  - how to round result to fit in format
• comparison (<, =, >)
• conversion between different formats
  - short to long FP numbers, FP to integer
• exception handling
  - what to do for 0/0, 2*largest_number, etc.
• binary/decimal conversion
  - for I/O, when radix not 10
° Language/library support for these operations
IEEE Floating Point Arithmetic Standard 754 - Normalized Numbers
° Normalized Nonzero Representable Numbers: ±1.d…d x 2^exp
• Macheps = Machine epsilon = 2^(-# significand bits) = relative error in each operation
• OV = overflow threshold = largest number
• UN = underflow threshold = smallest number

Format            # bits   # significand bits   macheps            # exponent bits   exponent range
Single            32       23+1                 2^-24 (~10^-7)     8                 2^-126 to 2^127 (~10^±38)
Double            64       52+1                 2^-53 (~10^-16)    11                2^-1022 to 2^1023 (~10^±308)
Double Extended   80       64                   2^-64 (~10^-19)    15                2^-16382 to 2^16383 (~10^±4932)
(Double Extended is 80 bits on Intel machines)

° Zero: ±0, significand and exponent all zero
• Why bother with -0? Example later…
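These thresholds and epsilons are exposed in C's float.h; a minimal sketch, not from the original slides (note DBL_EPSILON is the 1-ulp gap 2^-52 above 1.0, i.e. twice the table's rounding-error macheps 2^-53):

#include <stdio.h>
#include <float.h>

int main(void) {
    /* eps here is the gap from 1.0 to the next representable number */
    printf("float:  eps = %g  OV = %g  UN = %g\n",
           (double)FLT_EPSILON, (double)FLT_MAX, (double)FLT_MIN);
    printf("double: eps = %g  OV = %g  UN = %g\n",
           DBL_EPSILON, DBL_MAX, DBL_MIN);
    /* Anything below half an ulp of 1.0 rounds away when added to 1.0 */
    printf("1 + DBL_EPSILON/4 == 1 ?  %d\n", 1.0 + DBL_EPSILON/4 == 1.0);
    return 0;
}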
IEEE Floating Point Arithmetic Standard 754R (2008)
° Added Decimal Numbers (formerly separate standard 854)
• Can easily store 3 decimal digits in 10 bits, so don’t waste much space
° Motivation: “what you see is what you get” in spreadsheets
• What is Excel doing? (next slide)
What is Excel doing?
Cell   Formula           Displayed value
A1     =4/3              1.333333333333330000
A2     =4/3-1            0.333333333333333000
A3     =(4/3-1)*3        1.000000000000000000
A4     =(4/3-1)*3-1      0
A5     =A4*(2^52)        0
A6     =(4/3-1)*3-1      0
A7     =((4/3-1)*3-1)    -2.22045E-16
A8     =A7*(2^52)        -1
• Excel tries to round internal binary floating point to output
decimal format to look like what it thinks the user wants to see,
rather than the most accurate answer (depending on
parentheses).
Rules for performing arithmetic
° As simple as possible:
• Take the exact value, and round it to the nearest floating point number
(correct rounding)
• Break ties by rounding to nearest floating point number whose bottom
bit is zero (rounding to nearest even)
• Other rounding options too (up, down, towards 0) … for error bounds
° Don’t need exact value to do this!
• Early implementors worried it might be too expensive, but it isn’t
° Applies to
• +, -, *, /, sqrt
• conversion between formats
• rem(a,b) = remainder of a after dividing by b
  - a = q*b + rem, q = floor(a/b) … rem exact, even if q is huge
  - cos(x) = cos(rem(x,2*pi)) for |x| >= 2*pi
  - cos(x) is then exactly periodic, with period = rounded(2*pi)
  - Otherwise can't count on identities like sin(2x) = 2*sin(x)*cos(x), etc.
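The directed-rounding options mentioned above are programmable from C99's fenv.h; a sketch (some compilers need -frounding-math for fesetround to be respected):

#include <stdio.h>
#include <fenv.h>
#pragma STDC FENV_ACCESS ON

int main(void) {
    volatile double x = 1.0, y = 3.0;   /* volatile: force runtime division */
    int modes[]  = { FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO };
    const char *names[] = { "to nearest", "up", "down", "toward zero" };
    for (int i = 0; i < 4; i++) {
        fesetround(modes[i]);
        volatile double q = x / y;      /* 1/3 rounded per the current mode */
        printf("%-12s: 1/3 = %.17g\n", names[i], q);
    }
    fesetround(FE_TONEAREST);           /* restore the default */
    return 0;
}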
Error Analysis
° Basic error formula
• fl(a op b) = (a op b)*(1 + δ) where
  - op is one of +, -, *, /
  - |δ| <= ε = machine epsilon = macheps
  - assuming no overflow, underflow, or divide by zero
° Example: adding 4 numbers
• fl(x1+x2+x3+x4) = {[(x1+x2)*(1+δ1) + x3]*(1+δ2) + x4}*(1+δ3)
  = x1*(1+δ1)*(1+δ2)*(1+δ3) + x2*(1+δ1)*(1+δ2)*(1+δ3)
    + x3*(1+δ2)*(1+δ3) + x4*(1+δ3)
  ≡ x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4)
  where each |ei| <~ 3*macheps
• get exact sum of slightly changed summands xi*(1+ei)
• Backward Error Analysis - an algorithm is called numerically stable if it gives the exact result for slightly changed inputs
  - Holds for many standard algorithms (eg linear algebra)
° Part of deciding whether to trust simulation results
Example: polynomial evaluation using Horner's rule
° Horner's rule to evaluate p = Σ_{k=0}^{n} c_k * x^k
• p = c_n; for k = n-1 downto 0, p = x*p + c_k
° Numerically Stable:
• Get p̂ = Σ_k ĉ_k * x^k where ĉ_k = c_k*(1+e_k) and |e_k| <= (n+1)*ε
° Apply to (x-2)^9 = x^9 - 18*x^8 + … - 512
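A C sketch of this example (the expanded coefficients of (x-2)^9, evaluated by Horner's rule; near the root x = 2 the cancellation noise of size ~macheps * Σ|c_k x^k| swamps the tiny true value):

#include <stdio.h>
#include <math.h>

/* Coefficients of (x-2)^9 expanded, constant term first:
   Σ_{k=0..9} C(9,k) * (-2)^(9-k) * x^k */
static const double c[10] = { -512, 2304, -4608, 5376, -4032,
                              2016, -672, 144, -18, 1 };

double horner(double x) {
    double p = c[9];               /* p = c_n                  */
    for (int k = 8; k >= 0; k--)
        p = x * p + c[k];          /* p = x*p + c_k            */
    return p;
}

int main(void) {
    for (int i = -10; i <= 10; i++) {
        double x = 2.0 + i * 0.001;
        /* Horner oscillates around 0; the factored form stays tiny */
        printf("x = %5.3f  Horner = % .3e  (x-2)^9 = % .3e\n",
               x, horner(x), pow(x - 2.0, 9.0));
    }
    return 0;
}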
Example: polynomial evaluation (continued)
° (x-2)^9 = x^9 - 18*x^8 + … - 512
° We can compute error bounds using
• fl(a op b) = (a op b)*(1+δ)
General approach to error analysis
° Suppose we want to evaluate x = f(z)
• Ex: z = (A,b), x = solution of linear system Ax=b
° Suppose we use a backward stable algorithm alg(z)
• Ex: Gaussian elimination with pivoting
° Then alg(z) = f(z + e) where backward error e is “small”
° Error bound from Taylor expansion (scalar case):
• alg(z) = f(z+e) ≈ f(z) + f'(z)*e
• Absolute error = |alg(z) – f(z)| ≈ |f'(z)| * |e|
• Relative error = |alg(z) – f(z)| / |f(z)| ≈ |f'(z)*z/f(z)| * |e/z|
• Condition number = |f'(z)*z/f(z)|
• Relative error (of output) = condition number * relative error of input |e/z|
° Applies to multivariate case too
• Ex: Gaussian elimination with pivoting for solving Ax=b
- Condition number = ||A||·||A^-1|| = 1/distance(A, {singular matrices})
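A small C illustration of the scalar bound, using the hypothetical example f(z) = 1/(1-z) (not from the slides; chosen so the condition number |z/(1-z)| blows up near z = 1):

#include <stdio.h>
#include <math.h>

double f(double z) { return 1.0 / (1.0 - z); }

int main(void) {
    double z = 0.999;                 /* nearly "singular": 1-z is small */
    double rel_in = 1e-10;            /* relative perturbation of input  */
    double zp = z * (1.0 + rel_in);
    double rel_out = fabs(f(zp) - f(z)) / fabs(f(z));
    double cond = fabs(z / (1.0 - z));   /* |f'(z)*z/f(z)| for this f */
    printf("condition number          = %.3e\n", cond);
    printf("relative error of input   = %.3e\n", rel_in);
    printf("predicted output error    = %.3e\n", cond * rel_in);
    printf("actual output error       = %.3e\n", rel_out);
    return 0;
}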
(Old) Cray Arithmetic – gone and unlamented
° Historically very important
• Crays among the fastest machines
• Other fast machines emulated it (Fujitsu, Hitachi, NEC)
° Sloppy rounding
• fl(a + b) is not necessarily (a + b)*(1+δ), but instead
  fl(a + b) = a*(1+δa) + b*(1+δb) where |δa|, |δb| <= macheps
• Means that fl(a+b) could be either 0 when it should be nonzero, or twice too large when a+b "cancels"
• Sloppy division too
° Some impacts:
• arccos(x/sqrt(x^2 + y^2)) can yield an exception, because x/sqrt(x^2 + y^2) > 1
  - Not with IEEE arithmetic
• Fastest (at one time) eigenvalue algorithm in LAPACK failed
  - Needs Π_k (a_k - b_k) accurately
  - Needed to preprocess by setting each a_k = 2*a_k - a_k (killed bottom bit on Cray)
° Crays now do IEEE arithmetic (use off-the-shelf CPUs)
° More cautionary tales:
• www.cs.berkeley.edu/~wkahan/ieee754status/why-ieee.pdf
What happens when the “exact value” is
not a real number, or is too small or too
large to represent accurately?
You get an “exception”
Exception Handling
° What happens when the “exact value” is not a real
number, or too small or too large to represent
accurately?
° 5 Exceptions:
• Overflow - exact result > OV, too large to represent
• Underflow - exact result nonzero and < UN, too small to represent
• Divide-by-zero - nonzero/0
• Invalid - 0/0, sqrt(-1), …
• Inexact - you made a rounding error (very common!)
° Possible responses
• Stop with error message (unfriendly, not default)
• Keep computing (default, but how?)
IEEE Floating Point Arithmetic Standard 754 - “Denorms”
° Denormalized (or Subnormal) Numbers: ±0.d…d x 2^min_exp
• sign bit, nonzero significand, minimum exponent
• Fills in gap between UN and 0
° Underflow Exception
• occurs when exact nonzero result is less than underflow threshold UN
• Ex: UN/3
• return a denorm, or zero
° Why bother?
• Necessary so that following code never divides by zero
if (a ≠ b) then x = a/(a-b)
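A C sketch of gradual underflow and the guarded division above (DBL_MIN is the normalized threshold UN; assumes flush-to-zero is off, which is the IEEE default):

#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void) {
    double un = DBL_MIN;        /* UN: smallest normalized double        */
    double d  = un / 3.0;       /* exact result < UN: returns a denorm   */
    printf("UN/3 = %g, subnormal? %d\n", d, fpclassify(d) == FP_SUBNORMAL);

    /* Gradual underflow guarantees a != b  =>  a - b != 0 */
    double a = un * 1.5, b = un;
    if (a != b) {
        double x = a / (a - b); /* a-b is a denorm, but nonzero: x = 3 */
        printf("a/(a-b) = %g\n", x);
    }
    return 0;
}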
IEEE Floating Point Arithmetic Standard 754 - ±Infinity
° ±Infinity: Sign bit, zero significand, maximum exponent
° Overflow Exception
• occurs when exact finite result too large to represent accurately
• Ex: 2*OV
• return infinity
° Divide by zero Exception
• return infinity = 1 / 0
• sign of zero important! Example later…
° Also return ±infinity for
• 3+infinity, 2*infinity, infinity*infinity
• Result is exact, not an exception!
IEEE Floating Point Arithmetic Standard 754 - NAN (Not A Number)
° NAN: Sign bit, nonzero significand, maximum exponent
° Invalid Exception
• occurs when exact result not a well-defined real number
• 0/0
• sqrt(-1)
• infinity-infinity, infinity/infinity, 0*infinity
• NAN + 3
• NAN > 3?
• Return a NAN in all these cases
° Two kinds of NANs
• Quiet - propagates without raising an exception
  - good for indicating missing data
  - Ex: max(3,NAN) = 3
  - Two kinds of max, min in the 2008 standard
• Signaling - generates an exception when touched
  - good for detecting uninitialized data
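A small C sketch of these NAN rules (C99's fmax matches the slide's max(3,NAN) = 3):

#include <stdio.h>
#include <math.h>

int main(void) {
    volatile double zero = 0.0;   /* volatile: keep 0/0 a runtime operation */
    double q = zero / zero;       /* invalid: 0/0 returns a quiet NaN       */
    printf("0/0 = %g\n", q);
    printf("NAN + 3 = %g\n", q + 3.0);        /* NaNs propagate             */
    printf("NAN > 3 ?  %d\n", q > 3.0);       /* comparisons with NaN: false */
    printf("NAN == NAN ?  %d\n", q == q);     /* even this one              */
    printf("fmax(3, NAN) = %g\n", fmax(3.0, q)); /* the NaN is ignored: 3   */
    return 0;
}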
Exception Handling User Interface
° Each of the 5 exceptions has the following features
• A sticky flag, which is set as soon as an exception occurs
• The sticky flag can be reset and read by the user
reset overflow_flag and invalid_flag
perform a computation
test overflow_flag and invalid_flag to see if any exception occurred
• An exception flag, which indicates whether a trap should occur
  - Not trapping is the default
  - Instead, continue computing, returning a NAN, infinity or denorm
  - On a trap, there should be a user-writable exception handler with access to the parameters of the exceptional operation
  - Trapping or "precise interrupts" like this are rarely implemented, for performance reasons
  - Alas, sometimes even "not trapping" is slow when inputs/outputs are exceptional (eg denorms)
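The reset/compute/test pattern above, in C99's fenv.h (a sketch; flags are tested, not trapped):

#include <stdio.h>
#include <fenv.h>
#pragma STDC FENV_ACCESS ON

int main(void) {
    /* reset overflow_flag and invalid_flag */
    feclearexcept(FE_OVERFLOW | FE_INVALID);

    /* perform a computation */
    volatile double big = 1e308;
    volatile double x = big * big;   /* overflows to +infinity  */
    volatile double y = x - x;       /* inf - inf: invalid, NaN */

    /* test the sticky flags to see if any exception occurred */
    if (fetestexcept(FE_OVERFLOW)) printf("overflow occurred (x = %g)\n", x);
    if (fetestexcept(FE_INVALID))  printf("invalid occurred  (y = %g)\n", y);
    return 0;
}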
Exploiting Exception Handling to Design Faster Algorithms
° Paradigm:
1) Try fast, but possibly “risky” algorithm
2) Quickly test for accuracy of answer (use exception handling)
3) In rare case of inaccuracy, rerun using slower “low risk” algorithm
° Quick with high probability
• Assumes exception handling done quickly! (Will manufacturers do this?)
° Ex: Solving triangular system Tx=b
• Part of BLAS2 - highly optimized, but risky
• If T “nearly singular”, as when computing condition numbers, expect
very large x, so scale inside inner loop: slow but low risk
• Use paradigm with sticky flags to detect nearly singular T
• Up to 9x faster on Dec Alpha
° Demmel/Li: “Faster Num. Algs. Via Exception Handling”
• crd.lbl.gov/~xiaoye
Finding Eigenvalues of tridiagonal T
° T symmetric, tridiagonal, n x n matrix
• T(i,i) = d(i), T(i,i+1) = T(i+1,i) = e(i)
• find eigenvalues z(1), … , z(n)
° Part of standard algorithm for eigenvalues of any real symmetric matrix A = A^T
° Count(s) = # eigenvalues ≤ s
° Start with interval [a,b] containing all z(i)
° Evaluate Count((a+b)/2)
° Repeatedly bisect nonempty intervals to find
narrower intervals containing eigenvalues
° Stop when intervals narrow enough
Implementing Count(s) = # eigenvalues(T) ≤ s

count = 0, e(0) = 0, x = 1
for i = 1 to n
  x = d(i) – s – e(i-1)^2/x
  if (x < 0) count = count + 1

° What could go wrong?
• Overflow
• What if x = -0?
• What if Count(s) is nonmonotonic?
Versions of Count(s) (IEEE TRs 2007-179, CSD-05-1414)

° Slow, careful version
  count = 0, x = 1
  scale d(i), e(i), s
  for i = 1 to n
    x = d(i) – s – e(i-1)^2/x
    if |x| < tol, x = -tol
    if (x < 0) count = count + 1

° Fast version (assuming fast, correct IEEE arithmetic)
  count = 0, x = 1
  scale d(i), e(i), s
  for i = 1 to n
    x = d(i) – s – e(i-1)^2/x      … depends on infinity arithmetic
    count = count + signbit(x)     … counts -0

° (ATI) GPU version (arithmetic could create -0 incorrectly)
  count = 0, x = 1
  scale d(i), e(i), s
  for i = 1 to n
    x = d(i) – s – e(i-1)^2/x + 0  … avoid -0
    if (x < 0) count = count + 1

° Thm: Count(s) monotonic if arithmetic monotonic
° Else need to do Count(s(1:n)) = parallel_prefix_max(Count(s(1:n)))
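A C sketch of the fast version (assumed 1-based data in d[1..n] and e[1..n-1] with e[0] = 0; scaling omitted; relies on IEEE infinity arithmetic and signbit from math.h):

#include <math.h>

/* Fast Count(s): # eigenvalues of T <= s. Assumes correct IEEE
   arithmetic: if x becomes +0, e^2/x = +inf and the next x = -inf,
   so no test against zero is needed; -0 must also be counted as
   negative, hence signbit() rather than (x < 0). */
int count_leq(const double *d, const double *e, int n, double s) {
    int count = 0;
    double x = 1.0;
    for (int i = 1; i <= n; i++) {
        x = d[i] - s - (e[i-1] * e[i-1]) / x;   /* e[0] must be 0 */
        count += signbit(x) ? 1 : 0;            /* counts -0 too  */
    }
    return count;
}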
Summary of Values Representable in IEEE Binary FP

Value                          Sign   Exponent field         Significand field
° Zero                         ±      0…0                    0…0
° Normalized nonzero numbers   ±      not all 0s or all 1s   anything
° Denormalized numbers         ±      0…0                    nonzero
° ±Infinity                    ±      1…1                    0…0
° NANs                         ±      1…1                    nonzero
  • Signaling and quiet
  • Many systems have only quiet
° Representation such that a < b is true if a and b are treated as (non-NAN) floats or as integers
Exploiting Mixed and Extra Precision
 On most platforms, single precision is faster than double

Platform                Size   SGEMM/DGEMM   Size   SGEMV/DGEMV
AMD Opteron 246         3000   2.00          5000   1.70
UltraSparc-IIe          3000   1.64          5000   1.66
Intel PIII Coppermine   3000   2.03          5000   2.09
PowerPC 970             3000   2.04          5000   1.44
Intel Woodcrest         3000   1.81          5000   2.18
Intel XEON              3000   2.04          5000   1.82
Intel Centrino Duo      3000   2.71          5000   2.21

 Why? Faster FP HW, less data motion, better locality
 Bigger speed gaps on GPUs (8X), Cell (10X)
 Still bigger speed gaps between double and extra precision
Data courtesy of Jack Dongarra
Exploiting Speed Gaps between Precisions
 Goal: Get answer good to double (or extra), but at
speed of single (or double)
 Variant: Get more accurate answer at same speed
 Approach
 Do most work in single (say) to get first approximation
 “Clean up” answer by fast iterative process
 Typically variant of Newton’s Method
 Example: Solve Ax=b
  Factor A = L*U                … in single (O(n^3) flops)
  Solve x = U^-1 * (L^-1 * b)   … in single (O(n^2) flops)
  r = A*x – b                   … in double (O(n^2) flops)
  While ||r|| not small enough
    Solve e = U^-1 * (L^-1 * r) … in single (O(n^2) flops)
    x = x – e                   … in double (O(n) flops)
    r = A*x – b                 … in double (O(n^2) flops)
 Converges if A not too “ill-conditioned”
 ||A||·||A^-1|| < 10^7 is OK; can fail if ||A||·||A^-1|| > 10^7
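A self-contained C sketch of this loop on a 3x3 system (illustrative only; the helper names sgetrf/sgetrs are modeled on, not calls to, the LAPACK routines a real code would use):

#include <stdio.h>
#include <math.h>

#define N 3

/* LU with partial pivoting, all in SINGLE precision; A overwritten by L,U */
static void sgetrf(float A[N][N], int piv[N]) {
    for (int k = 0; k < N; k++) {
        int p = k;
        for (int i = k + 1; i < N; i++)
            if (fabsf(A[i][k]) > fabsf(A[p][k])) p = i;
        piv[k] = p;
        if (p != k)
            for (int j = 0; j < N; j++) {
                float t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t;
            }
        for (int i = k + 1; i < N; i++) {
            A[i][k] /= A[k][k];
            for (int j = k + 1; j < N; j++)
                A[i][j] -= A[i][k] * A[k][j];
        }
    }
}

/* Solve L*U*x = b in SINGLE precision, applying the recorded pivots */
static void sgetrs(float A[N][N], const int piv[N],
                   const float b[N], float x[N]) {
    for (int i = 0; i < N; i++) x[i] = b[i];
    for (int k = 0; k < N; k++) {             /* pivots + L solve */
        float t = x[piv[k]]; x[piv[k]] = x[k]; x[k] = t;
        for (int i = k + 1; i < N; i++) x[i] -= A[i][k] * x[k];
    }
    for (int i = N - 1; i >= 0; i--) {        /* U solve */
        for (int j = i + 1; j < N; j++) x[i] -= A[i][j] * x[j];
        x[i] /= A[i][i];
    }
}

int main(void) {
    double A[N][N] = {{4, 1, 2}, {1, 5, 3}, {2, 3, 6}};
    double b[N] = {1, 2, 3}, x[N];
    float As[N][N], rs[N], es[N], xs[N];
    int piv[N];

    for (int i = 0; i < N; i++)               /* round A to single */
        for (int j = 0; j < N; j++) As[i][j] = (float)A[i][j];
    sgetrf(As, piv);                          /* O(n^3) in single  */
    for (int i = 0; i < N; i++) rs[i] = (float)b[i];
    sgetrs(As, piv, rs, xs);                  /* first solve in single */
    for (int i = 0; i < N; i++) x[i] = xs[i];

    for (int it = 0; it < 5; it++) {
        double rnorm = 0;
        for (int i = 0; i < N; i++) {         /* r = A*x - b in DOUBLE */
            double r = -b[i];
            for (int j = 0; j < N; j++) r += A[i][j] * x[j];
            rs[i] = (float)r;
            rnorm = fmax(rnorm, fabs(r));
        }
        printf("iter %d: ||r||_inf = %.3e\n", it, rnorm);
        if (rnorm < 1e-14) break;
        sgetrs(As, piv, rs, es);              /* correction in single  */
        for (int i = 0; i < N; i++) x[i] -= es[i]; /* x = x - e, double */
    }
    return 0;
}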
Results for Mixed Precision Iterative Refinement for Dense Ax = b
° Sequential architectures tested (BLAS):
• Intel Pentium III Coppermine (Goto)
• Intel Pentium III Katmai (Goto)
• Sun UltraSPARC IIe (Sunperf)
• Intel Pentium IV Prescott (Goto)
• Intel Pentium IV-M Northwood (Goto)
• AMD Opteron (Goto)
• Cray X1 (libsci)
• IBM Power PC G5 (2.7 GHz) (VecLib)
• Compaq Alpha EV6 (CXML)
• IBM SP Power3 (ESSL)
• SGI Octane (ATLAS)

° Parallel results (BLAS-MPI):

Architecture (BLAS-MPI)           # procs   n       DP Solve/SP Solve   DP Solve/Iter Ref   # iter
AMD Opteron (Goto – OpenMPI MX)   32        22627   1.85                1.79                6
AMD Opteron (Goto – OpenMPI MX)   64        32000   1.90                1.83                6
Slide courtesy of Jack Dongarra
Alternatively: Guaranteed accuracy for Ax=b
[Figure: backward error vs. norm = ||A|| · ||A^-1||, with ε = n^(1/2) * 2^-24. Conventional Gaussian elimination loses accuracy as the condition number approaches 1/ε, while extra-precise iterative refinement with the right stopping condition keeps the error near ε.]
Simulating extra precision
° What if 64 or 80 bits is not enough?
• Sometimes only known way to get right answer (Ax=b, mesh generation)
• Very large problems on very large machines may need more
• Sometimes you can trade communication for extra precision
° Can simulate high precision efficiently just using floating point
° Each extended precision number s is represented by an array (s1,s2,…,sn)
• each sk is a FP number
• s = s1 + s2 + … + sn in exact arithmetic
• s1 >> s2 >> … >> sn
° Ex: Computing (s1,s2) = a + b
  if |a| < |b|, swap them
  s1 = a + b         … roundoff may occur
  s2 = (a - s1) + b  … no roundoff!
• s1 contains the leading bits of a+b, s2 contains the trailing bits
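A C sketch of this two-sum building block (Dekker-style; assumes binary round-to-nearest, so don't compile with -ffast-math):

#include <stdio.h>
#include <math.h>

/* After the call, s1 + s2 == a + b exactly: s2 is the rounding
   error of the FP addition (requires |a| >= |b|, hence the swap). */
static void two_sum(double a, double b, double *s1, double *s2) {
    if (fabs(a) < fabs(b)) { double t = a; a = b; b = t; }
    *s1 = a + b;           /* roundoff may occur here          */
    *s2 = (a - *s1) + b;   /* ...but this recovers it exactly  */
}

int main(void) {
    double s1, s2;
    two_sum(1.0, 1e-20, &s1, &s2);
    /* 1e-20 is far below 1 ulp of 1.0, so s1 = 1 and s2 = 1e-20 */
    printf("s1 = %g, s2 = %g\n", s1, s2);
    return 0;
}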
° Systematic algorithms for higher precision
• Priest / Shewchuk (www.cs.berkeley.edu/~jrs)
• Bailey / Li / Hida (crd.lbl.gov/~dhbailey/mpdist/index.html)
• D. / Li et al (crd.lbl.gov/~xiaoye/XBLAS)
• D. / Hida / Riedy / Li (www.cs.berkeley.edu/~yozo)
Simulating extra precision – how little do you need?
 How much extra precision do we need to compute an arbitrary dot product accurately?
 For simplicity, consider S = x1 + … + xn
 Usual error bound says error in computed S is O(macheps)*(|x1| + … + |xn|)
 So if S << |x1| + … + |xn| (“massive cancellation”), don't expect (m)any correct leading bits
 Ex: 2^-1000 + 1 + (-1) yields 0 instead of 2^-1000 when summed left to right
   Unless you carry 1000 bits of precision
 Thm: Suppose each xi has an f-bit fraction, and register R has an F-bit fraction, with F > f. Then the following algorithm:
   Sort the xi's in decreasing order of exponent (or magnitude)
   R = 0; for i = 1 to n, R = R + xi
 gets the right answer (nearly f of R's bits are correct) as long as n ≤ N ≡ floor( 2^(F-f) / (1 - 2^-f) ) + 1 ≈ 2^(F-f) + 1.
 If n ≥ N+2, all accuracy may be lost (eg get 0 instead of ≠ 0)
 Ex: f = 53, F = 64 ⇒ n ≤ N = 2049 words OK, not 2051
 Lots of variants (EECS TR CSD-02-1180)
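The 2^-1000 example in C (a sketch; for this tiny case, sorting by decreasing magnitude alone already rescues the answer, even without the theorem's wider register F > f):

#include <stdio.h>
#include <math.h>

int main(void) {
    double tiny = ldexp(1.0, -1000);            /* 2^-1000 */

    /* Left to right: 2^-1000 is absorbed into 1, then cancelled away */
    double left_to_right = (tiny + 1.0) + (-1.0);

    /* Decreasing magnitude: 1, -1, 2^-1000; the big terms cancel
       first, so the tiny term survives */
    double sorted = (1.0 + (-1.0)) + tiny;

    printf("left to right: %g\n", left_to_right);  /* 0       */
    printf("sorted:        %g\n", sorted);         /* 2^-1000 */
    return 0;
}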
Hazards of Parallel and Heterogeneous Computing
° What new bugs arise in parallel floating point programs?
° Ex 1: Nonrepeatability
• Makes debugging hard!
° Ex 2: Different exception handling
• Can cause programs to hang
° Ex 3: Different rounding (even on IEEE FP machines)
• Can cause hanging, or wrong results with no warning
° See LAPACK Working Note 112
• www.netlib.org/lapack/lawns
Hazard #1: Nonrepeatability due to nonassociativity
° Consider s= all_reduce(x,”sum”) = x1 + x2 + … + xp
° Answer depends on order of FP evaluation
• All answers differ by at most p*macheps*(|x1| + … + |xp|)
• Some orders may overflow/underflow, others not!
° How can order of evaluation change?
• Change number of processors
• In reduction tree, have each node add first available child sum to its
own value
- order of evaluation depends on race condition, unpredictable!
• Repeatability recommended but not required by MPI 1.1 standard
° Options
• Live with it, since difference likely to be small
• Build slower version of all_reduce that guarantees evaluation order
independent of #processors, use for debugging
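A two-line C illustration of the nonassociativity behind all this (values chosen, not from the slides, so that rounding absorbs the small summand in one order but not the other):

#include <stdio.h>

int main(void) {
    /* Same three summands, two evaluation orders */
    double a = 1e16, b = -1e16, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c);  /* 1: cancellation first */
    printf("a + (b + c) = %g\n", a + (b + c));  /* 0: c rounds away in b+c */
    return 0;
}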
Hazard #2: Heterogeneity: Different Exception Defaults
° Not all processors implement denorms fast
• (old) DEC Alpha 21164 in “fast mode” flushes denorms to zero
  - in fast mode, a denorm operand causes a trap
  - slow mode, which gets underflow right, slows down all operations significantly, so it is rarely used
• SUN Ultrasparc in “fast mode” handles denorms correctly
  - handles underflow correctly at full speed
  - flushing denorms to zero would require trapping, slow
° Imagine a cluster built of (old) DEC Alphas and SUN Ultrasparcs
• Suppose the SUN sends a message to a DEC containing a denorm: the DEC will trap
• Avoiding trapping requires running either the DEC or the SUN in slow mode
• Good news: most machines are converging to fast and correct underflow handling, part of the C99 standard
Hazard #3: Heterogeneity: Data Dependent Branches
° Different “IEEE machines” may round differently
• Intel uses 80-bit FP registers for intermediate computations
• IBM RS6K and later platforms have FMA = Fused Multiply-Add instruction
  - d = a*b+c with one rounding error, i.e. a*b good to 104 bits
  - Added to 2008 standard
• SUN has neither “extra precise” feature
• Different compiler optimizations may round differently
° Impact: the same expression can yield different values on different machines

Compute s redundantly, or s = reduce_all(x, min)
if (s > 0) then
  compute and return a
else
  communicate
  compute and return b
end
° Taking different branches can yield nonsense, or deadlock
• How do we fix this example? Does it always work?
Class Projects Available (more later)
 Designing, building, evaluating FP debugging tools
 Identifying possible heterogeneity hazards
 Automatically modifying code to selectively raise
precision, to search for “unstable” sections
 If “standard precision” version of code gives much
different answers than “high precision” version, find
smallest section of code where raising precision most
affects answer
 See J. D. or Koushik Sen
 Extend extra precision Ax=b algorithms to
ScaLAPACK
Further References on Floating Point Arithmetic
° Notes for Prof. Kahan’s CS267 lecture from 1996
• www.cs.berkeley.edu/~wkahan/ieee754status/cs267fp.ps
° Prof. Kahan’s “Lecture Notes on IEEE 754”
• www.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps
° Prof. Kahan’s “The Baleful Effects of Computer Benchmarks on Applied Math, Physics and Chemistry”
• www.cs.berkeley.edu/~wkahan/ieee754status/baleful.ps
° Notes for Demmel’s CS267 lecture from 1995
• www.cs.berkeley.edu/~demmel/cs267/lecture21/lecture21.html