
Summarizing "What Every Computer Scientist Should Know About Floating Point Arithmetic"

Hi again! Have you ever used a floating point number in your code? They usually appear as float or double, but essentially they're a data type for representing real numbers (like 0.1 or 3.14159265359 or 123456789 * 10^(-23)). While they can represent decimals, they can also hold whole numbers like 1 or 12345678.

Regardless of which one you used, your code might be in trouble. When you write a number (like 1.5), your computer might not actually be using that number but instead something really close to it.

Now multiply your slightly-wrong number a few times, add it to a few other slightly-wrong numbers, and soon your math is chaos! Your computer isn't actually listening to you.
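Here's a tiny Python sketch of that chaos (the classic demo):

```python
# Add 0.1 ten times -- each step uses the *nearest representable* value,
# and the little errors pile up.
total = 0.0
for _ in range(10):
    total += 0.1

print(total)         # 0.9999999999999999
print(total == 1.0)  # False!
```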

How do we reduce errors with floating point operations?

Today I'll be summarizing "What Every Computer Scientist Should Know About Floating Point Arithmetic" by David Goldberg in 1991. Give it a read if you dare.... hehe

Ok that blog was very long.... so I'mma just cover some of the basic points of what I read. Drop a comment if you want me to try and explain a section from the blog that I didn't cover (I'll try lol).

Get ready to get smarter :)

Representing Floating Points

Floating points are represented with a numerical base $\beta$ (like decimal, binary, or hexadecimal), an exponent range $e_{min}$ to $e_{max}$, and a precision $p$ (the number of digits). They are written in scientific notation with a single non-zero digit before the decimal point.

For example, it would look something like this.

$d_0.d_1d_2\ldots d_{p-1} \times \beta^e$

The digits of the significand $d_i$ are all in the range $0 \leq d_i < \beta$, where there are $p$ digits ($0 \leq i < p$).

So if $\beta = 2$ and $p = 3$ and I wanted to represent 0.77, it would be something like this.

$1.10 \times 2^{-1}$

This would be equal to $0.110$ in binary, which is $1/2 + 1/4 = 0.75$. This is the closest we can get to 0.77 with 3 digits of precision in binary. As you can see, floating point representations like 0.75 are not exactly 0.77, butttt it is close enough.

If $\beta = 10$ and $p = 3$, then 2734 would be represented as

$2.73 \times 10^3$

If we keep saying floating points are "close enough" for our numbers and we start doing operations on them, eventually our representations will be far from the actual number.
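You can actually peek at the "close enough" value your machine stores. For example, here's the exact double closest to 0.1 in Python:

```python
from decimal import Decimal

# Decimal(float) shows the exact value of the stored binary double, digit for digit.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```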

Okay, let's measure how far we are from the actual number. AKA, what is the error?

Understanding the error

There are two kinds of floating point rounding errors that are commonly measured: ulps (units in the last place) and relative error.

ulps

Units in the last place measures how far off the representation is, counted in units of its last digit. To be exact, it can be calculated with this formula, where z is the actual number we are comparing against.

$\text{ulps} = |d.dd\ldots d - z/\beta^e| \cdot \beta^{p-1}$

If this looks confusing, let's just do an example and it'll be a lot easier. Let's say we have a number 314.09 and our z = 314.1592653589.

$314.09 = 3.1409 \times 10^2$

From this we know $\beta = 10$, $p = 5$, and $e = 2$.

$\text{ulps} = |3.1409 - 314.1592653589/10^2| \cdot 10^4$

$= |3.1409 - 3.141592653589| \cdot 10^4$

$= |-0.000692653589| \cdot 10^4$

$= 6.92653589$

So the representation is off by roughly 6.93 ulps.
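Here's that formula as a quick Python sketch (ulps_error is my own helper name, not from the paper):

```python
def ulps_error(d, z, beta, p, e):
    """Goldberg's formula: |d.dd...d - z / beta^e| * beta^(p-1)."""
    return abs(d - z / beta**e) * beta**(p - 1)

# 3.1409 * 10^2 approximating z = 314.1592653589 (beta=10, p=5, e=2)
print(ulps_error(3.1409, 314.1592653589, beta=10, p=5, e=2))
# ~6.9265
```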

Relative error

This error takes the absolute error and puts it in proportion to the magnitude of the real number.

Super simple. First take the absolute error (the difference between the actual and the representation).

$\text{absolute error} = |(d.dd\ldots d \times \beta^e) - z|$

Now divide that by the real number to take it into proportion.

$\text{relative error} = \text{absolute error}/z$

$\text{relative error} = |(d.dd\ldots d \times \beta^e) - z|/z$

The idea is that errors should be judged against the size of the number. If I wanted to buy a million dollar home (if I'm ever rich enough), I wouldn't really mind a $10 difference.
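As a sketch (again with a made-up helper name):

```python
def relative_error(d, z, beta, e):
    """Goldberg's relative error: |d.dd...d * beta^e - z| / z."""
    return abs(d * beta**e - z) / z

# Same numbers as the ulps example above:
print(relative_error(3.1409, 314.1592653589, beta=10, e=2))
# ~0.00022, i.e. about 0.022% off
```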

Converting 0.5 ulps to relative error

Let's say I have a number represented as $d.dd\ldots dd \times \beta^e$. If this number is off by 0.5 ulps, the error is bounded by $0.00\ldots00\beta' \times \beta^e$ where $\beta' = \beta/2$.

Convincing you about the 0.5 ulps absolute error

Let me convince you. Say we're in base 10 ($\beta = 10$) with $p = 3$ and the number 9.97. If the actual number was rounded to this representation, it must have been between 9.965 (inclusive) and 9.975 (exclusive).

The actual number is therefore within 0.005 of the representation. That error is exactly 0.5 ulps. It is also the same as $0.00\ldots00\beta'$ because $\beta' = \beta/2 = 5$ (that is, 0.005).

This might seem obvious in base 10, so let's try base 2 ($\beta = 2$). Say we have 1.11 (1.75 in decimal) with $\beta = 2$ and $p = 3$. The real number before rounding could have been anywhere from 1.110 (inclusive) up to 1.111 (exclusive). This range gives a worst-case error of 0.001 in binary (1.111 − 1.110), which is 0.5 ulps in decimal, and that matches $0.00\ldots00\beta'$ because $\beta' = \beta/2 = 1$.

Therefore, if a number is rounded to its nearest representation, the error is at most 0.5 ulps.

In other words, an absolute error of 0.5 ulps always equals $((\beta/2)\beta^{-p}) \times \beta^e$.

Onto the relative error

Now, what is 0.5 ulps in relative error? Remember that relative error was absolute error divided by the actual number.

We just said that the absolute error at 0.5 ulps is $((\beta/2)\beta^{-p}) \times \beta^e$. But the actual number could be anything.

Specifically, it lies between $1 \times \beta^e$ and $\beta \times \beta^e$. So if 0.5 ulps was 0.001 in binary ($1 \times 2^{-3} \times 2^0$), then $p = 3$ and $e = 0$. In that case the real number must have been between $1 \times 2^0$ and $2 \times 2^0$, which is 1 and 2 in decimal, or 1 and 10 in binary.

From our previous example, we can see that is true. 1.11 (1.75 decimal) was in fact between 1 and 10 in binary.

Cool, so we set bounds on what the actual number could have been, namely $1 \times \beta^e$ and $\beta \times \beta^e$.

This means we can set bounds on the relative error of 0.5 ulps. So let's divide the absolute error by the bounds of the real number.

Upper bound of relative error:

$\frac{((\beta/2)\beta^{-p}) \times \beta^e}{1 \times \beta^e}$

$= (\frac{\beta}{2})\beta^{-p}$

Lower bound of relative error:

$\frac{((\beta/2)\beta^{-p}) \times \beta^e}{\beta \times \beta^e}$

$= (\frac{1}{2})\beta^{-p}$

Therefore,

$(\frac{1}{2})\beta^{-p} < 0.5\ \text{ulps} \leq (\frac{\beta}{2})\beta^{-p}$

Machine epsilon

The upper bound on the relative error of 0.5 ulps is called machine epsilon. This is the largest relative error that correct rounding can produce for a given base and precision.

$\epsilon = (\frac{\beta}{2})\beta^{-p}$

Larger precision $p$, as expected, means a smaller machine epsilon and a smaller relative error. We also noticed that the relative error of 0.5 ulps is bounded between $(\frac{1}{2})\beta^{-p}$ and machine epsilon. These bounds differ by a factor of $\beta$, which we call the wobble.

Yeahhh... wobble baby wobble baby...

BTW, machine epsilon is such a cool name. I'm not judging if you name your dog or child machine epsilon.
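If you want to see this on real hardware doubles ($\beta = 2$, $p = 53$), here's a sketch. One caveat: Python's sys.float_info.epsilon is defined as the gap between 1.0 and the next float (one full ulp of 1.0), which is twice Goldberg's half-ulp bound:

```python
import sys

# Goldberg's machine epsilon for IEEE double: (beta/2) * beta^-p with beta=2, p=53
goldberg_eps = (2 / 2) * 2.0**-53
print(goldberg_eps)            # 1.1102230246251565e-16

# Python reports one full ulp of 1.0 (2^-52), twice the value above
print(sys.float_info.epsilon)  # 2.220446049250313e-16
```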

Relative errors with machine epsilon

Remember, machine epsilon is the upper bound on the relative error of a 0.5 ulps rounding error. So if we actually got a relative error much lower, we can express it as a fraction of machine epsilon like this:

$\text{rel. error} = k \cdot \epsilon$

Let's do an example. To represent 3.14159 with $\beta = 10$ and $p = 3$, I have to round to 3.14. This has an absolute error of 0.00159, or 0.159 ulps. For relative error, I compute $0.00159/3.14159$, which gives me roughly 0.0005.

Now, to find the ratio, we must find the machine epsilon:

$\epsilon = (\frac{\beta}{2})\beta^{-p} = 5 \times 10^{-3} = 0.005$

So... the ratio is:

$k = \text{rel. error}/\epsilon = 0.0005/0.005 = 0.1$

So we say the relative error is $0.1\epsilon$.
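Checking that with a few lines:

```python
z = 3.14159
approx = 3.14              # rounded with beta=10, p=3
eps = (10 / 2) * 10.0**-3  # machine epsilon = 0.005

rel = abs(approx - z) / z  # ~0.000506
print(rel / eps)           # ~0.10 -> the relative error is about 0.1 * epsilon
```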

The Wobble

Get ready for things to wobble. First, let me show you how ulps and relative error relate to each other.

Using 1.0 to represent 1.04 in decimal (with $p = 2$) has an error of 0.4 ulps and a relative error of 0.038. The machine epsilon here is 0.05, which makes the relative error $0.76\epsilon$.

Great! Hopefully this made sense so far.

Now, let's multiply our number by, say, 8. The actual number becomes 8.32 while the calculated number is 8.0. That's 3.2 ulps, 8 times larger than before! However, our relative error is still $0.32/8.32 = 0.038$, the same $0.76\epsilon$ as before.

Whoa! Our ulps increased, but our relative error was the same?

Yep. It turns out that for a fixed relative error, your ulps can wobble by a factor of $\beta$.

On the other hand, for a fixed ulps (like the 0.5 ulps we worked through earlier), the relative error has bounds that show it can also wobble by a factor of $\beta$.
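A quick numeric check of the example above (ulps is the same helper formula from earlier):

```python
def ulps(d, z, beta, p, e):
    return abs(d - z / beta**e) * beta**(p - 1)

beta, p = 10, 2
# 1.0 approximating 1.04 (e = 0):
print(ulps(1.0, 1.04, beta, p, 0), abs(1.0 - 1.04) / 1.04)
# ~0.4 ulps, relative error ~0.038

# Multiply both by 8 -> 8.0 approximating 8.32 (still e = 0):
print(ulps(8.0, 8.32, beta, p, 0), abs(8.0 - 8.32) / 8.32)
# ~3.2 ulps, relative error ~0.038 (unchanged!)
```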

So, the smaller the $\beta$, the smaller the wobble and the smaller the error bounds! Using binary can significantly reduce our error.

Contaminated digits

We now know that ulps and the relative error ratio $k$ can vary from each other by a factor of $\beta$, the wobble. Using either one, we can estimate the number of contaminated digits (the number of digits that are wrong compared to the correct representation of the number).

$\text{contaminated digits} \approx \log_{\beta}{n}$

Here $n$ is the number of ulps. $n$ can also be $k$, the ratio between the relative error and $\epsilon$. Either works because of the wobble factor.

So if I had the decimal number 3.10 with $p = 3$ trying to represent 3.1415, the error would be 4.15 ulps. The contaminated digits would be roughly $\log_{10}{4.15}$, which is about 0.618 digits.

LOL we can't have partial digits! We'll see that when pigs fly.

Eyeballing it, the representation is wrong in 1 digit (the last one), which is pretty close to what we got from our calculation.
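The arithmetic, in case you want to try it:

```python
import math

# 3.10 approximating 3.1415 with beta=10, p=3, e=0
n_ulps = abs(3.10 - 3.1415) * 10**(3 - 1)  # ~4.15 ulps
print(n_ulps)
print(math.log(n_ulps, 10))                # ~0.618 -> roughly "1 digit" contaminated
```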

Guard digits

Let's subtract two values with $\beta = 10$ and $p = 3$.

$x = 1.01 \times 10^{0}$

$y = 9.93 \times 10^{-1}$

$x - y = 1.01 - 0.99 = 0.02$

y becomes 0.99 and not 0.993 because with $p = 3$ we had to drop a digit when shifting y so both numbers share the same exponent $\beta^e$.

As you know, the actual answer is 0.017, but we got 0.02. So $2.00 \times 10^{-2}$ versus $1.70 \times 10^{-2}$: an error of 30 ulps!

The relative error from this kind of subtraction is bounded by $\beta - 1$. Let me show you why.

Let $x = 1.00\ldots00$ and $y = \rho.\rho\rho\ldots\rho\rho \times \beta^{-1}$, where $\rho = \beta - 1$ (e.g. $\beta = 10$, $\rho = 9$). Both x and y have $p$ digits.

If I subtract them, the exact answer is $1 \times \beta^{-p}$, but because we shift y to the right and lose a digit, we end up computing $1 \times \beta^{-p+1}$.

$\text{absolute err.} = |\beta^{-p} - \beta^{-p+1}| = |\beta^{-p}(1 - \beta)|$

$\text{relative err.} = \text{abs. error}/z = \frac{|\beta^{-p}(1 - \beta)|}{\beta^{-p}} = \beta - 1$

If $\beta = 2$, that makes our relative error 1. In terms of $\epsilon$, that means $1 = k\epsilon$, so $k = 1/\epsilon$. The contaminated digits would be $\log_2(1/\epsilon) = \log_2(2^{p}) = p$. If p digits are contaminated, all of them are contaminated.

Was there a way we could have avoided some of this error? Okay, fine, there is.... otherwise I wouldn't have written this section...

Just add a temporary extra digit. And let's call it a ... wait for it ... a guard digit. Surprise surprise.

$x = 1.01 \times 10^{0}$

$y = 9.93 \times 10^{-1}$

$x - y = 1.010 - 0.993 = 0.017$

Now we have no error at all. A bit better than before.
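You can mimic this with Python's decimal module. This is just a toy model of $\beta = 10$, $p = 3$ arithmetic (real hardware doesn't work this way), with the digit-chopping done by hand:

```python
from decimal import Decimal

x = Decimal("1.010")          # 1.01 * 10^0
y_shifted = Decimal("0.993")  # 9.93 * 10^-1, shifted to x's exponent

# Without a guard digit we can only keep 3 digits of the shifted y: 0.99
print(x - Decimal("0.99"))    # 0.020 -> 30 ulps off

# With one guard digit, y keeps its last digit through the shift:
print(x - y_shifted)          # 0.017 -> exact
```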

Turns out a single guard digit bounds the relative error of subtraction to $2\epsilon$. I'm lazy, but if someone asks in the comments, I'll figure out why and try to explain it (it's in the linked blog).

Benign and Catastrophic Cancellation

When we subtract two really close numbers, many of the digits cancel out and become 0. We call this cancellation. Cancellation can be either catastrophic or benign.

When we do subtraction, the far-right digits (the least significant digits) often already carry errors from rounding or prior operations, while the more accurate digits sit at the front (the most significant digits). When the accurate leading digits cancel out, we're left subtracting the less accurate trailing digits, producing an even more inaccurate value (like when you calculate the discriminant $b^2 - 4ac$).

The catastrophic cancellation just exposes the rounding errors from prior operations.

Benign cancellation happens when you subtract numbers that have no rounding errors.
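A classic demonstration (a standard example, not one from the paper): computing $1 - \cos(x)$ for tiny x. cos(x) carries a tiny rounding error, and the subtraction exposes it catastrophically. Rearranging the formula to the mathematically equivalent $2\sin^2(x/2)$ avoids the cancellation:

```python
import math

x = 1e-8
naive = 1.0 - math.cos(x)                # cos(x) rounds to exactly 1.0 -> we get 0.0
rearranged = 2.0 * math.sin(x / 2) ** 2  # ~5e-17, the correct answer

print(naive)       # 0.0 (all information cancelled away)
print(rearranged)  # ~5e-17
```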

IEEE Standard

So the IEEE standard is a set of rules that many systems follow to ensure consistency. There are two IEEE standards: IEEE 754 and IEEE 854. Both support a smaller and a larger floating point format, called single precision and double precision.

IEEE 754

The standard requires $\beta = 2$. It defines single precision (p = 24) and double precision (p = 53). It also specifies exactly how the bits should be laid out.

In fact, here is a cool table that shows how IEEE 754 sets all its floating point parameters.

| Parameter | Single | Double |
| --- | --- | --- |
| $p$ (precision, bits) | 24 | 53 |
| $e_{max}$ | +127 | +1023 |
| $e_{min}$ | −126 | −1022 |
| Exponent width (bits) | 8 | 11 |
| Format width (bits) | 32 | 64 |

Exponents are stored in a biased representation: the stored value is the true exponent plus a fixed constant (the bias), so negative exponents don't need their own sign bit. Sign/magnitude and two's complement are other approaches, but the IEEE binary standard uses neither.

In fact, here is exactly how the bits encode different kinds of values. Zero is represented with exponent $e_{min} - 1$ and a zero fraction. Infinity uses exponent $e_{max} + 1$ with the fraction zeroed out. NaN ("not a number", like when 0 is divided by 0, or $\infty - \infty$) is represented the same as infinity but with a nonzero fraction.

(The fraction is the part of the significand after the leading digit; the significand is the full digit string $d.dd\ldots d$.)

| Exponent | Fraction | Represents |
| --- | --- | --- |
| $e = e_{min} - 1$ | $f = 0$ | $\pm 0$ |
| $e = e_{min} - 1$ | $f \neq 0$ | denormal $0.f \times 2^{e_{min}}$ |
| $e_{min} \leq e \leq e_{max}$ | any | normal $1.f \times 2^{e}$ |
| $e = e_{max} + 1$ | $f = 0$ | $\pm \infty$ |
| $e = e_{max} + 1$ | $f \neq 0$ | NaN |
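You can poke at these special values from Python:

```python
import math

inf = float("inf")      # exponent e_max + 1, fraction = 0
nan = inf - inf         # inf - inf has no sensible answer -> NaN

print(inf, nan)         # inf nan
print(nan == nan)       # False -- NaN compares unequal to everything, even itself
print(math.isnan(nan))  # True
```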

IEEE 854

This standard, on the other hand, allows $\beta = 2$ or $\beta = 10$. However, there are no rules about how the bits should be laid out for single and double precision.

It allows base 10 because that's how humans count. Base 2 is included because of its low wobble.

The Conclusion

Okay, I kinda rushed the last section. But overall, I wanted to say floating points can have a lot of imprecision. If you can avoid them, use integers instead.

In the case that you do use them, try to limit your wobbles and avoid catastrophic cancellations (sometimes you can, by rearranging formulas).

Try to read the original blog yourself. What you're reading right now is a summary of just a fraction of David Goldberg's original.

But as always, drop your questions and comments. And I'm out for now...

Peace
-absterdabster

The end
