
Summarizing "What Every Computer Scientist Should Know About Floating Point Arithmetic"

Hi again! Have you ever used a floating point number in your code? They usually appear as float or double, but essentially they're a data type for representing real numbers (like 0.1 or 3.14159265359 or 123456789 * 10^(-23)). While they can represent decimals, they can also hold whole numbers like 1 or 12345678.

Regardless of which one you used, your code might be in trouble. When you write a number (like 1.5), your computer might not actually be using that number but instead something really close to it.

Now multiply your slightly-wrong number a few times, add it to a few other slightly-wrong numbers, and soon your math is chaos! Your computer isn't actually listening to you.
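Here's a tiny Python sketch of that chaos (the classic demo):

```python
# Add 0.1 ten times -- each step uses the *nearest representable* value,
# and the little errors pile up.
total = 0.0
for _ in range(10):
    total += 0.1

print(total)         # 0.9999999999999999
print(total == 1.0)  # False!
```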

How do we reduce errors with floating point operations?

Today I'll be summarizing "What Every Computer Scientist Should Know About Floating Point Arithmetic" by David Goldberg in 1991. Give it a read if you dare.... hehe

Ok that blog was very long.... so I'mma just cover some of the basic points of what I read. Drop a comment if you want me to try and explain a section from the blog that I didn't cover (I'll try lol).

Get ready to get smarter :)

Representing Floating Points

Floating points are represented with a numerical base $\beta$ (like decimal, binary, or hexadecimal), an exponent range $e_{min}$ to $e_{max}$, and a precision $p$ (the number of digits). They are written in scientific notation with a single non-zero digit before the decimal point.

For example, it would look something like this.

$d_0.d_1d_2\ldots d_{p-1} \times \beta^e$

The digits of the significand $d_i$ are all in the range $0 \leq d_i < \beta$, where there are $p$ digits ($0 \leq i < p$).

So if $\beta = 2$ and $p = 3$ and I wanted to represent 0.77, it would be something like this.

$1.10 \times 2^{-1}$

This would be equal to $0.110$ in binary, which is $1/2 + 1/4 = 0.75$. This is the closest we can get to 0.77 with 3 digits of precision in binary. As you can see, floating point representations like 0.75 are not exactly 0.77, butttt it is close enough.

If $\beta = 10$ and $p = 3$, then 2734 would be represented as

$2.73 \times 10^3$

If we keep saying floating points are "close enough" for our numbers and we start doing operations on them, eventually our representations will be far from the actual number.
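You can actually peek at the "close enough" value your machine stores. For example, here's the exact double closest to 0.1 in Python:

```python
from decimal import Decimal

# Decimal(float) shows the exact value of the stored binary double, digit for digit.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```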

Okay, let's measure how far we are from the actual number. AKA, what is the error?

Understanding the error

There are two kinds of floating point rounding errors that are commonly measured: ulps (units in the last place) and relative error.

ulps

Units in the last place measures how far off the representation is, counted in units of its last digit. To be exact, it can be calculated with this formula, where z is the actual number we are comparing against.

$\text{ulps} = |d.dd\ldots d - z/\beta^e| \cdot \beta^{p-1}$

If this looks confusing, let's just do an example and it'll be a lot easier. Let's say we have a number 314.09 and our z = 314.1592653589.

$314.09 = 3.1409 \times 10^2$

From this we know $\beta = 10$, $p = 5$, and $e = 2$.

$\text{ulps} = |3.1409 - 314.1592653589/10^2| \cdot 10^4$

$= |3.1409 - 3.141592653589| \cdot 10^4$

$= |-0.000692653589| \cdot 10^4$

$= 6.92653589$

So the representation is off by roughly 6.93 ulps.
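Here's that formula as a quick Python sketch (ulps_error is my own helper name, not from the paper):

```python
def ulps_error(d, z, beta, p, e):
    """Goldberg's formula: |d.dd...d - z / beta^e| * beta^(p-1)."""
    return abs(d - z / beta**e) * beta**(p - 1)

# 3.1409 * 10^2 approximating z = 314.1592653589 (beta=10, p=5, e=2)
print(ulps_error(3.1409, 314.1592653589, beta=10, p=5, e=2))
# ~6.9265
```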

Relative error

This error takes the absolute error and puts it in proportion to the magnitude of the real number.

Super simple. First take the absolute error (the difference between the actual and the representation).

$\text{absolute error} = |(d.dd\ldots d \times \beta^e) - z|$

Now divide that by the real number to take it into proportion.

$\text{relative error} = \text{absolute error}/z$

$\text{relative error} = |(d.dd\ldots d \times \beta^e) - z|/z$

The idea is that errors should be judged against the size of the number. If I wanted to buy a million dollar home (if I'm ever rich enough), I wouldn't really mind a $10 difference.
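As a sketch (again with a made-up helper name):

```python
def relative_error(d, z, beta, e):
    """Goldberg's relative error: |d.dd...d * beta^e - z| / z."""
    return abs(d * beta**e - z) / z

# Same numbers as the ulps example above:
print(relative_error(3.1409, 314.1592653589, beta=10, e=2))
# ~0.00022, i.e. about 0.022% off
```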

Converting 0.5 ulps to relative error

Let's say I have a number represented as $d.dd\ldots dd \times \beta^e$. If this number is off by 0.5 ulps, the error is bounded by $0.00\ldots00\beta' \times \beta^e$ where $\beta' = \beta/2$.

Convincing you about the 0.5 ulps absolute error

Let me convince you. Say we're in base 10 ($\beta = 10$) with $p = 3$ and the number 9.97. If the actual number was rounded to this representation, it must have been between 9.965 (inclusive) and 9.975 (exclusive).

The actual number is therefore within 0.005 of the representation. That error is exactly 0.5 ulps. It is also the same as $0.00\ldots00\beta'$ because $\beta' = \beta/2 = 5$ (that is, 0.005).

This might seem obvious in base 10, so let's try base 2 ($\beta = 2$). Say we have 1.11 (1.75 in decimal) with $\beta = 2$ and $p = 3$. The real number before rounding could have been anywhere from 1.110 (inclusive) up to 1.111 (exclusive). This range gives a worst-case error of 0.001 in binary (1.111 − 1.110), which is 0.5 ulps in decimal, and that matches $0.00\ldots00\beta'$ because $\beta' = \beta/2 = 1$.

Therefore, if a number is rounded to its nearest representation, the error is at most 0.5 ulps.

In other words, an absolute error of 0.5 ulps always equals $((\beta/2)\beta^{-p}) \times \beta^e$.

Onto the relative error

Now, what is 0.5 ulps in relative error? Remember that relative error was absolute error divided by the actual number.

We just said that the absolute error at 0.5 ulps is $((\beta/2)\beta^{-p}) \times \beta^e$. But the actual number could be anything.

Specifically, it lies between $1 \times \beta^e$ and $\beta \times \beta^e$. So if 0.5 ulps was 0.001 in binary ($1 \times 2^{-3} \times 2^0$), then $p = 3$ and $e = 0$. In that case the real number must have been between $1 \times 2^0$ and $2 \times 2^0$, which is 1 and 2 in decimal, or 1 and 10 in binary.

From our previous example, we can see that is true. 1.11 (1.75 decimal) was in fact between 1 and 10 in binary.

Cool, so we set bounds on what the actual number could have been, namely $1 \times \beta^e$ and $\beta \times \beta^e$.

This means we can set bounds on the relative error of 0.5 ulps. So let's divide the absolute error by the bounds of the real number.

Upper bound of relative error:

$\frac{((\beta/2)\beta^{-p}) \times \beta^e}{1 \times \beta^e}$

$= (\frac{\beta}{2})\beta^{-p}$

Lower bound of relative error:

$\frac{((\beta/2)\beta^{-p}) \times \beta^e}{\beta \times \beta^e}$

$= (\frac{1}{2})\beta^{-p}$

Therefore,

$(\frac{1}{2})\beta^{-p} < 0.5\ \text{ulps} \leq (\frac{\beta}{2})\beta^{-p}$

Machine epsilon

The upper bound on the relative error of 0.5 ulps is called machine epsilon. This is the largest relative error that correct rounding can produce for a given base and precision.

$\epsilon = (\frac{\beta}{2})\beta^{-p}$

Larger precision $p$, as expected, means a smaller machine epsilon and a smaller relative error. We also noticed that the relative error of 0.5 ulps is bounded between $(\frac{1}{2})\beta^{-p}$ and machine epsilon. These bounds differ by a factor of $\beta$, which we call the wobble.

Yeahhh... wobble baby wobble baby...

BTW, machine epsilon is such a cool name. I'm not judging if you name your dog or child machine epsilon.
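If you want to see this on real hardware doubles ($\beta = 2$, $p = 53$), here's a sketch. One caveat: Python's sys.float_info.epsilon is defined as the gap between 1.0 and the next float (one full ulp of 1.0), which is twice Goldberg's half-ulp bound:

```python
import sys

# Goldberg's machine epsilon for IEEE double: (beta/2) * beta^-p with beta=2, p=53
goldberg_eps = (2 / 2) * 2.0**-53
print(goldberg_eps)            # 1.1102230246251565e-16

# Python reports one full ulp of 1.0 (2^-52), twice the value above
print(sys.float_info.epsilon)  # 2.220446049250313e-16
```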

Relative errors with machine epsilon

Remember, machine epsilon is the upper bound on the relative error of a 0.5 ulps rounding error. So if we actually got a relative error much lower, we can express it as a fraction of machine epsilon like this:

$\text{rel. error} = k \cdot \epsilon$

Let's do an example. To represent 3.14159 with $\beta = 10$ and $p = 3$, I have to round to 3.14. This has an absolute error of 0.00159, or 0.159 ulps. For relative error, I compute $0.00159/3.14159$, which gives me roughly 0.0005.

Now, to find the ratio, we must find the machine epsilon:

$\epsilon = (\frac{\beta}{2})\beta^{-p} = 5 \times 10^{-3} = 0.005$

So... the ratio is:

$k = \text{rel. error}/\epsilon = 0.0005/0.005 = 0.1$

So we say the relative error is $0.1\epsilon$.
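Checking that with a few lines:

```python
z = 3.14159
approx = 3.14              # rounded with beta=10, p=3
eps = (10 / 2) * 10.0**-3  # machine epsilon = 0.005

rel = abs(approx - z) / z  # ~0.000506
print(rel / eps)           # ~0.10 -> the relative error is about 0.1 * epsilon
```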

The Wobble

Get ready for things to wobble. First, let me show you how ulps and relative error relate to each other.

Using 1.0 to represent 1.04 in decimal (with $p = 2$) has an error of 0.4 ulps and a relative error of 0.038. The machine epsilon here is 0.05, which makes the relative error $0.76\epsilon$.

Great! Hopefully this made sense so far.

Now, let's multiply our number by, say, 8. The actual number becomes 8.32 while the calculated number is 8.0. That's 3.2 ulps, 8 times larger than before! However, our relative error is still $0.32/8.32 = 0.038$, the same $0.76\epsilon$ as before.

Whoa! Our ulps increased, but our relative error was the same?

Yep. It turns out that for a fixed relative error, your ulps can wobble by a factor of $\beta$.

On the other hand, for a fixed ulps (like the 0.5 ulps we worked through earlier), the relative error has bounds that show it can also wobble by a factor of $\beta$.
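A quick numeric check of the example above (ulps is the same helper formula from earlier):

```python
def ulps(d, z, beta, p, e):
    return abs(d - z / beta**e) * beta**(p - 1)

beta, p = 10, 2
# 1.0 approximating 1.04 (e = 0):
print(ulps(1.0, 1.04, beta, p, 0), abs(1.0 - 1.04) / 1.04)
# ~0.4 ulps, relative error ~0.038

# Multiply both by 8 -> 8.0 approximating 8.32 (still e = 0):
print(ulps(8.0, 8.32, beta, p, 0), abs(8.0 - 8.32) / 8.32)
# ~3.2 ulps, relative error ~0.038 (unchanged!)
```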

So, the smaller the $\beta$, the smaller the wobble and the smaller the error bounds! Using binary can significantly reduce our error.

Contaminated digits

We now know that ulps and the relative error ratio $k$ can vary from each other by a factor of $\beta$, the wobble. Using either one, we can estimate the number of contaminated digits (the number of digits that are wrong compared to the correct representation of the number).

$\text{contaminated digits} \approx \log_{\beta}{n}$

Here $n$ is the number of ulps. $n$ can also be $k$, the ratio between the relative error and $\epsilon$. Either works because of the wobble factor.

So if I had the decimal number 3.10 with $p = 3$ trying to represent 3.1415, the error would be 4.15 ulps. The contaminated digits would be roughly $\log_{10}{4.15}$, which is about 0.618 digits.

LOL we can't have partial digits! We'll see that when pigs fly.

Eyeballing it, the representation is wrong in 1 digit (the last one), which is pretty close to what we got from our calculation.
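The arithmetic, in case you want to try it:

```python
import math

# 3.10 approximating 3.1415 with beta=10, p=3, e=0
n_ulps = abs(3.10 - 3.1415) * 10**(3 - 1)  # ~4.15 ulps
print(n_ulps)
print(math.log(n_ulps, 10))                # ~0.618 -> roughly "1 digit" contaminated
```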

Guard digits

Let's subtract two values with $\beta = 10$ and $p = 3$.

$x = 1.01 \times 10^{0}$

$y = 9.93 \times 10^{-1}$

$x - y = 1.01 - 0.99 = 0.02$

y becomes 0.99 and not 0.993 because with $p = 3$ we had to drop a digit when shifting y so both numbers share the same exponent $\beta^e$.

As you know, the actual answer is 0.017, but we got 0.02. So $2.00 \times 10^{-2}$ versus $1.70 \times 10^{-2}$: an error of 30 ulps!

The relative error from this kind of subtraction is bounded by $\beta - 1$. Let me show you why.

Let $x = 1.00\ldots00$ and $y = \rho.\rho\rho\ldots\rho\rho \times \beta^{-1}$, where $\rho = \beta - 1$ (e.g. $\beta = 10$, $\rho = 9$). Both x and y have $p$ digits.

If I subtract them, the exact answer is $1 \times \beta^{-p}$, but because we shift y to the right and lose a digit, we end up computing $1 \times \beta^{-p+1}$.

$\text{absolute err.} = |\beta^{-p} - \beta^{-p+1}| = |\beta^{-p}(1 - \beta)|$

$\text{relative err.} = \text{abs. error}/z = \frac{|\beta^{-p}(1 - \beta)|}{\beta^{-p}} = \beta - 1$

If $\beta = 2$, that makes our relative error 1. In terms of $\epsilon$, that means $1 = k\epsilon$, so $k = 1/\epsilon$. The contaminated digits would be $\log_2(1/\epsilon) = \log_2(2^{p}) = p$. If p digits are contaminated, all of them are contaminated.

Was there a way we could have avoided some of this error? Okay, fine, there is.... otherwise I wouldn't have written this section...

Just add a temporary extra digit. And let's call it a ... wait for it ... a guard digit. Surprise surprise.

$x = 1.01 \times 10^{0}$

$y = 9.93 \times 10^{-1}$

$x - y = 1.010 - 0.993 = 0.017$

Now we have no error at all. A bit better than before.
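You can mimic this with Python's decimal module. This is just a toy model of $\beta = 10$, $p = 3$ arithmetic (real hardware doesn't work this way), with the digit-chopping done by hand:

```python
from decimal import Decimal

x = Decimal("1.010")          # 1.01 * 10^0
y_shifted = Decimal("0.993")  # 9.93 * 10^-1, shifted to x's exponent

# Without a guard digit we can only keep 3 digits of the shifted y: 0.99
print(x - Decimal("0.99"))    # 0.020 -> 30 ulps off

# With one guard digit, y keeps its last digit through the shift:
print(x - y_shifted)          # 0.017 -> exact
```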

Turns out a single guard digit bounds the relative error of subtraction to $2\epsilon$. I'm lazy, but if someone asks in the comments, I'll figure out why and try to explain it (it's in the linked blog).

Benign and Catastrophic Cancellation

When we subtract two really close numbers, many of the digits cancel out and become 0. We call this cancellation. Cancellation can be either catastrophic or benign.

When we do subtraction, the far-right digits (the least significant digits) often already carry errors from rounding or prior operations, while the more accurate digits sit at the front (the most significant digits). When the accurate leading digits cancel out, we're left subtracting the less accurate trailing digits, producing an even more inaccurate value (like when you calculate the discriminant $b^2 - 4ac$).

The catastrophic cancellation just exposes the rounding errors from prior operations.

Benign cancellation happens when you subtract numbers that have no rounding errors.
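A classic demonstration (a standard example, not one from the paper): computing $1 - \cos(x)$ for tiny x. cos(x) carries a tiny rounding error, and the subtraction exposes it catastrophically. Rearranging the formula to the mathematically equivalent $2\sin^2(x/2)$ avoids the cancellation:

```python
import math

x = 1e-8
naive = 1.0 - math.cos(x)                # cos(x) rounds to exactly 1.0 -> we get 0.0
rearranged = 2.0 * math.sin(x / 2) ** 2  # ~5e-17, the correct answer

print(naive)       # 0.0 (all information cancelled away)
print(rearranged)  # ~5e-17
```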

IEEE Standard

So the IEEE standard is a set of rules that many systems follow to ensure consistency. There are two IEEE standards: IEEE 754 and IEEE 854. Both support a smaller and a larger floating point format, called single precision and double precision.

IEEE 754

The standard requires $\beta = 2$. It defines single precision (p = 24) and double precision (p = 53). It also specifies exactly how the bits should be laid out.

In fact, here is a cool table that shows how IEEE 754 sets all its floating point parameters.

| Parameter | Single | Double |
| --- | --- | --- |
| $p$ (precision, bits) | 24 | 53 |
| $e_{max}$ | +127 | +1023 |
| $e_{min}$ | −126 | −1022 |
| Exponent width (bits) | 8 | 11 |
| Format width (bits) | 32 | 64 |

Exponents are stored in a biased representation: the stored value is the true exponent plus a fixed constant (the bias), so negative exponents don't need their own sign bit. Sign/magnitude and two's complement are other approaches, but the IEEE binary standard uses neither.

In fact, here is exactly how the bits encode different kinds of values. Zero is represented with exponent $e_{min} - 1$ and a zero fraction. Infinity uses exponent $e_{max} + 1$ with the fraction zeroed out. NaN ("not a number", like when 0 is divided by 0, or $\infty - \infty$) is represented the same as infinity but with a nonzero fraction.

(The fraction is the part of the significand after the leading digit; the significand is the full digit string $d.dd\ldots d$.)

| Exponent | Fraction | Represents |
| --- | --- | --- |
| $e = e_{min} - 1$ | $f = 0$ | $\pm 0$ |
| $e = e_{min} - 1$ | $f \neq 0$ | denormal $0.f \times 2^{e_{min}}$ |
| $e_{min} \leq e \leq e_{max}$ | any | normal $1.f \times 2^{e}$ |
| $e = e_{max} + 1$ | $f = 0$ | $\pm \infty$ |
| $e = e_{max} + 1$ | $f \neq 0$ | NaN |
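You can poke at these special values from Python:

```python
import math

inf = float("inf")      # exponent e_max + 1, fraction = 0
nan = inf - inf         # inf - inf has no sensible answer -> NaN

print(inf, nan)         # inf nan
print(nan == nan)       # False -- NaN compares unequal to everything, even itself
print(math.isnan(nan))  # True
```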

IEEE 854

This standard, on the other hand, allows $\beta = 2$ or $\beta = 10$. However, there are no rules about how the bits should be laid out for single and double precision.

It allows base 10 because that's how humans count. Base 2 is included because of its low wobble.

The Conclusion

Okay, I kinda rushed the last section. But overall, I wanted to say floating points can have a lot of imprecision. If you can avoid them, use integers instead.

In the case that you do use them, try to limit your wobbles and avoid catastrophic cancellations (sometimes you can, by rearranging formulas).

Try to read the original blog yourself. What you're reading right now is a summary of just a fraction of David Goldberg's original.

But as always, drop your questions and comments. And I'm out for now...

Peace
-absterdabster

The end
