Víctor Oliva

Solving the "floating point precision" problem with... floats?

The problem

There's a known problem when working with numbers on computers:

0.1 + 0.2 != 0.3

It usually shows up when you least expect it, often as a result that breaks your beautiful UI with a big 0.30000000000000004 in the middle of it.

There are many solutions that deal with this issue, but I would like to take a step back, understand why it happens, and see whether it can be solved in a different way.

There's a common misconception that this problem is due to the fact that computers represent decimal numbers in a floating point format, and that it would be easily solved if computers used fixed point instead.

However, that's not exactly right. In fact, fixed point numbers usually have less precision than floating point numbers, so in most cases the error would be even bigger.

To understand this better, first we need to remember what decimals are.

Decimals in depth

In base 10, when you have a number, for instance 12.34, what it actually represents is:

 1    2   .   3     4
10^1 10^0 . 10^-1 10^-2

(1 * 10) + (2 * 1) + (3 / 10) + (4 / 100) =
   10    +    2    +   0.3    +    0.04

10 + 2 + 0.3 + 0.04 = 12.34

The decimal representation only means a sum of values, where every position is just a different factor of the base it's using, in this case base 10.

Starting from the decimal separator, going to the left the positions keep raising powers of 10: the first value is the units (10^0), the second value is the tens (10^1), the third value is the hundreds (10^2) and so on.

And from the decimal separator to the right side, it's the opposite: it divides them by each power of ten. So the first digit is multiplied by 1/10, the second digit is multiplied by 1/100, and so on.

This definition can be used to represent decimal numbers in any other base. As an example, let's do the same but in base 2 for the number 5.3125.

To get the fractional digits in any base, we can use a really simple algorithm: to get the next digit, multiply the fractional part by the base and take the integer part of the result. The algorithm ends when the fractional part reaches 0.

5.3125 = 5 + 0.3125;

2^-1 -> 0.3125 * 2 = 0.625 -> [0]
2^-2 -> 0.625 * 2  = 1.25  -> [1]
2^-3 -> 0.25 * 2   = 0.5   -> [0]
2^-4 -> 0.5 * 2    = 1.0   -> [1]
// We're done, no more decimals

2^2 2^1 2^0 . 2^-1 2^-2 2^-3 2^-4
 1   0   1  .  0    1    0    1

5.3125 -> 101.0101 = 4 + 1 + 1/4 + 1/16

5.3125 is represented as 101.0101 in binary, and the result is exact.

However, a problem arises if we try to represent 0.3 in binary, because we find ourselves in a loop when running the algorithm above:

2^-1 -> 0.3 * 2 = 0.6 -> [0]
2^-2 -> 0.6 * 2 = 1.2 -> [1]
2^-3 -> 0.2 * 2 = 0.4 -> [0]
2^-4 -> 0.4 * 2 = 0.8 -> [0]
2^-5 -> 0.8 * 2 = 1.6 -> [1]
// This is now repeating from 2^-2
2^-6 -> 0.6 * 2 = 1.2 -> [1]
2^-7 -> 0.2 * 2 = 0.4 -> [0]
...
// And it will keep going on forever, the value will never be 0.

0.3 = 1/4 + 1/32 + 1/64 + 1/512 + ...

2^0 . 2^-1 2^-2 2^-3 2^-4 2^-5 2^-6 2^-7 ...
0   .  0     1    0    0    1    1   0 ...

So 0.3 represented in base 2 is 0.01001100110011001... with 1001 recurring. And now we have a problem, because double precision floats have a mantissa of 52 bits, which is really long, but it can only fit the first 52 of those infinitely many digits.

Note how the number of decimal digits or the precision of floats doesn't play a role here. 5.3125 has more decimal digits than 0.3 in base 10, but 5.3125 can be represented exactly in base 2 (it only has 4 binary decimals) whereas 0.3 can't because it has infinite binary decimals.
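Here's a minimal sketch of that digit-by-digit algorithm in Rust (the helper name and the digit cap are my own, and it operates on f64, so for a value like 0.3 it extracts the digits of the nearest double):

// Multiply the fractional part by the base and peel off the integer part each
// round. `max_digits` caps the output: the expansion of 0.3 in base 2 never
// terminates, so we only print the first few digits.
fn fractional_digits(mut value: f64, base: u32, max_digits: usize) -> Vec<u32> {
    let mut digits = Vec::new();
    while value != 0.0 && digits.len() < max_digits {
        value *= base as f64;
        let digit = value.trunc() as u32;
        digits.push(digit);
        value -= digit as f64;
    }
    digits
}

fn main() {
    // 0.3125 terminates: [0, 1, 0, 1], i.e. 0.0101 in binary
    println!("{:?}", fractional_digits(0.3125, 2, 16));
    // 0.3 doesn't: the first 16 digits are 0100110011001100...
    println!("{:?}", fractional_digits(0.3, 2, 16));
}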

You might be asking "If 0.3 isn't representable as a binary number, how come computers can still print it after storing it in a variable?" As part of transforming a number into a string, they apply some rounding to a number of decimal places that depends on the implementation. So even though 0.3 becomes 0.2999999999999999888977697537... when represented as a double precision binary float, when it's printed out it's rounded to "0.3".

What happens when adding 0.1 + 0.2 is that we're adding two slightly larger numbers (because they can't be represented exactly either), and that results in a value slightly larger than 0.3 which doesn't get rounded to "0.3" but to 0.30000000000000004.
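A few lines of Rust show both effects, the rounding applied when printing and the slightly-too-large sum:

fn main() {
    let sum: f64 = 0.1 + 0.2;
    println!("{}", sum);         // 0.30000000000000004
    println!("{}", 0.3_f64);     // 0.3 (rounded for display)
    println!("{:.20}", 0.3_f64); // 0.29999999999999998890 (closer to what's actually stored)
    println!("{}", sum == 0.3);  // false
}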

Our base 10 system also has the same problem for a lot of other numbers. Let's say we want to transform 0.1 in base 7 to base 10. 0.1 in base 7 just means the result of the fraction 1/7. We can apply the same algorithm to get the decimal representation but in our base:

10^-1 -> 1/7 * 10 = 10/7 = 1 + 3/7 -> [1]
10^-2 -> 3/7 * 10 = 30/7 = 4 + 2/7 -> [4]
10^-3 -> 2/7 * 10 = 20/7 = 2 + 6/7 -> [2]
10^-4 -> 6/7 * 10 = 60/7 = 8 + 4/7 -> [8]
10^-5 -> 4/7 * 10 = 40/7 = 5 + 5/7 -> [5]
10^-6 -> 5/7 * 10 = 50/7 = 7 + 1/7 -> [7]
// This is now repeating from 10^-1
10^-7 -> 1/7 * 10 = 10/7 = 1 + 3/7 -> [1]
10^-8 -> 3/7 * 10 = 30/7 = 4 + 2/7 -> [4]
...
// And it will keep going on forever, the value will never be 0.

1/7 = 1/10 + 4/100 + 2/1000 + 8/10000 + ...

10^0 . 10^-1 10^-2 10^-3 10^-4 10^-5 10^-6 10^-7 ...
0    .   1     4     2     8     5     7     1 ...

So in a parallel universe where we used base 7, we would face the same issue the other way around: base 7's '0.1' can't be represented exactly in base 10, because it's 1/7 = 0.142857142857142857...

This raises a question: what makes a number have a finite amount of decimals in a given base? The answer: any rational number whose denominator, in lowest terms, is composed only of prime factors of the base.

As an example, in base 10, all numbers whose denominator is composed only of 2s and 5s (the prime factors of 10) are represented exactly: 4/10 = 0.4, 1/5 = 0.2, 1/25 = 0.04 and so on. But any number with another prime factor in the denominator will have infinite decimals: 4/30 = 0.1333..., 8/18 = 0.444...

For this reason, 1/10, 1/5 and 3/10 have recurring decimals in base 2: they all have 5 as a factor in the denominator. 5.3125, on the other hand, doesn't have any recurring digits, because it's 85/16, and 16 is a power of 2.

So the problem comes exclusively from this fact. When working in binary, the only fractions that can be represented exactly are the ones whose denominator is a power of 2. In base 10 we can also represent fractions whose denominator contains factors of 5, and those just can't be transformed to binary exactly. When you have recurring digits, at some point you need to "cut", and all of that precision gets lost.
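Here's a small sketch of that rule as code (the helper is my own, and it assumes the fraction is already in lowest terms): keep dividing the denominator by whatever factors it shares with the base; if anything other than 1 remains, the expansion recurs.

fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

// True if a fraction with this denominator (in lowest terms) has a finite
// expansion in `base`, i.e. the denominator only contains prime factors of the base.
fn terminates_in_base(mut denominator: u64, base: u64) -> bool {
    loop {
        let g = gcd(denominator, base);
        if g == 1 { break; }
        denominator /= g;
    }
    denominator == 1
}

fn main() {
    println!("{}", terminates_in_base(16, 2)); // true:  85/16 = 5.3125 is exact in binary
    println!("{}", terminates_in_base(10, 2)); // false: 3/10 = 0.3 recurs in binary
    println!("{}", terminates_in_base(7, 10)); // false: 1/7 recurs in base 10
}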

But this is completely independent of floating point vs fixed point. Floating point only means that the decimal point can be moved left or right, keeping the original value by multiplying by a power of the base stored in an exponent. In essence, floating point numbers store the number in scientific notation.

Fixed point, on the other hand, reserves some space for the integer part and some space for the fractional part, but that doesn't solve this problem.

Let's work in base 10 to make this more familiar. And let's imagine our variable only has space for 8 characters.

With 8 characters, I could define a floating point format where I have 1 character to store the exponent and 7 to store the significand. With this I can represent the following numbers:

// Floating point with 7 significant digits and 1 exponent digit

1/3
= 0.3333333
= 3.333333 * 10^-1
Stored as [3,3,3,3,3,3,3,-1]

1/7
= 0.1428571
= 1.428571 * 10^-1
Stored as [1,4,2,8,5,7,1,-1]

On the other hand, a usual practice when working with fixed point is to use half the digits for the integer part and half for the fractional part. In this case:

// Fixed point with 4 integer digits and 4 decimal digits
1/3

= 0.3333333
Stored as [0,0,0,0,3,3,3,3]

1/7
= 0.1428571
Stored as [0,0,0,0,1,4,2,8]

As you can see, by moving from floating point to fixed point, we lost 3 digits of precision.
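To make the comparison concrete, here's a rough sketch of those two 8-digit formats (hypothetical helpers that truncate, like the stored digits above, and assume a positive input):

// Fixed point with 4 integer and 4 fractional base-10 digits:
// keep only 4 digits after the decimal point.
fn to_fixed_4_4(x: f64) -> f64 {
    (x * 10_000.0).trunc() / 10_000.0
}

// Floating point with a 7-digit significand: keep 7 significant decimal
// digits, wherever the decimal point happens to be.
fn to_float_7(x: f64) -> f64 {
    let exponent = x.abs().log10().floor() as i32;
    let scale = 10f64.powi(6 - exponent);
    (x * scale).trunc() / scale
}

fn main() {
    println!("{}", to_fixed_4_4(1.0 / 3.0)); // 0.3333
    println!("{}", to_float_7(1.0 / 3.0));   // 0.3333333
    println!("{}", to_fixed_4_4(1.0 / 7.0)); // 0.1428
    println!("{}", to_float_7(1.0 / 7.0));   // 0.1428571
}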

I know, you could reserve more space for the fractional part in the fixed point format, but then you lose range on the integer part, whereas the floating point format above can go all the way from 10^-4 up to 10^5.

Each representation has pros and cons, but floating point is the one most commonly used, and both of them share the same original problem.

Common solutions

Multiplying by 100

A common way of fixing this is to avoid decimals altogether, usually by multiplying these numbers by powers of 10. For instance, applications that work with numbers that are usually divided into hundredths (e.g. currencies divided into cents) store numeric values as a "number of cents", essentially multiplying every value by 100. This is actually fixed point arithmetic applied before converting to binary, reserving 2 base-10 digits for the fractional part.
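For instance, a minimal sketch of the cents approach (the variable names are just for illustration):

fn main() {
    // Prices stored as integer cents (value * 100), so addition is exact
    let price_a: i64 = 10; // 0.10
    let price_b: i64 = 20; // 0.20
    let total = price_a + price_b;
    println!("{}.{:02}", total / 100, total % 100); // prints 0.30, exactly
}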

However, when doing this you lose flexibility. It works really well if all you have to do is add numbers and you know all of them will have a specific number of decimal digits in base 10.

But this won't work if you deal with numbers with an arbitrary amount of decimals, or if you have to divide, where you can very easily lose precision. And it also makes multiplication trickier, since you have to shift the value to get the correct result (often resulting in more loss of precision).

Increasing amount of bits + rounding

Another common way of solving this is by using a really large number of decimal places, coupled with some rounding, to reduce the chance of this error showing up.

This is used by many libraries such as BigDecimal, BigNumber, etc. and most of the time it works. However, it's still not guaranteed, as most of these libraries surface the same recurring decimal issue if you invert a number twice.

And sometimes you need to apply some custom rounding, which can produce wrong values (e.g. doing a Rounding.CEIL to 2 decimal places on a number that became 0.30000...00001 will give you 0.31).
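For example, a tiny sketch of that rounding trap (a hypothetical ceil-to-2-decimals helper):

fn ceil_2_decimals(x: f64) -> f64 {
    (x * 100.0).ceil() / 100.0
}

fn main() {
    // 0.1 + 0.2 is 0.30000000000000004, so rounding up to 2 decimals
    // pushes it a whole cent too high
    println!("{}", ceil_2_decimals(0.1 + 0.2)); // 0.31
}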

Lastly, performance is often impacted too, because these libraries internally use BigInts, which become slow to operate on as the number keeps growing larger.

Rational numbers

Another alternative is to use a library that represents numbers as rational numbers. These store two integer values a and b representing the fraction a/b, and then apply all operations using the basic arithmetic rules for rational numbers.

However, these have the problem that as the numbers get more and more complex, both integers a and b keep growing and growing, which can easily cause an overflow. Some libraries mitigate this by using BigInt, but then computations become expensive again (and they have to calculate a GCD on every operation to keep the values as small as possible).

Looking at alternatives

Is there any way we can avoid the original problem without using BigInts, numbers that can grow indefinitely, or rounding?

Let's focus on what makes a number representable in base 2 without recurring decimals. Let's call these safe numbers for short.

Edit: @programmerjake pointed out that the proper mathematical name for these numbers is dyadic rationals: rational numbers whose denominator is a power of two, which is exactly what's needed for a number to have no recurring digits when represented in base 2.

If all we have are safe numbers, we can operate with them and we will get back another safe number:

101.0101 -> 5.3125
+ 1.1011 -> 1.6875
111.0000 -> 7

5.3125 + 1.6875 = 7

101.0101 -> 5.3125
- 1.1011 -> 1.6875
 11.1010 -> 3.625

5.3125 - 1.6875 = 3.625


     101.0101 -> 5.3125
     * 1.1011 -> 1.6875
     101 0101
    1010 101
   00000 00
  101010 1
 1010101
10001111 0111
1000.11110111 -> 8.96484375

5.3125 * 1.6875 = 8.96484375

All of the results are exact, and this makes sense: when adding or subtracting we're just going "column by column", which can never introduce recurring digits. And multiplication is just shifts and additions, so it won't create recurring digits either.
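These same operations also stay exact in a plain f64, since every operand and every result above fits comfortably in the 52-bit mantissa:

fn main() {
    // All operands and results are "safe", so the arithmetic is exact
    assert_eq!(5.3125 + 1.6875, 7.0);
    assert_eq!(5.3125 - 1.6875, 3.625);
    assert_eq!(5.3125 * 1.6875, 8.96484375);
    println!("all exact");
}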

The only basic arithmetic operation that can give recurring decimals is division. 1 and 3 are both safe numbers, but if you divide 1/3 it gives back 0.333... which has recurring decimals.

The other big limitation of only using safe numbers is that they can't represent all the numbers that rationals can: 1/3, for example, has no finite safe representation.

Looks like safe numbers are the key to the solution, but is there a way we can solve these two limitations?

What if...

What if we delay the execution of divisions as much as possible, while still using "safe numbers"? Essentially it's like using rational numbers, but where both the numerator and denominator can be decimals too. All we need to do is store two safe numbers a and b which together represent the division a/b.

This actually solves both limitations safe numbers have on their own. Division not being safe is no longer a problem because we never need to divide, and we can represent every rational number because a rational is itself a division of two integers, and integers are safe numbers.

For example, 3.35 is not representable in binary in exact form. But we can transform this number to a pair a / b = 5.234375 / 1.5625 = 3.35 where both a and b are safe numbers. They don't have any recurring decimal when represented in base 2: 5.234375 -> 101.001111 and 1.5625 -> 1.1001.

The main advantage over rational numbers is that we can keep both a and b relatively small. Any loss of precision happens only when all available significant digits have been consumed, and that loss is in the least significant digits, due to the nature of floats. Rational numbers stored as integers, on the other hand, can overflow, which results in a completely different number than the one intended.

And because we're in the land of "safe numbers", we can add, subtract and multiply using basic fraction arithmetic while staying in this land. And for the only operation we couldn't do, division, we can now simply invert a number (swap a and b) and multiply by it.
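To make the idea concrete, here's a hedged sketch (not the actual SafeDecimal API, just the shape of the idea): two f64 fields holding safe numbers, fraction arithmetic for the operations, and a real division only when reading the value out.

// Sketch only: a pair of "safe" f64 values representing num/den.
#[derive(Clone, Copy, Debug)]
struct Frac {
    num: f64,
    den: f64,
}

impl Frac {
    fn new(num: f64, den: f64) -> Self {
        Frac { num, den }
    }
    fn add(self, other: Frac) -> Frac {
        // a/b + c/d = (a*d + c*b) / (b*d): only additions and multiplications
        Frac::new(self.num * other.den + other.num * self.den, self.den * other.den)
    }
    fn mul(self, other: Frac) -> Frac {
        Frac::new(self.num * other.num, self.den * other.den)
    }
    // Inverting is just swapping the two fields; dividing is inverting + multiplying
    fn inv(self) -> Frac {
        Frac::new(self.den, self.num)
    }
    fn to_f64(self) -> f64 {
        // The only real division happens here, when reading the value out
        self.num / self.den
    }
}

fn main() {
    let rate = Frac::new(3.0, 5.0).inv();             // 0.6 stored as 3/5, then inverted
    let value = Frac::new(9.0, 1.0).mul(rate.inv());  // dividing 9 by the inverted rate
    println!("{}", value.to_f64());                   // 5.4
}

Subtraction works the same way as addition, and every intermediate product stays exact as long as it fits in the mantissa.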

Comparison

I've implemented this in Rust and TypeScript; you can find the repo here. I've called it SafeDecimal. I won't go into detail on the implementation, but let's compare it with the other usual approaches.

Comparing with BigDecimal

BigDecimal is a Rust library for arbitrary-precision decimal numbers. Internally it uses a BigInt to store as many decimals as needed.

The first advantage SafeDecimal has against this kind of library comes from a bug I faced when I divided a number (a currency amount) by a factor (an exchange rate) that had been inverted in another place in the code.

Let's imagine 0.6 as a rate, and 9 as a currency value:

let rate = SafeDecimal::from(0.6).inv().unwrap();
let value = SafeDecimal::from(9.0) / rate;
println!("SafeDecimal {}", value.to_string());
// Prints SafeDecimal 5.4

let rate = BigDecimal::from_str("0.6").unwrap().inverse();
let value = BigDecimal::from_str("9").unwrap() / rate;
println!("BigDecimal {}", value.to_string());
// Prints BigDecimal 5.3999999999999999999999999999999999999

This issue (well, another one in a widely used JS library) is what brought me to dig deeper into these "arbitrary precision" libraries and understand why this happens. I thought they were using rational numbers, but that's not the case. And this is what led me to think about how to solve this problem.

Next, on performance, SafeDecimal also has an edge. For multiplication of small numbers it's already ahead by an order of magnitude. In pseudocode, doing the following:

let value = 0.1;
let multiplier = 1.001;

for _ in 0..1_000_000 {
   value * multiplier;
}

SafeDecimal takes ~40ms whereas BigDecimal takes ~700ms

But if we let the number grow, by multiplying itself on every iteration

let mut value = 0.1;
let multiplier = 1.001;

for _ in 0..10_000 { // I had to lower the iteration count or it took too long
   value = value * multiplier;
}

SafeDecimal takes ~4ms whereas BigDecimal takes ~609ms, now two orders of magnitude. And it makes sense because BigDecimal is multiplying one BigInt with another that is exponentially getting larger and larger.

And the result of this operation is not a massive number, but it does have A LOT of decimals. After these 10,000 operations, the output from both is:

SafeDecimal: 2191.668133907863
 BigDecimal: 2191.66813390784270437847386791744251604426416... // 29,960 more decimals

This shows the trade-off for SafeDecimal: It doesn't have arbitrary precision. In this example it has a precision error at the 11th decimal.

And lastly addition shows only a small advantage:

let mut value = 0.1;
let increment = 1.001;

for _ in 0..1_000_000 {
   value = value + increment;
}

SafeDecimal takes ~40ms whereas BigDecimal takes ~64ms

Comparing with Rational

The other approach uses rationals instead. For this example, I'm using num_rational.

The advantage of using floats is that they don't overflow the way integers do. The following loop panics with an overflow when using Rational64 (which uses two i64 values) after just 5 iterations:

let mut value = 0.1;
let multiplier = 1.001;

for _ in 0..10 {
   value = value * multiplier;
}

When running in release mode it doesn't panic, but the overflow leaves the result completely unusable. If we print out intermediate values:

0. Ratio { numer: 1001, denom: 10000 } = 0.1001
1. Ratio { numer: 1002001, denom: 10000000 } = 0.1002001
2. Ratio { numer: 1003003001, denom: 10000000000 } = 0.1003003001
3. Ratio { numer: 1004006004001, denom: 10000000000000 } = 0.1004006004001
4. Ratio { numer: 1005010010005001, denom: 10000000000000000 } = 0.1005010010005001
5. Ratio { numer: -1006015020015006001, denom: 8446744073709551616 } = -0.11910092353173368 // Overflow!

So we can't benchmark it with growing numbers because they quickly overflow. However, if we repeat the same multiplication 1,000,000 times, SafeDecimal takes ~70ms and Rational takes ~40ms, so in this case Rational with plain integers is faster.

Closing remarks

This has been really interesting to explore, and I think the results look promising. It feels like a really good option for applications where you don't need arbitrary precision but want to avoid running into the recurring decimal bug, while also having good performance.

The implementation I've made is not production-ready. I'd like to add more (and better) tests, and there are a few edge cases whose behaviour I'd like to define (dealing with special float values such as NaN and Infinity), but feel free to have a look around; I'm also open to feedback.

I'd also like to know if this has already been explored by somebody else - I tried searching for references on this but couldn't find any.

Acknowledgments

  • @programmerjake: Corrections, Dyadic rationals
  • @josepot: Corrections
  • @bdellegrazie: Corrections
  • @riko: Corrections
  • @bruceharris: Corrections

Top comments (9)

Vladimir Shardinov

Good article about 0.1 + 0.2 != 0.3 and great idea!

Just correct the typos in the article:

  1. Instead of "for _ in 0..100_000" - should be like in the tests of your repo: "for _ in 0..10_000"
  2. Instead of "+ 1.1011 -> 1.6875" in the 2nd case (for subtraction) - of course it should be "- 1.1011 -> 1.6875".

Also, I would add a comparison with f64 in the multiplication test:
for _ in 0..10_000 { value = value * multiplier; }
f64: 2191.6681339054253
- the error for f64 is already in the 9th character instead of 11 for SafeDecimal!

Víctor Oliva

Yep, I will update, thank you very much!

Avenir

Hello, the code

0.3 = 1/2 + 1/32 + 1/64 + 1/512 + ...
2^0 . 2^-1 2^-2 2^-3 2^-4 2^-5 2^-6 2^-7 ...
0   .  0     1    0    0    1    1   0 ...

maybe it should be 0.3 = 1/4 + 1/32 + 1/64 + 1/512 + ...

Víctor Oliva

Woops, you are right! I will correct, thank you

Simon Farnsworth

Have you ever encountered the mathematical topic of "Numerical Analysis" before? It's the study of algorithms that use approximations instead of perfectly exact symbolic manipulation, and it has a wealth of ideas that help with floating point.

The core of floating point is that we're approximating the set of all real numbers (an infinite set) with a finite set of floating point numbers. Because we're working with approximations, all of our algorithms should be amenable to numerical analysis, which allows you to bound the final error to a known limit.

For example, using floating point (or another approximation's) native addition operation on a list of numbers will result in an error in the final sum that's proportional to the length of the list - so a + b has an error bound that's half that of a + b + c + d. Kahan summation is an alternative algorithm for summing a list of approximations, and its error bound is the square root of the length of the list - so a + b now has an error bound that's half that of a + b + c + d + e + f + g + h.
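For reference, a minimal sketch of Kahan summation:

// Keep a running compensation for the low-order bits lost in each addition.
fn kahan_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0;
    let mut compensation = 0.0;
    for &x in values {
        let y = x - compensation;
        let t = sum + y;
        compensation = (t - sum) - y;
        sum = t;
    }
    sum
}

fn main() {
    let values = vec![0.1_f64; 10];
    println!("{}", values.iter().sum::<f64>()); // 0.9999999999999999
    println!("{}", kahan_sum(&values));         // 1
}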

The whole field of numerical analysis (which I picked up from a book called "Numerical Analysis in FORTRAN", which tells you how long ago it was first interesting to me!) is fascinating if you're concerned by the errors introduced by floating point numbers - because the mathematics we needed to handle floating point without error goes back to the Babylonians. It can tell you what the bounds on error are for any calculation you want to perform with floats - or tell you that the bounds are wide open, and the calculation is meaningless.

And once you know what the error bounds are, you can determine if they're acceptable in this case; you might be happy if the analysis tells you that your location measurement to 1 meter accuracy, combined with a speed measurement to 1 kph accuracy and braking force accurate to 1 newton, results in an estimated error in your calculated stopping point of 2 meters. You'd be less happy if your financial algorithm had an estimated error in predicted future value of 150% of the investment value, but in some cases that's usable (in others, it's not). But if the error bounds on the stopping point are 1,000 meters, or the predicted future value is 10,000,000% of the investment value, the algorithm is useless - and numerical analysis tells you if that's the case, even if it gives good answers for some inputs.

Mike Talbot ⭐

Wow, now that is very very interesting.

Mike Bush

I enjoyed the article, and particularly the insight that the source of the problem is that fractions can be represented finitely in a base-2 format only if their denominator is a power of 2, whereas in base 10 they can be represented finitely if the denominator only has factors of 2 and 5.

Then you didn't make the leap that I thought you were going to, that all that's needed to make an arbitrary, finite base10 number perfectly representable in base2 is to multiply it by 5. If the denominator is always 5 then there's no need to store the denominator.

Have you given this a try? It seems like you could get some performance improvements for your library if you really focus on it being just for the base10 usecase and don't try to represent other bases perfectly too.

Víctor Oliva

More or less:

If the denominator is always 5 then there's no need to store the denominator.

The problem is that the denominator isn't always 5: 0.1 = 1/(2*5), which if you multiply by 5 becomes 1/2 = 0.5. But 0.01 = 1/(2*2*5*5), so if you multiply by 5 you will still have another 5 remaining in the denominator, 1/(2*2*5) = 0.05, which is not safely representable in binary. So you still need to remember how many fives you have multiplied by if you want to convert it to 1/(2*2) = 0.25.

On the implementation I use this trick to parse base10 fractional parts into safe-decimal, so that I don't need to re-implement base10 parsing while keeping the numbers as small as I can: github.com/voliva/safe-decimal/blo...

The denominator is 5^amount_of_decimals, and the numerator is just parseInt(decimal_part) divided by 2^amount_of_decimals: parseInt essentially multiplies the fractional part by a power of 10, but I know I can already remove all the twos, so the denominator keeps only a power of 5 and the numerator just gets shifted.
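Roughly, in code (a hypothetical sketch, not the exact code from the repo):

// digits / 10^n = digits / (2^n * 5^n), so shift the 2^n out of the numerator
// and keep only 5^n as the denominator.
fn parse_fraction(decimal_part: &str) -> (f64, f64) {
    let n = decimal_part.len() as i32;
    let digits = decimal_part.parse::<u64>().unwrap() as f64;
    let numerator = digits / 2f64.powi(n); // exact: dividing by a power of two
    let denominator = 5f64.powi(n);
    (numerator, denominator)
}

fn main() {
    println!("{:?}", parse_fraction("6"));  // 0.6  -> (3.0, 5.0)
    println!("{:?}", parse_fraction("01")); // 0.01 -> (0.25, 25.0)
}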

fruntend

Congratulations 🥳! Your article hit the top posts for the week - dev.to/fruntend/top-10-posts-for-f...
Keep it up 👍