# Fundamentals of Numerical Stability Bastian Heinlein Updated on ・9 min read

Mathematical operations are the basis of most algorithms. We increment counter variables, calculate percentages or measure distances between objects. And it just works.

Until it doesn't. Of course, there are obvious errors like reusing a variable without resetting its value or using the wrong formula for a problem. However, I'd like to shine light on a completely different issue: Numerical stability of algorithms.

Sounds fancy and it is kinda. Because this is such a broad and potentially difficult topic, we'll only look at a simple example.

I watched this great YouTube video the other day, which covered an 1869 MIT maths entrance exam. In exercise 5 this expression should be simplified:

$\frac{ \frac{a+b}{a-b} + \frac{a-b}{a+b} }{ \frac{a+b}{a-b} - \frac{a-b}{a+b} }$

# The Programmer's Solution

As a programmer you could say: Wait, why would I do that? I have a super-fast computer. Let's just write a function to calculate this expression. We even use a simplification to make life simpler for the computer. We set: $c = a + b$ and $d = a - b$ . Now we got this:

$\frac{ \frac{c}{d} + \frac{d}{c} }{ \frac{c}{d} - \frac{d}{c} }$

Now the calculation can be completed in only two additions, two subtractions and five divisions. This is profoundly simpler than the initial five additions, five subtractions and five divisions. Note that we didn't change the expression's form.

We could go even further and find substitutes for $\frac{c}{d}$ and $\frac{d}{c}$ , but let's say it's good enough.

Now, we are ready to write our script:

   function mit1869_naive(a, b) { const c = a+b; const d = a-b; return ( (c/d) + (d/c) )/( (c/d) - (d/c) ); } mit1869_naive(1, 2); 

Feel free to play around with the function a bit. You'll notice that you'll get NaN for a=1 and b=1 because d is zero if a == b. Other than that, everything is fine, isn't it?

At least for now, we don't have a reason to assume that this implementation might be wrong, because all we have done is copying the formula.

# Simplifying the expression

We could call it a day and stop here. But how about actually solving the problem?

We'll use the already simplified version, form the main denominator in numerator and denominator, which we can then reduce. :

$\frac{ \frac{c}{d} + \frac{d}{c} }{ \frac{c}{d} - \frac{d}{c} } = \frac{ \frac{c^2 + d^2}{cd} }{ \frac{c^2 - d^2}{cd} } = \frac{ c^2 + d^2 }{ c^2 - d^2 }$

Because we can't simplify this expression anymore, we'll resubstitute c and d. Before we do that, we use the first and second binomial formula:
$c^2 = (a+b)^2 = a^2 + 2ab + b^2$

$d^2 = (a-b)^2 = a^2 - 2ab + b^2$

Now we resubstitute:

$\frac{ c^2 + d^2 }{ c^2 - d^2 } = \frac{ a^2 + 2ab + b^2 + a^2 - 2ab + b^2 }{ a^2 + 2ab + b^2 - a^2 - 2ab + b^2 }$

The last step is to simplify this expression and we get:

$\frac{ 2(a^2 + b^2) }{ 4ab } = \frac{ a^2 + b^2 }{ 2ab }$

We did it! If you check the video, you'll see the presenter found the same solution although we used a slightly different way.

However, is this solution really better? It requires one addition, four multiplications and one division. If we assume that every operation takes the same time (which isn't necessarily true), we have 6 operations instead of 9. However, if we'd have simplified mit1869_naive even more, we could've saved two divisions and end up with 7 operations. Not a big difference between those two.

But let's say, we implement this function as well:

   function mit1869_simplified(a, b) { return ( a*a + b*b )/( 2*a*b ); } mit1869_simplified(1, 2); 

# Comparing the two functions.

Looks fine. We can play around with some values and both functions will give us the same outputs. Ok, mit1869_naive returns 1.2500000000000002 instead of 1.25 for a=1 and b=2, but that doesn't seem like a relevant difference.

However, testers and engineers alike are more interested in extreme examples. So let's set a=0.1 and b=100. We get:

• mit1869_naive(0.1, 100): 500.0005000000141
• mit1869_simplified(0.1, 100): 500.0005

Again, the naive version seems to have some problems in the last few digits but nothing dangerous. However, the number of valid digits shrank quite a bit to 10 from 15. But two data points are not a trend. So, let's try more values.

a b naive simplified
0.001 1,000 50000.00000509277 50000.000004999994
0.0001 10,000 4999999.9984685425 5000000.00000005
0.00001 100,000 499999972.5076038 500000000
0.000001 1,000,000 50000134641.22209 50000000000
1e-6 1e7 4998445757347.942 5000000000000
1e-7 1e8 486875635391405 500000000000000
1e-8 1e8 3602879701896397 5000000000000000
1e-8 1e9 -Infinity 50000000000000000

As we can see, starting with a= 0.00001 and b= 100,000 we don't have any valid decimal places at all. And if we use really extreme values, we get invalid results.

However, our two functions are mathematically equal. What happened?

# Why do these function output different values?

The problem is: Two functions that are equal on paper don't need to be equal if you implement them in a computer because computers have only limited memory.

Hence, a computer cannot store real numbers exactly (sometimes it does, but you can't count on that, even 0.1 won't be stored precisely). Usually that is not a problem, because for floating point numbers like defined in IEEE-754 we have a guarantee: The relative error will always be smaller than or equal to $2^{-m}$ , where m is natural number depending on the amount of bits used. For a 32 bit IEEE-754 number, m would be 24.

Let's look at some examples where x is the number we want to represent:

x order of magnitude max absolute error
1 one 58*1e-9
100 hundred 5.9*1e-6
1e6 million 0.0596
1e9 billion 59.6
1e12 trillion 59.6*1e3

It's easy to see that you can be off quite a bit, but with respect to x the error will always be small. If you have one million dollars, usually you won't care about \$0.06.

But there are situations in which you care. If we have a look at our example again, we could ask which implementation provides the correct or at least better results. To do that, we need the actual value.

Sadly, we cannot answer this question using another program . Instead, we'll have to tackle the problem with pen and paper. Let's look at a=1e-8 and b=1e9.

At first we'll look at the simplified version:

$\frac{ a^2 + b^2 }{ 2ab } = \frac{ (10^{-8})^2 + (10^9)^2 }{ 2 \cdot 10^{-8} \cdot 10^{9} } = \frac{ 10^{-16} + 10^{18} }{ 2 \cdot 10^{1} }$

We simply inserted the values and simplified a bit. Now, we'll approximate $10^{-16} \approx 0$ . This is a valid assumption because $10^{18}$ is much larger. In fact, a computer would do the same or something similar if you add these two numbers (feel free to insert console.log statements to test this). Now we get:

$\frac{ 10^{18} }{ 2 \cdot 10^{1} } = \frac{ 10^{17} }{ 2 } = 50,000,000,000,000,000$

This is exactly the value the second JS script calculated as well.

After this success, look at the naive implementation. Again, we'll assume $10^{-8} \approx 0$ if we add or subtract a and b because a is much smaller.

$c = a+b \approx b = 10^9$

$d = a-b \approx b = -10^9$

Now we resubstitute:

$\frac{ \frac{c}{d} + \frac{d}{c} }{ \frac{c}{d} - \frac{d}{c} } \approx \frac{ \frac{10^9}{-10^9} + \frac{-10^9}{10^9} }{ \frac{10^9}{-10^9} - \frac{-10^9}{10^9} } = \frac{ -1 - 1 }{ -1 + 1 } = \frac{ -2 }{ 0 }$

Now you probably will say "But you cannot divide by zero!". And that's correct, however a computer does it. It simply assumes, that 0 is a very small number and if you divide something by something very small, you'll get something very big, in this case infinity. But because you divided a negative number by 0, you get -infinity.

Without the approximations, both versions would've delivered the same results. But as explained above, because there's only a finite amount of memory used for each number, the computer has to make this approximation. On a side note: This means that our second approximation wasn't valid from a mathematical standpoint as the result was obviously incorrect. Hence, approximations must always look at where the resulting values will be used and wether small differences will be relevant.

# Generalizing the effects

## There's an upper bound for the maximum relative error.

If you try to represent a real number x in a computer with a floating point number x', the following upper bound for the relative error exists :

$\frac{|x'-x|}{|x|} \leq 2^{-m} \text{ where } m \in \N$

The value of m depends on the implementation and wether a 32 bit or 64 bit system is used.

This has an important consequence: If we add an (in magnitude) small to an (in magnitude) large number, it's possible that we loose information about the small number.

## Different implementations of the same algorithm can cause different results.

This is a direct consequence of the rule above. Sometimes, a specific algorithm is always better than another, sometimes it might depend on the input values.

Definition: Condition of a problem A problem is well-conditioned, if small changes in the input values don't cause very large changes in the result. On the other hand, a problem is ill-conditioned if small changes in the input values do cause large changes in the result . (In fact you could quantify this, but that's a large topic for itself.)

Definition: Stable algorithm An algorithm is stable, if the accumulated error resulting from rounding errors isn't larger than what we would expect for this problem with its given condition and errors in input values. 

Example 1: If we have a well-conditioned problem and only small input variations, a stable algorithm would cause only small output variations.
Example 2: If we have an ill-conditioned problem, even a stable algorithm couldn't ensure that small input variations will only cause small output variations because the big output variations are inherent to the problem!
Example 3: This article looked at one problem. However, we found two algorithms to solve it. One is stable, one is not.

# Applying our gained knowledge

While this only scratches this important topic of computing, you still can use it in your programs. Here are some rules of thumb that'll be useful quite often:

1. Usually, it's better to use multiplications instead of sums. Especially if you'd add very large positive and very large negative numbers and expect an in magnitude small number as result.
2. Divisions by zero should be avoided. In some programming languages exceptions will be thrown, in others it'll result in infinite or NaN. Be careful if you divide by numbers that are almost zero.
3. Rounding errors can add up. Hence, it's better to use fresh values whenever possible. Read about a worst-case failure here.
4. The associative property of addition is not guaranteed on a computer. Often, it's preferable to add numbers in the same order of magnitude as long as possible to approximate this property as good as possible.

Disclaimer: These are rules of thumb. They can be wrong and in some instances, they will be wrong. Critical systems  always should be developed by professionals and not be based on articles from the internet, not even this one.   