Table of Contents
- Introduction
- Normalization and Representation
- Converting Floating Point Decimal to Binary
- Conclusion
Introduction
If you have worked with floating-point numbers in computers, you'll notice they sometimes exhibit weird behavior. For example, if you type 0.1 + 0.2 in the Python console, you'll get 0.30000000000000004 instead of 0.3. This behavior is mainly due to how computers store floating-point numbers.
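As a quick sanity check, here is a small Python snippet that reproduces the behavior (math.isclose is the usual way to compare floats within a tolerance):

```python
import math

a = 0.1
b = 0.2

print(a + b)           # 0.30000000000000004
print(a + b == 0.3)    # False

# The usual workaround: compare within a tolerance instead of exactly.
print(math.isclose(a + b, 0.3))  # True
```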
Normalization and Representation
For us, 0.1 is just 0.1, but that is in decimal, and computers know nothing about decimals. All they know is binary. Computers store floating-point numbers by first converting them from floating-point decimal into binary, which we'll take a look at shortly. Then they do what we call normalization.
The idea behind normalization is to create some form of standard. This is because floating-point numbers can be represented in different ways. Take, for example, 0.123. We can represent this as 1.23 × 10^-1 or 12.3 × 10^-2. These are all valid ways of representing floating-point numbers. To make it standardized, we first came up with what is now known as explicit normalization.
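You can check in Python that these spellings all denote the same number:

```python
# 1.23e-1 and 12.3e-2 are just different ways of writing 0.123.
print(0.123 == 1.23e-1 == 12.3e-2)  # True
```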
Explicit Normalization
With explicit normalization, we move the radix point of a floating-point binary number to the left-hand side of its most significant 1. For example, given the binary number 10.100, we'll move the radix point to the left-hand side of the most significant 1, giving 0.10100. Since we moved the radix point 2 places to the left, we multiply by 2^2, thus:
10.100 = 0.10100 × 2^2
(This is binary, thus we're multiplying by powers of 2 and not 10.)
This allows us to save only the fractional part, 10100, which is also known as the mantissa, and the exponent, 2.
. The values are laid out in memory in the form:
________________________
|sign|exponent|mantissa|
------------------------
sign: represents the sign of the floating-point number, with 0 being positive and 1 being negative
exponent: represents the exponent of 2 after normalization
mantissa: represents the fractional value after the radix point
For simplicity's sake, let's assume we have an 8-bit computer, so we're going to store our floating-point number in 8 bits. We use 1 bit to represent our sign, 4 bits to represent our exponent, and 3 bits to represent our mantissa.
The sign is 0 since the value is positive, and the mantissa will be 10100. The exponent will be a bit tricky, as we can't just convert 2 into binary and save it. This is because the exponent can be a negative number, thus we need to add a bias. We get the bias using 2^(k-1) - 1, where k is the number of bits representing the exponent, which for our 8-bit computer is 4. Thus, the bias is 2^(4-1) - 1 = 7, so our exponent will be 2 + 7 = 9, whose binary is 1001. A representation of 0.10100 in an 8-bit computer will be:
____________
|0|1001|101|
------------
We only store 101 instead of 10100 because we only have 3 bits to save our number.
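To make the layout concrete, here is a rough Python sketch of this toy format (1 sign bit, 4 exponent bits with a bias of 7, 3 mantissa bits). The function name pack_explicit is made up for illustration, and it only handles positive values:

```python
def pack_explicit(value: float) -> str:
    """Encode a positive value in the toy 8-bit format using
    explicit normalization: value = 0.mantissa * 2**exponent."""
    assert value > 0
    exponent = 0
    # Move the radix point until the value looks like 0.1xxx...
    while value >= 1:
        value /= 2
        exponent += 1
    while value < 0.5:
        value *= 2
        exponent -= 1
    mantissa = int(value * 2**3)   # keep only 3 fractional bits, drop the rest
    biased = exponent + 7          # bias = 2**(4-1) - 1 = 7
    return f"0 {biased:04b} {mantissa:03b}"

print(pack_explicit(2.5))  # 10.100 in binary -> '0 1001 101'
```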
We can then convert this back with the formula:
(-1)^s × 0.M × 2^E
(-1)^s allows us to get the right sign of the value. s is the sign bit, thus if s is 0, the expression evaluates to 1, but if s is 1, the expression evaluates to -1.
0.M evaluates to 0.101, as M represents the mantissa.
2^E evaluates to the exponent part, but before saving the exponent we added a bias, thus we need to subtract the bias: 2^(E - 7). Thus, the formula becomes:
(-1)^s × 0.M × 2^(E - 7)
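And the matching decode step, again just a sketch of the toy format with a hypothetical unpack_explicit helper:

```python
def unpack_explicit(bits: str) -> float:
    """Decode an 8-bit string 'seeeemmm' stored with explicit normalization."""
    s = int(bits[0])
    e = int(bits[1:5], 2)
    m = int(bits[5:8], 2)
    # (-1)^s * 0.M * 2^(E - 7)
    return (-1) ** s * (m / 2**3) * 2 ** (e - 7)

print(unpack_explicit("01001101"))  # 0.101 * 2**2 = 10.1 in binary = 2.5
```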
Implicit Normalization
The problem with explicit normalization is the same problem that leads to the sum of 0.1 and 0.2 not being exactly 0.3. The problem is precision. With floating-point numbers, the more bits we can store, the more precise our representation. With our 8-bit computer described earlier, we had 3 bits representing the mantissa. This means that if the normalized value has more than 3 fractional bits, as ours did, we can still only save 3 of them and lose the rest. However, if instead of moving the radix point to the left-hand side of the most significant 1 bit, we move it to the right-hand side of it, we can save an extra fractional bit.
10.100 will become 1.0100 × 2^1 and will be represented as:
____________
|0|1000|010|
------------
We can then, during conversion, change the expression 0.M to 1.M.
The last bit we dropped is a 0, so it's irrelevant in this case, but it would be important if it were 1.
Thus, implicit normalization allows us to imply there is a leading 1 and save an extra bit. One bit may not seem like a lot, but it's the difference between ASCII and UTF-8, between 128 and 256 possible values, so yeah, for computers, it's a lot.
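Here is how the packing sketch from before might change under implicit normalization (same toy layout, hypothetical pack_implicit name): the value is normalized to 1.xxx instead of 0.1xxx, and the leading 1 is simply not stored.

```python
def pack_implicit(value: float) -> str:
    """Encode a positive value as 1.mantissa * 2**exponent (toy 8-bit format)."""
    assert value > 0
    exponent = 0
    # Move the radix point until the value looks like 1.xxx...
    while value >= 2:
        value /= 2
        exponent += 1
    while value < 1:
        value *= 2
        exponent -= 1
    # The leading 1 is implied, so only the fractional bits are stored.
    mantissa = int((value - 1) * 2**3)
    biased = exponent + 7
    return f"0 {biased:04b} {mantissa:03b}"

print(pack_implicit(2.5))  # 10.100 -> 1.0100 * 2**1 -> '0 1000 010'
```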
Converting Floating Point Decimal to Binary
Now that we have some basic understanding of how computers store floating-point numbers, let's try to understand why 0.1 + 0.2 is not exactly equal to 0.3. First, let's convert 0.1 and 0.2 to binary. To convert a floating-point number to binary, we first convert the integer part before the radix point to binary. Then we repeatedly multiply the fractional part by 2, each time writing down the integer digit that appears before the radix point and carrying on with the remaining fraction, until the fraction becomes 0 (or starts repeating). Yeah, that may not have been the best explanation, so let's see an example.
Let's say we want to convert 2.25 to binary:
- Convert the integer before the radix point, which is 2, to binary: 10.
- Convert the fractional part by multiplying it by 2, keeping whatever integer appears before the radix point, until the fraction runs out:
0.25 × 2 = 0.5 → 0
0.5 × 2 = 1.0 → 1
Then put the integer values you got, from top to bottom, behind the radix point. The floating-point 2.25 will be 10.01 in binary.
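The procedure is easy to automate. Here is a rough helper that follows the same steps (to_binary is a made-up name, and it only handles non-negative values):

```python
def to_binary(value: float, max_fraction_bits: int = 12) -> str:
    """Convert a non-negative decimal value to a binary string, digit by digit."""
    integer, fraction = divmod(value, 1)
    bits = bin(int(integer))[2:] + "."
    # Repeatedly multiply the fraction by 2 and record the digit
    # that appears before the radix point.
    for _ in range(max_fraction_bits):
        fraction *= 2
        digit, fraction = divmod(fraction, 1)
        bits += str(int(digit))
        if fraction == 0:
            break
    return bits

print(to_binary(2.25))  # 10.01
print(to_binary(0.1))   # 0.000110011001 (the pattern keeps repeating)
```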
Now let's convert 0.1:
- The integer before the radix point is 0, which is the same in binary
- Convert the remaining fraction:
0.1 × 2 = 0.2 → 0
0.2 × 2 = 0.4 → 0
0.4 × 2 = 0.8 → 0
0.8 × 2 = 1.6 → 1
0.6 × 2 = 1.2 → 1
0.2 × 2 = 0.4 → 0
0.4 × 2 = 0.8 → 0
0.8 × 2 = 1.6 → 1
0.6 × 2 = 1.2 → 1
...
We can see this goes on and on. This is a recurring binary number, thus 0.1 in binary is 0.0001100110011...
We can also do the same for 0.2:
0.2 × 2 = 0.4 → 0
0.4 × 2 = 0.8 → 0
0.8 × 2 = 1.6 → 1
0.6 × 2 = 1.2 → 1
0.2 × 2 = 0.4 → 0
0.4 × 2 = 0.8 → 0
0.8 × 2 = 1.6 → 1
0.6 × 2 = 1.2 → 1
...
Thus, 0.2 in binary is 0.001100110011...
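This recurring expansion is exactly why neither 0.1 nor 0.2 can be stored exactly. You can see the stored values straight from Python with the decimal module:

```python
from decimal import Decimal

# The exact decimal value of the double closest to each literal.
print(Decimal(0.1))  # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(0.2))  # 0.200000000000000011102230246251565404236316680908203125
```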
Now before adding these two numbers, we're first going to store them in our 8-bit computer:
- Normalize using implicit normalization:
  - 0.1 = 0.0001100110011... becomes 1.100110011... × 2^-4
  - 0.2 = 0.001100110011... becomes 1.100110011... × 2^-3
- Add the bias to our exponents:
  - for 0.1: -4 + 7 = 3
  - for 0.2: -3 + 7 = 4
- Store them in our 8-bit computer. We only have 3 bits for the mantissa, so we can only store 3 bits after the radix point, meaning we lose all the remaining bits:
For 0.1
____________
|0|0011|100|
------------
For 0.2
____________
|0|0100|100|
------------
- Convert them back to their floating-point representations using our formula from earlier:
  - 0.1: 1.100 × 2^(3-7) = 1.100 × 2^-4 = 0.0001100
  - 0.2: 1.100 × 2^(4-7) = 1.100 × 2^-3 = 0.001100
- Add them:
0.0001100
+ 0.001100
-----------
0.0100100
- Convert back to decimal. Values before the radix point are converted as normal, while values after the radix point use decreasing negative powers of 2, i.e., 2^-1, 2^-2, 2^-3, etc.:
0.0100100 = 2^-2 + 2^-5 = 0.25 + 0.03125 = 0.28125
In our 8-bit computer, 0.1 + 0.2 = 0.28125
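The whole experiment is small enough to simulate in Python. This sketch (to_toy_float is a hypothetical helper, not a real API) rounds a value through our toy 8-bit format, truncating the mantissa to 3 bits just like the worked example, and then adds the results:

```python
def to_toy_float(value: float) -> float:
    """Round-trip a positive value through the toy 8-bit format
    (implicit normalization, 4-bit exponent with bias 7, 3-bit mantissa)."""
    exponent = 0
    while value >= 2:
        value /= 2
        exponent += 1
    while value < 1:
        value *= 2
        exponent -= 1
    mantissa = int((value - 1) * 2**3)   # keep 3 fractional bits, drop the rest
    return (1 + mantissa / 2**3) * 2**exponent

a = to_toy_float(0.1)   # 1.100 * 2**-4 = 0.09375
b = to_toy_float(0.2)   # 1.100 * 2**-3 = 0.1875
print(a + b)            # 0.28125, matching the worked example above
```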
Conclusion
We started this article trying to understand why 0.1 + 0.2 sometimes does not give us 0.3. Using our 8-bit computer for illustration purposes, we've been able to see why. This weird behavior, we came to understand, is a result of how computers normalize and represent floating-point numbers, which leads to a loss of bits and, therefore, a loss of precision.
Luckily for us, real computers now use 32-bit and 64-bit floating-point formats. The Institute of Electrical and Electronics Engineers (IEEE) 754 standard defines how these representations are done: the 32-bit format allows 8 bits for the exponent and 23 bits for the mantissa, while the 64-bit format uses 11 bits for the exponent and 52 bits for the mantissa. Both still use 1 bit for the sign. This means the 64-bit format has more bits to store the fractional part, hence more precision. This is why the 64-bit format is known as double precision and the 32-bit format as single precision. The more bits we're able to store, the closer our approximation to the actual value.
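If you want to see the difference between the two formats from Python (whose float type is a 64-bit double), one option is to round-trip values through the 32-bit format with the standard struct module:

```python
import struct

def as_float32(value: float) -> float:
    """Round-trip a Python float (64-bit) through the 32-bit IEEE 754 format."""
    return struct.unpack("f", struct.pack("f", value))[0]

print(0.1 + 0.2)                          # 0.30000000000000004 (double precision)
print(as_float32(0.1) + as_float32(0.2))  # ~0.3000000045 (single precision)
```

Even with 52 mantissa bits, though, 0.1 still cannot be stored exactly, which is why the tiny error we saw at the start of this article never fully goes away.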