(This blog post is part of a collaborative work between me and Youssef El Idrissi. See his dev.to page for more: https://dev.to/0xw3ston)
The Backstory...
Long ago, computer engineers found themselves in a tricky spot: they had no way to work with numbers that are not whole integers (i.e. numbers with a fractional part).
They came up with something called "Fixed-Point Numbers":
Fixed-Point Numbers
- The DOT in pink is called the radix point; it separates the integer part from the fractional part.
Fixed-Point is about:
- giving the integer side X bits in order to represent it in binary
- giving the fraction side Y bits in order to represent it in binary
Here are a few examples
(In our examples we assign 4-bits for the fraction part, 4-bits for the integer part)
- Examples 1 and 2 work fine, since 8 bits are enough to cover the values
- however, in Examples 3 and 4 you can clearly see issues arise when it comes to accounting for negative numbers and for accuracy.
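To make the idea concrete, here is a minimal sketch of an unsigned 4.4 fixed-point format (4 integer bits, 4 fraction bits) in Python. The names `to_fixed` and `from_fixed` are illustrative, not from any standard library:

```python
# A minimal sketch of unsigned 4.4 fixed-point (4 integer bits, 4 fraction bits).

SCALE = 1 << 4  # 2^4 = 16; the fraction part moves in steps of 1/16

def to_fixed(x: float) -> int:
    """Encode x as an 8-bit fixed-point integer (truncates, no rounding)."""
    raw = int(x * SCALE)
    if not 0 <= raw < (1 << 8):
        raise OverflowError("value does not fit in unsigned 4.4 fixed-point")
    return raw

def from_fixed(raw: int) -> float:
    """Decode the raw 8-bit pattern back into a real number."""
    return raw / SCALE

print(from_fixed(to_fixed(5.75)))  # 5.75 — exactly representable (0.75 = 12/16)
print(from_fixed(to_fixed(5.1)))   # 5.0625 — 0.1 is not a multiple of 1/16, so accuracy is lost
```

This also shows the accuracy problem from the examples above: any fraction that is not a multiple of 1/16 gets truncated.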
For this reason there was another standard defined, "Floating-Point Numbers":
Floating-Point Numbers
What makes floating-point numbers so elegant is the direct inspiration they draw from scientific notation; the format is defined by a formal standard, IEEE 754.
In high school, we were taught to shorten very large or very small numbers like so:
135600000 = 1.356 × 10^8
and
0.00005997758 = 5.997758 × 10^-5
With pen and paper it's generally okay to write numbers either way (as they are, or in scientific notation).
But with computers, how data is stored is critical and requires bit-level accuracy to avoid wrong results.
Single-Precision Floating Point
32-bit representation
There are multiple variations of floating-point numbers; one of them is the 32-bit (single-precision) representation.
In all of the floating-point variations we find three main components:
- Mantissa: the significant digits of our number (also called the significand).
- Exponent: the power of two by which we multiply the mantissa to get the original number back.
- Sign: a one-bit flag that tells us whether the number is negative or positive.
Encoding
To encode a number from base-10 to base-2:
- We take the integer side and turn it into binary
- We take the fraction/decimal side and turn it into binary
- We then slide the radix point until we get a single 1 on the left side
- And lastly we apply the "Bias", which we will explain later.
Integer to Binary Conversion
We do the following calculation to turn the integer part (124 in our running example, 124.25568) into binary:
| Value (Integer) | bit |
|---|---|
| 124 mod 2 | 0 |
| 62 mod 2 | 0 |
| 31 mod 2 | 1 |
| 15 mod 2 | 1 |
| 7 mod 2 | 1 |
| 3 mod 2 | 1 |
| 1 mod 2 | 1 |
Reading the remainder bits from bottom to top gives: 1111100
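The table above can be sketched as a short Python function; the name `int_to_bits` is just illustrative. It mirrors the table exactly: divide by 2 repeatedly and collect the remainders, reading them bottom-up:

```python
def int_to_bits(n: int) -> str:
    """Repeated division by 2; the remainders, read bottom-up, are the bits."""
    bits = ""
    while n > 0:
        bits = str(n % 2) + bits  # prepend the remainder, so bits end up bottom-up
        n //= 2
    return bits or "0"

print(int_to_bits(124))  # 1111100
```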
Fraction/Decimal to Binary Conversion
We do the following calculation to turn the fractional part (0.25568) into binary:
| Value (Fraction) | bit |
|---|---|
| 0.25568 × 2 = 0.51136 | 0 |
| 0.51136 × 2 = 1.02272 | 1 |
| 0.02272 × 2 = 0.04544 | 0 |
| 0.04544 × 2 = 0.09088 | 0 |
| 0.09088 × 2 = 0.18176 | 0 |
| 0.18176 × 2 = 0.36352 | 0 |
| 0.36352 × 2 = 0.72704 | 0 |
| 0.72704 × 2 = 1.45408 | 1 |
| 0.45408 × 2 = 0.90816 | 0 |
| 0.90816 × 2 = 1.81632 | 1 |
| 0.81632 × 2 = 1.63264 | 1 |
| 0.63264 × 2 = 1.26528 | 1 |
| 0.26528 × 2 = 0.53056 | 0 |
| 0.53056 × 2 = 1.06112 | 1 |
| ... | ... |
Reading the carried integer parts from top to bottom gives: .01000001011101...
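This table, too, can be sketched as a small Python function (the name `frac_to_bits` is illustrative): keep doubling the fraction, and the integer part that pops out at each step is the next bit.

```python
def frac_to_bits(f: float, n_bits: int = 14) -> str:
    """Repeatedly double the fraction; the integer part at each step is the next bit."""
    bits = ""
    for _ in range(n_bits):
        f *= 2
        bit = int(f)       # 0 or 1: the digit that crossed the radix point
        bits += str(bit)
        f -= bit           # keep only the fractional remainder
    return bits

print(frac_to_bits(0.25568))  # 01000001011101
```

Note that unlike the integer case, this loop can go on forever: many decimal fractions have an infinite binary expansion, which is why we cut it off at a fixed number of bits.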
So once we are done, we are left with 1111100.01000001011101...
To normalize this into scientific notation, we slide the radix point to the left (or to the right, depending on where the leftmost 1 is) until exactly one 1 remains on the left side of the radix point, like so:
1111100.01000001011101... = 1.11110001000001011101... × 2^6
Then comes the "add the bias" step. First, what's a bias?
Bias: a fixed value added to the real exponent so that the exponent can be stored as an unsigned integer. Instead of storing negative exponents directly, we store:
Stored exponent = Real exponent + Bias
For example, single-precision floating point uses a bias of 127.
So in this case the stored exponent is 6 + 127 = 133, which in binary is 10000101.
So we're left with:
- Sign: 0 (positive number)
- Exponent: 10000101
- Mantissa: 11110001000001011101... (continued, then rounded, to fill the 23 mantissa bits)
In normalized IEEE 754 numbers, the leading 1 (highlighted in red in the diagram below) before the radix point is implicit and therefore not stored. This is known as the hidden bit optimization.
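We can check the whole encoding against Python's own IEEE 754 handling. The standard `struct` module lets us pack a number as a 32-bit float and inspect the raw bits (the helper name `float32_fields` is just illustrative):

```python
import struct

def float32_fields(x: float):
    """Pack x as IEEE 754 single precision and split the 32 raw bits into fields."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the 4 bytes as an unsigned int
    bits = f"{raw:032b}"
    return bits[0], bits[1:9], bits[9:]  # sign (1 bit), exponent (8 bits), mantissa (23 bits)

sign, exponent, mantissa = float32_fields(124.25568)
print(sign)                    # 0  (positive)
print(exponent)                # 10000101  (133 = 6 + bias of 127)
print(mantissa)                # 23 bits, starting with 11110001000001011101
print(int(exponent, 2) - 127)  # 6, the real exponent
```

Note that the hidden bit really is absent: the stored mantissa starts at the first bit *after* the implicit leading 1.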
Decoding
Decoding is fairly simple:
- Extract the exponent
- Subtract the bias from the exponent to get the real exponent (which is 6 in this case)
- Extract the mantissa and prepend the implicit leading 1
- Reconstruct the value as mantissa × 2^exponent
which gives us back 1111100.01000001011101... ≈ 124.25568.
This way we managed to decode the floating-point number.
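The decoding steps above can be sketched as a small Python function. This handles normalized numbers only; special cases like zero, subnormals, infinities and NaN are ignored here for simplicity:

```python
def decode_float32(bits: str) -> float:
    """Rebuild the value from a 32-character bit string (normalized numbers only)."""
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:9], 2) - 127          # subtract the bias to get the real exponent
    mantissa = 1.0                              # start from the implicit hidden bit
    for i, b in enumerate(bits[9:], start=1):
        mantissa += int(b) * 2 ** -i            # each mantissa bit i is worth 2^-i
    return sign * mantissa * 2 ** exponent

value = decode_float32("0" + "10000101" + "11110001000001011101000")
print(value)  # ≈ 124.25568 (exact to the precision the 23 mantissa bits allow)
```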
Other Precision
Half-Precision Floating-Point (16-bit)
Used mostly in AI and graphics.
Double-Precision Floating-Point (64-bit)
Mostly used in scientific computing, such as physics simulations, engineering calculations & numerical methods.
Quad-Precision Floating-Point (128-bit)
Used in high-precision scientific simulations & financial systems where the error rate must be minimized aggressively; it is present in astrophysics, climate modeling, computational number theory, etc.
Octuple-Precision Floating-Point (256-bit)
Used in cryptography research, symbolic mathematics, experimental physics and extremely sensitive numerical algorithms.
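As a rough illustration of how the error shrinks with each precision, Python's `struct` module can round-trip a value through the three formats it supports natively: half ('e'), single ('f') and double ('d'). (Quad and octuple precision have no native `struct` codes.)

```python
import struct

# How much of 1/3 survives a round-trip through each precision?
for name, fmt in [("half", "e"), ("single", "f"), ("double", "d")]:
    (roundtripped,) = struct.unpack(fmt, struct.pack(fmt, 1 / 3))
    print(f"{name:>6}: error = {abs(roundtripped - 1 / 3):.2e}")
```

Each step up roughly squares the relative accuracy, because the mantissa width grows from 10 to 23 to 52 bits.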