(This blog post is part of a collaborative work between me and Youssef El Idrissi. See his dev.to page for more: https://dev.to/0xw3ston)
The Backstory...
Long ago, computer engineers found themselves in a tricky spot: they had no way to work with numbers that are not whole integers (i.e. numbers with a fractional part).
They came up with something called "Fixed-Point Numbers":
Fixed-Point Numbers
- The DOT in pink is called the radix point; it separates the integer part from the fractional part.
Fixed-Point is about:
- giving the integer side X bits in order to represent it in binary
- giving the fraction side Y bits in order to represent it in binary
Here are a few examples
(In our examples we assign 4-bits for the fraction part, 4-bits for the integer part)
- Examples 1 and 2 work fine, since 8 bits are enough to cover the values
- however, in Examples 3 and 4 you can clearly see issues arise when it comes to accounting for negative numbers and for accuracy.
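To make the idea concrete, here is a minimal sketch of an unsigned 4.4 fixed-point format (4 integer bits, 4 fraction bits) in Python. The names `to_fixed` and `from_fixed` are illustrative, not from any standard library:

```python
# A minimal sketch of unsigned 4.4 fixed-point (4 integer bits, 4 fraction bits).

SCALE = 1 << 4  # 2^4 = 16; the fraction part moves in steps of 1/16

def to_fixed(x: float) -> int:
    """Encode x as an 8-bit fixed-point integer (truncates, no rounding)."""
    raw = int(x * SCALE)
    if not 0 <= raw < (1 << 8):
        raise OverflowError("value does not fit in unsigned 4.4 fixed-point")
    return raw

def from_fixed(raw: int) -> float:
    """Decode the raw 8-bit pattern back into a real number."""
    return raw / SCALE

print(from_fixed(to_fixed(5.75)))  # 5.75 — exactly representable (0.75 = 12/16)
print(from_fixed(to_fixed(5.1)))   # 5.0625 — 0.1 is not a multiple of 1/16, so accuracy is lost
```

This also shows the accuracy problem from the examples above: any fraction that is not a multiple of 1/16 gets truncated.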
For this reason there was another standard defined, "Floating-Point Numbers":
Floating-Point Numbers
What makes floating-point numbers so elegant is the direct inspiration they draw from scientific notation; the format is defined by a formal standard, IEEE 754.
In high school, we were taught to shorten very large or very small numbers like so:
135600000 = 1.356 × 10^8
and
0.00005997758 = 5.997758 × 10^-5
With pen and paper it's generally okay to write numbers either way (as they are, or in scientific notation).
But with computers, how data is stored is critical and requires bit-level accuracy to avoid wrong results.
Single-Precision Floating Point
32-bit representation
There are multiple variations of floating-point numbers; one of them is the 32-bit (single-precision) representation.
In all of the floating-point variations we find three main components:
- Mantissa: the significant digits of our number (also called the significand).
- Exponent: the power of two by which we multiply the mantissa to get the original number back.
- Sign: a one-bit flag that tells us whether the number is negative or positive.
Encoding
To encode a number from base-10 to base-2:
- We take the integer side and turn it into binary
- We take the fraction/decimal side and turn it into binary
- We then slide the radix point until we get a single 1 on the left side
- And lastly we apply the "Bias", which we will explain later.
Integer to Binary Conversion
We do the following calculation to turn the integer part (124 in our running example, 124.25568) into binary:
| Value (Integer) | bit |
|---|---|
| 124 mod 2 | 0 |
| 62 mod 2 | 0 |
| 31 mod 2 | 1 |
| 15 mod 2 | 1 |
| 7 mod 2 | 1 |
| 3 mod 2 | 1 |
| 1 mod 2 | 1 |
Reading the remainder bits from bottom to top gives: 1111100
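The table above can be sketched as a short Python function; the name `int_to_bits` is just illustrative. It mirrors the table exactly: divide by 2 repeatedly and collect the remainders, reading them bottom-up:

```python
def int_to_bits(n: int) -> str:
    """Repeated division by 2; the remainders, read bottom-up, are the bits."""
    bits = ""
    while n > 0:
        bits = str(n % 2) + bits  # prepend the remainder, so bits end up bottom-up
        n //= 2
    return bits or "0"

print(int_to_bits(124))  # 1111100
```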
Fraction/Decimal to Binary Conversion
We do the following calculation to turn the fractional part (0.25568) into binary:
| Value (Fraction) | bit |
|---|---|
| 0.25568 × 2 = 0.51136 | 0 |
| 0.51136 × 2 = 1.02272 | 1 |
| 0.02272 × 2 = 0.04544 | 0 |
| 0.04544 × 2 = 0.09088 | 0 |
| 0.09088 × 2 = 0.18176 | 0 |
| 0.18176 × 2 = 0.36352 | 0 |
| 0.36352 × 2 = 0.72704 | 0 |
| 0.72704 × 2 = 1.45408 | 1 |
| 0.45408 × 2 = 0.90816 | 0 |
| 0.90816 × 2 = 1.81632 | 1 |
| 0.81632 × 2 = 1.63264 | 1 |
| 0.63264 × 2 = 1.26528 | 1 |
| 0.26528 × 2 = 0.53056 | 0 |
| 0.53056 × 2 = 1.06112 | 1 |
| ... | ... |
Reading the carried integer parts from top to bottom gives: .01000001011101...
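This table, too, can be sketched as a small Python function (the name `frac_to_bits` is illustrative): keep doubling the fraction, and the integer part that pops out at each step is the next bit.

```python
def frac_to_bits(f: float, n_bits: int = 14) -> str:
    """Repeatedly double the fraction; the integer part at each step is the next bit."""
    bits = ""
    for _ in range(n_bits):
        f *= 2
        bit = int(f)       # 0 or 1: the digit that crossed the radix point
        bits += str(bit)
        f -= bit           # keep only the fractional remainder
    return bits

print(frac_to_bits(0.25568))  # 01000001011101
```

Note that unlike the integer case, this loop can go on forever: many decimal fractions have an infinite binary expansion, which is why we cut it off at a fixed number of bits.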
So once we are done, we are left with 1111100.01000001011101...
To normalize this into scientific notation, we slide the radix point to the left (or to the right, depending on where the leftmost 1 is) until exactly one 1 remains on the left side of the radix point, like so:
1111100.01000001011101... = 1.11110001000001011101... × 2^6
Then comes the "add the bias" step. First, what's a bias?
Bias: a fixed value added to the real exponent so that the exponent can be stored as an unsigned integer. Instead of storing negative exponents directly, we store:
Stored exponent = Real exponent + Bias
For example, single-precision floating point uses a bias of 127.
So in this case the stored exponent is 6 + 127 = 133, which in binary is 10000101.
So we're left with:
- Sign: 0 (positive number)
- Exponent: 10000101
- Mantissa: 11110001000001011101... (continued, then rounded, to fill the 23 mantissa bits)
In normalized IEEE 754 numbers, the leading 1 (highlighted in red in the diagram below) before the radix point is implicit and therefore not stored. This is known as the hidden bit optimization.
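We can check the whole encoding against Python's own IEEE 754 handling. The standard `struct` module lets us pack a number as a 32-bit float and inspect the raw bits (the helper name `float32_fields` is just illustrative):

```python
import struct

def float32_fields(x: float):
    """Pack x as IEEE 754 single precision and split the 32 raw bits into fields."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the 4 bytes as an unsigned int
    bits = f"{raw:032b}"
    return bits[0], bits[1:9], bits[9:]  # sign (1 bit), exponent (8 bits), mantissa (23 bits)

sign, exponent, mantissa = float32_fields(124.25568)
print(sign)                    # 0  (positive)
print(exponent)                # 10000101  (133 = 6 + bias of 127)
print(mantissa)                # 23 bits, starting with 11110001000001011101
print(int(exponent, 2) - 127)  # 6, the real exponent
```

Note that the hidden bit really is absent: the stored mantissa starts at the first bit *after* the implicit leading 1.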
Decoding
Decoding is fairly simple:
- Extract the exponent
- Subtract the bias from the exponent to get the real exponent (which is 6 in this case)
- Extract the mantissa and prepend the implicit leading 1
- Reconstruct the value as mantissa × 2^exponent
which gives us back 1111100.01000001011101... ≈ 124.25568.
This way we managed to decode the floating-point number.
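The decoding steps above can be sketched as a small Python function. This handles normalized numbers only; special cases like zero, subnormals, infinities and NaN are ignored here for simplicity:

```python
def decode_float32(bits: str) -> float:
    """Rebuild the value from a 32-character bit string (normalized numbers only)."""
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:9], 2) - 127          # subtract the bias to get the real exponent
    mantissa = 1.0                              # start from the implicit hidden bit
    for i, b in enumerate(bits[9:], start=1):
        mantissa += int(b) * 2 ** -i            # each mantissa bit i is worth 2^-i
    return sign * mantissa * 2 ** exponent

value = decode_float32("0" + "10000101" + "11110001000001011101000")
print(value)  # ≈ 124.25568 (exact to the precision the 23 mantissa bits allow)
```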
Other Precision
Half-Precision Floating-Point (16-bit)
Used mostly in AI and graphics.
Double-Precision Floating-Point (64-bit)
Mostly used in scientific computing, such as physics simulations, engineering calculations & numerical methods.
Quad-Precision Floating-Point (128-bit)
Used in high-precision scientific simulations & financial systems where the error rate must be minimized aggressively; it is present in astrophysics, climate modeling, computational number theory, etc.
Octuple-Precision Floating-Point (256-bit)
Used in cryptography research, symbolic mathematics, experimental physics and extremely sensitive numerical algorithms.
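As a rough illustration of how the error shrinks with each precision, Python's `struct` module can round-trip a value through the three formats it supports natively: half ('e'), single ('f') and double ('d'). (Quad and octuple precision have no native `struct` codes.)

```python
import struct

# How much of 1/3 survives a round-trip through each precision?
for name, fmt in [("half", "e"), ("single", "f"), ("double", "d")]:
    (roundtripped,) = struct.unpack(fmt, struct.pack(fmt, 1 / 3))
    print(f"{name:>6}: error = {abs(roundtripped - 1 / 3):.2e}")
```

Each step up roughly squares the relative accuracy, because the mantissa width grows from 10 to 23 to 52 bits.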