Pre-requisite
Before we talk about floating point, let’s remind ourselves of a few pre-requisite topics. Firstly, converting decimal to binary and then we can look at scientific notation.
0. Converting Decimal to Binary
To read floating point numbers, we need to know how to write numbers in binary (base 2).
- Whole numbers are divided by 2 repeatedly, tracking remainders.
- Fractions are multiplied by 2 repeatedly, tracking integer parts.
Note on Subscripts
When you see something like , the subscript tells you the base:
- means "13 in base 10" (decimal).
- means "1101 in base 2" (binary), which equals 13 in decimal.
Example 1: Whole Number
Convert to binary:
- remainder $1$
- remainder $0$
- remainder $1$
- remainder $1$
Reading remainders bottom to top: .
Example 2: Fraction
Convert to binary:
- → take
- → take .
- → take
So .
Putting It Together
.
1. Scientific Notation Refresher (Decimal → Binary)
In decimal (base 10):
We moved the decimal point three places left, and recorded that movement as .
In binary (base 2), the idea is the same.
Here, we shifted the binary point two steps left and tracked it with
.
This trick—“one digit before the point, then a power for the shifts”—is exactly what floating point uses.
2. Why We Need a Rulebook
Computers don’t store decimals like humans do. They only store 1s and 0s. But many decimals (like ) can’t be written exactly in binary.
Example:
It repeats forever.
If each computer chopped that repeating fraction differently, then could give slightly different results on every machine. One might say , another .
To avoid chaos, everyone agreed on a standard: IEEE-754. It tells computers exactly how to break numbers into parts, how to round them, and how to agree on results.
3. Breaking Numbers Into Pieces
IEEE-754 splits every floating-point number into three parts:
- The sign bit → Are we positive or negative?
- The exponent → How far the binary point jumps.
- The mantissa (fraction field) → The detailed digits after the binary point.
For a 32-bit float, the layout is:
Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits) |
---|---|---|
0 | 10000001 | 01100000000000000000000 |
(Example above is for 5.5)
Example:
Write in binary:
Normalize it:
Break it down:
- Sign = 0 (positive).
- Exponent = 2 (we shifted 2 places).
- Mantissa = 1001 (the part after the point).
So 6.25 would be stored as:
We’ll see the bias trick in a moment.
4. The Magic Formula
Every IEEE-754 floating point number follows this formula:
- The makes the number positive or negative.
- The exponent in the bits isn’t the real exponent—it’s “biased.”
- The mantissa is the fractional detail after the binary point.
About the Bias
The bias is just a fixed offset to avoid negative numbers in the exponent field.
- For 32-bit floats, bias = 127.
- For 64-bit doubles, bias = 1023.
Encoding:
Decoding (using the formula):
Example with 6.25:
- Real exponent = 2.
- Stored exponent = 2 + 127 = 129.
- When decoding: 129 - 127 = 2.
5. Anatomy of a Float (32-bit)
A float is 32 bits split as:
1 bit sign | 8 bits exponent | 23 bits mantissa |
---|
Bias = 127.
Example: 5.5
1.Binary form:
2.Normalize:
3.Fields:
- Sign = 0.
- Stored exponent =
.
- Mantissa = 011.
Final layout:
0 10000001 01100000000000000000000
6. Double Precision (64-bit)
Same pattern, but with more bits:
1 bit sign | 11 bits exponent | 52 bits mantissa |
---|
Bias = 1023.
This gives more precision and a wider range of values.
7. Special Rules
Not every pattern of bits maps to a normal number. IEEE-754 defines special cases:
- Exponent = all 0s → subnormal numbers (tiny values, no hidden 1).
- Exponent = all 1s → infinity or NaN (“Not a Number”).
- Mantissa = 0 → exact powers of two.
8. Limits and Rounding
Floats have limited mantissa bits:
- 23 bits for float.
- 52 bits for double.
That means fractions like 0.1 or don’t fit exactly. They get rounded to the nearest representable number.
Example in C++:
std::cout << 0.1;
// 0.10000000149011612
That’s not wrong—it’s just the closest binary approximation.
9. Big Picture
Floating point is just scientific notation in base 2.
- Sign bit = the plus/minus switch.
- Exponent = how far the decimal point hops.
- Mantissa = the zoomed-in detail.
- Limits = like building with a fixed number of Lego bricks—you can’t always make the perfect shape, only the closest one.
Top comments (0)