TL;DR
For precise calculations, favor arbitrary-precision decimals or equivalents like BigDecimal over floating-point numbers.
Additionally, avoid unnecessary rounding. When required, limit rounding to the final step to maintain as much accuracy as possible.
Table of contents
- Prologue
- First things first
- Bits are not enough
- Bits and integers
- Bits and other real numbers
- Fixed-point representation
- Floating-point representation
- Issues and standards
- Floating-point data types
- Floating-point issues
- Decimals to the rescue
- Beware of rounding
- Decimals in other technologies
- Wrapping Up
- References
📜 Prologue
Oh yes, floating-point numbers.
They frequently appear in technical content, full of scientific notations and complex explanations.
It's almost certain that every programmer has already faced the notion that working with floating-point numbers can be perilous, resulting in imprecise arithmetic outcomes, among other issues.
However, comprehending all the underlying reasons behind this crucial topic in computer science can be challenging for many.
In today's post, we will delve into the problems that floating-point numbers address and explore the involved caveats.
So, grab a refreshing bottle of water, and let's embark on yet another journey into the realm of floating-point numbers.
👍🏼 First things first
Computers can only understand machine language.
Machine language is a collection of "bits" that carry data and instructions for the CPU. Each bit holds one of two values, 0 or 1, which is why this is called the base-2 (binary) numeral system.
01001001 01001000 11001011 01000001 01001000 10001000
01011001 01001000 01000001 01101001 01001000 01001001
11000001 10001000 01001001 11001010 10001000 01001000
11001001 01001000 11001001 01001000 01001000 01001001
Programming directly in machine language is highly error-prone and inefficient in most scenarios. To address this, assembly languages were introduced over the years, serving as a bridge between CPU architecture specifics and a higher-level set of instructions.
Assembly languages are translated into machine code by a dedicated program called an "assembler." Each CPU architecture typically has its own assembler.
This allows programmers to work with a more manageable and human-readable instruction set that is then translated into machine code specific to the target architecture.
section .data
number1 dd 10.0 ; Define the first number as a 32-bit float
number2 dd 20.0 ; Define the second number as a 32-bit float
section .text
global _start
_start:
; Load the first number into xmm0 register
movss xmm0, dword [number1]
; Load the second number into xmm1 register
movss xmm1, dword [number2]
.....
.....
Advancements in the field of computer engineering have paved the way for the development of increasingly high-level programming languages that can directly translate into machine code instructions.
Over the course of the following decades, languages like C, Java, and Python, among others, emerged, enabling individuals with limited knowledge of computer internals to write programs for computers.
This significant accomplishment has had a profound impact on the industry, as computers became more compact and faster, empowering modern software engineering practices to deliver substantial value to businesses worldwide.
🔵 Bits are not enough
As mentioned earlier, computers solely comprehend binary bits.
Nothing else in this world can be interpreted by computers.
Only. Bits.
💡 Actually, CPUs in electronic computers comprehend only the absence or presence of voltage, allowing us to represent information using 0 and 1 (off and on)
However, real-life scenarios present challenges where computer programs, which are created by people for people, need to represent a broader range of characters beyond just 0s and 1s. This includes letters, decimal numbers, hexadecimal numbers, special characters, punctuation marks, and even emojis like 😹.
Standard character sets such as ASCII and Unicode schemes solve the challenge of representing numbers, letters, special characters, emojis, and more within the binary system.
⚠️ Delving into the intricacies of character encoding is beyond the scope of this article. It will be covered in future posts
Here, our focus will be specifically on how computers work with numbers in memory, particularly integers.
🔵 Bits and integers
Let's take the number 65 as an example. Written here in the base-10 numeral system, it is a real number and, more specifically, an integer.
By performing conversions based on powers of 2, we can represent the integer 65 as 01000001 in an 8-bit binary format. This binary representation can be converted back and forth to the decimal value 65.
Since 65 is a small integer, it fits within a single byte (8 bits). Using powers of 2, we know that a single byte can accommodate 256 distinct values:
2^8 = 256
Naively speaking, one might assume that a single byte can represent integers ranging from 0 to 255.
However, integers can be negative as well as positive. How should we evenly distribute them within a single byte?
We should employ a technique called two's complement.
👉 Two's complement
Two's complement distributes negative and positive integers evenly within the 8 bits. In this technique:
- the leftmost bit serves as the sign bit, indicating whether the number is positive or negative
- to negate a value, all of its bits are flipped (inverted)
- then 1 is added to the result
This way, a single byte represents integers ranging from -128 to 127.
2^8 = 256
-128, -127, -126, ..., 126, 127
👉 Using two bytes
By employing the two's complement technique, we can also represent a range of integers using two bytes (16 bits). Utilizing the concept of powers of 2, we can observe that two bytes can accommodate a total of 65536 different values:
2^16 = 65536
Considering negative numbers, the range extends from -32768 to 32767, inclusive.
Now, let's explore some examples using PostgreSQL. If you prefer to work with containers, setting up a quick psql terminal is straightforward. You can achieve it by running the following commands:
$ docker run --rm -d \
--name postgres \
-e POSTGRES_HOST_AUTH_METHOD=trust \
postgres
Then, access the psql terminal with the following command:
$ docker exec -it postgres psql -U postgres
In PostgreSQL, the data type that represents a two-byte integer is called int2 or smallint:
SELECT 65::int2;
int2
------
65
To check the data type, we can use the function pg_typeof:
SELECT pg_typeof(65::int2);
pg_typeof
-----------
smallint
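As a side note, psql also gives us a handy way to peek at the bit patterns we discussed earlier: PostgreSQL can cast an integer to a bit string, keeping the rightmost bits of its two's-complement representation. This is just an illustration tied to the earlier sections, not something you would normally need:
SELECT 65::bit(8);
01000001
SELECT (-65)::bit(8);
10111111
Note how -65 is exactly the bits of 65 flipped, plus one.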
As smallint uses two bytes, it can only accommodate the range we derived earlier:
SELECT 32767::int2;
int2
-------
32767
SELECT -32767::int2;
int2
-------
-32767
However, if we attempt to exceed the range:
SELECT 32768::int2;
ERROR: smallint out of range
Pretty neat, isn't it?
In addition to smallint, PostgreSQL offers a variety of other integer data types:
| Data Type | Description | Range of Integers |
|---|---|---|
| smallint | Two-byte integer | -32,768 to 32,767 |
| integer | Four-byte integer | -2,147,483,648 to 2,147,483,647 |
| bigint | Eight-byte integer | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
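As a quick sanity check, the wider types enforce their ranges the same way the smallint examples did (values taken straight from the table above):
SELECT 2147483647::integer;
2147483647
SELECT 2147483648::integer;
ERROR: integer out of range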
However, we all know that the world is not only integers. Integers are a subset of a broader set of numbers, called real numbers.
🔵 Bits and other real numbers
Real numbers can include integers, fractions, and decimals, both rational and irrational.
For instance, 3.14159 is an approximation of the real number π (pi), which is an irrational number. Its decimal expansion is non-repeating and non-terminating: the value of π extends infinitely without any pattern in its decimal representation.
3.14159265358979323846....
Suppose we have two bytes (16 bits), which can represent 65536 integers ranging from -32768 to 32767.
When it comes to representing other real numbers, such as decimals, we can use a technique called fixed-point.
🔵 Fixed-point representation
In fixed-point representation, we split the provided 16 bits into three sections:
👉 Sign bit
The first bit (leftmost) represents the sign, being 1 for negative and 0 for positive.
👉 Decimal part
The next 7 bits represent the decimal (fractional) part, which in our simulation can represent fractional values of up to 0.992188:
2^-1 + 2^-2 + ... + 2^-7 = 0.992188
👉 Integer part
The remaining 8 bits represent the integer part which, using two's complement, can go up to 127:
2^6 + 2^5 + ... + 2^0 = 127
Considering that the integer part, using 8 bits with two's complement, ranges from -128 to 127, we can conclude that, with this fixed-point layout, representable values range from roughly -128.992188 to 127.992188.
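If you want to double-check the fractional bound, the sum of the seven fractional bit weights is plain arithmetic that psql can evaluate (just a quick check, not an actual fixed-point implementation):
SELECT round((2^(-1) + 2^(-2) + 2^(-3) + 2^(-4) + 2^(-5) + 2^(-6) + 2^(-7))::numeric, 6);
0.992188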
However, this technique may not always be the most efficient. Let's explore another technique for representing decimals.
Yes, we are talking about the widely used floating-point representation.
🔵 Floating-point representation
Taking 16 bits still as an example, in floating-point representation we also split the 16 bits into three groups:
👉 Sign bit
The first bit (leftmost) is used to represent whether the number is negative (1) or positive (0).
👉 Exponent part
This crucial component is assigned the next X bits and is what makes the point "float": it determines the scale of the number.
For our simulation, let's allocate 7 bits for the exponent part, using the first exponent bit for the exponent sign.
As a result, the range for the exponent extends from -63 to 63, accommodating both negative and positive values:
2^5 + 2^4 + ... + 2^0 = 63
This part is crucial because it defines the range of magnitudes a floating-point number can represent.
👉 Mantissa
The Mantissa part, also known as the significand, takes the remaining 8 bits, allowing for a range of up to 255.
As we are not representing the integer part in this simulation, there is no need to apply two's complement to the mantissa.
🔑 Now the key part
To calculate the maximum positive floating-point number, we multiply the mantissa by 2 raised to the exponent.
In this case, the maximum positive value would be obtained by multiplying 255 by 2^63, resulting in an exceedingly large number like 2351959869397967831040.0.
Conversely, the minimum positive number can be represented as 1 multiplied by 2^-63, or 0.00000000000000000010842021724855044340074528008699.
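Both extremes are plain arithmetic, so we can reproduce them in psql. The expected values are shown here as comments because the number of digits PostgreSQL displays for numeric results can vary:
SELECT 255 * (2::numeric ^ 63);  -- 2351959869397967831040
SELECT 2::numeric ^ (-63);       -- approximately 1.0842 * 10^-19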
Please note that this simulation is a simplified representation with limited precision and may not reflect the accuracy of ideal or standardized floating-point formats.
🔵 Issues and standards
As the simulation suggests, selecting an appropriate number of bits for the exponent part in a floating-point representation is crucial to mitigate issues with rounding and truncation when handling fractional numbers.
Standards like IEEE 754 were established precisely to address these concerns and provide a consistent framework for floating-point representation. The IEEE 754 standard defines the number of bits allocated to the exponent, mantissa, and sign in both single precision (32 bits) and double precision (64 bits) formats.
These standards determine the precise representation of the various components of a floating-point number, the rules for arithmetic operations, and how to handle exceptional cases.
👉 Single precision (4 bytes)
Single precision numbers are represented using 32 bits of memory.
They include:
- 1 bit for the sign of the number
- 8 bits for the exponent
- 23 bits for the mantissa
According to the IEEE standards, single precision typically provides 6 to 9 significant decimal digits of precision.
👉 Double precision (8 bytes)
Double precision numbers are represented using 64 bits of memory.
They include:
- 1 bit for the sign of the number
- 11 bits for the exponent
- 52 bits for the mantissa
According to the IEEE standards, double precision provides 15 to 17 significant decimal digits of precision.
Usually, double-precision fits better when high precision is mandatory, but it consumes more memory.
🔵 Floating-point data types
Many programming languages and database systems adhere to the IEEE 754 standards, and PostgreSQL is no exception.
Let's see how PostgreSQL implements float data types in action.
The datatype float4 conforms to the IEEE 754 single-precision standard, which allocates 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa:
SELECT 0.3::float4;
float4
--------
0.3
Conversely, the datatype float8 conforms to the IEEE 754 double-precision standard, which allocates 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa:
SELECT 0.3::float8;
float8
--------
0.3
SELECT 0.3::float;
float
--------
0.3
The default float falls back to double precision (float8).
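To see the precision difference between the two types, try casting a literal with many digits to both. The exact digits displayed can vary slightly with the server version and the extra_float_digits setting, but float4 keeps roughly 7 significant digits while float8 keeps about 16:
SELECT 1.234567890123456789::float4;
1.2345679
SELECT 1.234567890123456789::float8;
1.2345678901234568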
☣️ Floating-point issues in action
Let's dive into calculations with floating-point numbers and see the potential issues in action.
Take a straightforward sum of 0.1 + 0.2:
SELECT 0.1::float + 0.2::float;
0.30000000000000004
This result shows how precision issues can arise in double-precision floating-point numbers during arithmetic operations. Even when following standards, we are not immune to these floating-point calculation challenges.
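A direct consequence is that equality checks on floats can fail where plain math says they should succeed:
SELECT 0.1::float + 0.2::float = 0.3::float;
f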
However, there's an alternative strategy that involves a nifty trick using integers.
💡 A trick with integers
Instead of the float data type, we can work with integers. We incorporate a multiplier factor based on a decimal scale when storing values, and then divide by the same factor to restore the original decimal representation when retrieving the value.
This method enables precise decimal calculations by leveraging integers and scaling. The multiplier factor should be chosen based on the required decimal precision.
To demonstrate, let's use this trick to perform 0.1 + 0.2:
SELECT (0.1 * 1000)::int + (0.2 * 1000)::int;
300
Here, each input is multiplied by 1000 and then converted to an integer. To retrieve the original value without losing precision, we divide by 1000:
SELECT (300 / 1000::float);
0.3
Yay! 🚀
However, using a fixed multiplier factor may be inefficient when dealing with inputs that have varying decimal places.
Instead, a variable-scale representation could be employed by converting the input into a string and parsing the number of decimal digits.
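As a rough sketch of that idea (assuming the value has a plain decimal text form with no exponent notation), the scale can be derived by parsing the text representation:
SELECT length(split_part(0.4125::text, '.', 2)) AS scale;
4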
But be aware, variable-scale decimal representations demand careful handling of complex calculations, precise decimal scaling, and various other intricacies of decimal arithmetic.
This is where decimals come in.
🔵 Decimals to the rescue
Decimal data types address the challenges of performing arithmetic on decimal fractions. They significantly reduce the precision issues commonly encountered with floating-point numbers.
Various programming languages and database systems have implemented decimals. PostgreSQL provides the datatype decimal, which offers superior precision compared to floats.
SELECT 0.1::decimal + 0.2::decimal;
0.3
Decimals can also be configured for arbitrary precision and scale:
-- Example: decimal(5, 2) accepts numbers up to 999.99
SELECT 0.1::decimal(5, 2);
0.10
SELECT 999.99::decimal(5, 2);
999.99
Handily, the default datatype for decimals in PostgreSQL is numeric, which is identical to decimal:
SELECT pg_typeof(0.1);
numeric
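And, revisiting the earlier comparison that failed with floats, decimals behave as plain math would suggest:
SELECT 0.1::decimal + 0.2::decimal = 0.3;
t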
⚠️ Beware of rounding
Rounding decimal numbers programmatically can lead to imprecise results. For instance, the sum 25.986 + -0.4125 + -25.5735 should theoretically yield zero:
SELECT 25.986 + -0.4125 + -25.5735;
0.0000
Let's illustrate how we can round only the final sum to two decimal places:
SELECT ROUND(25.986 + -0.4125 + -25.5735, 2);
0.00
So far, so good, it works as expected.
With proper datatypes such as decimal, the arithmetic issue inherent to floating-point numbers is already addressed.
But rounding introduces its own set of challenges. Even if decimals are excellent for precision and arithmetic of decimal data, rounding operations inherently involve some degree of approximation.
Now, let's round each number before summing:
SELECT ROUND(25.986, 2) + ROUND(-0.4125, 2) + ROUND(-25.5735, 2);
0.01
Uh, oh 😭
Every time we round a number, we add a bit of imprecision. Bit by bit, the final result might drift too far from the expected value.
These examples underline why unnecessary rounding should be avoided. As rounding is an approximation, it's best to postpone it until the final step, i.e., when presenting the data to the end user.
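The same comparison can be condensed into a single query over the three values, which makes the gap between the two strategies easy to see (the column names are just illustrative aliases):
SELECT sum(round(x, 2)) AS rounded_each_term,
       round(sum(x), 2) AS rounded_final_sum
FROM (VALUES (25.986), (-0.4125), (-25.5735)) AS t(x);
-- rounded_each_term = 0.01, rounded_final_sum = 0.00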
➕ Decimals in other technologies
Most programming languages and technical tools offer a data type for handling arbitrary precision, analogous to PostgreSQL's decimals.
Ruby offers the BigDecimal class, which facilitates arbitrary-precision floating-point decimal arithmetic.
Similarly, Java also includes a BigDecimal class.
The Go ecosystem is no exception, offering arbitrary-precision decimal arithmetic through its libraries.
It's crucial to verify that the technology you're using provides support for arbitrary precision. If you require greater accuracy, these solutions are often more suitable than using raw floating-point numbers.
Wrapping Up
In this post, we delved into the intricacies of floating-point numbers.
We explored how computers comprehend information through the binary system, from integer representation and fixed-point representation's inefficiency for decimals, to floating-point numbers and their caveats.
We also investigated how arbitrary-precision data types like decimal address these precision issues. Furthermore, we discussed rounding issues and shared best practices for dealing with them.
I hope these complex topics have been presented in a way that's easy to understand, making floating-point issues no longer an issue!
Cheers!
References
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
https://www.postgresql.org/docs/current/datatype.html
https://en.wikipedia.org/wiki/IEEE_754
https://www.doc.ic.ac.uk/~eedwards/compsys/float/
https://en.wikipedia.org/wiki/Floating-point_error_mitigation
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
https://en.wikipedia.org/wiki/Double-precision_floating-point_format
https://en.wikipedia.org/wiki/Decimal_floating_point