English and Chinese are available.
1. Definition of Constants
b: the base (radix), either 2 or 10
emax: the maximum value of the exponent
emin: the minimum value of the exponent
For all floating-point formats,
emin = 1 - emax
2. Representation of floating-point format
A floating-point format is specified by its radix and the number of bits in its encoding (e.g. binary64).
A finite floating-point number is represented in this form:
$(-1)^s \times b^e \times m$
- s is 0 or 1
- e is an integer with emin <= e <= emax
- m is a number represented by a digit string of the form $d_0 . d_1 d_2 \ldots d_{p-1}$, where each $d_i$ is an integer digit with $0 \le d_i < b$ (therefore $0 \le m < b$)
In addition, every format can represent:
- Two infinities, +∞ and −∞.
- Two kinds of NaN (Not a Number): qNaN (quiet) and sNaN (signaling)
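As a minimal sketch of the form above, the value of a finite number can be computed exactly from s, e, and the digit string of m. The function name `fp_value` and its argument layout are assumptions made for this illustration, not part of any standard API.

```python
from fractions import Fraction

def fp_value(s, e, digits, b=2):
    """Return (-1)^s * b^e * m, where m = d0.d1 d2 ... d(p-1) in base b."""
    if s not in (0, 1):
        raise ValueError("s must be 0 or 1")
    if any(not (0 <= d < b) for d in digits):
        raise ValueError("each digit must satisfy 0 <= d < b")
    # m = d0 + d1/b + d2/b^2 + ..., kept exact with Fraction
    m = sum(Fraction(d) / Fraction(b) ** i for i, d in enumerate(digits))
    return (-1) ** s * Fraction(b) ** e * m

print(fp_value(0, 3, [1, 0, 0, 1]))    # binary 1.001 * 2^3 -> 9
print(fp_value(1, -1, [2, 5], b=10))   # -(2.5 * 10^-1) -> -1/4
```

Using `Fraction` keeps the result exact, so the sketch does not depend on the very floating-point arithmetic it describes.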
3. Binary Floating-Point Numbers
The binary floating-point format is the familiar representation, and it is the one standardized by IEEE 754-1985.
A binary floating-point number is encoded in three fields:
- S: sign, 1 bit
- E: biased exponent, w bits, E = e + bias
- T: trailing significand (tail), t bits, where t = p - 1 and $T = d_1 d_2 \cdots d_{p-1}$ (each $d_i$ is a binary digit)
Example of binary32
1). Normalized number
E is an 8-bit unsigned integer in the range 0 to 255, so E − bias ranges from −127 to 128.
For a normalized number, the bits $E_0 \ldots E_{w-1}$ are neither all 0 (E = 0, E − bias = −127) nor all 1 (E = 255, E − bias = 128).
In this case, the value of the floating-point number is $(-1)^S \times 2^{E - \text{bias}} \times m$, where
$m = 1 + 2^{-t} T = 1 + \sum_{i=1}^{t} 2^{-i} d_i$
E − bias lies strictly between −127 and 128, that is, from −126 to 127.
m is the significand in scientific notation: it is greater than or equal to 1 and less than 2.
S, E, and T are stored in binary.
-> In binary: 1001.0 (that is, 9.0 in decimal)
-> Representation: $(-1)^0 \times 2^{3} \times 1.001_2$, stored with E = 3 + 127 = 130 (bias: 127)
S = 0 (+), E = 130, T = 001 followed by twenty 0 bits
0 10000010 00100000000000000000000
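The field split for this example can be checked with Python's standard `struct` module. This is only a sketch: the helper name `binary32_fields` is made up for this illustration, and it assumes the platform's `float` packing follows IEEE 754 binary32, which is what `struct` format `f` uses in standard mode.

```python
import struct

def binary32_fields(x):
    """Pack x as big-endian binary32 and return the (S, E, T) fields as integers."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31            # 1 sign bit
    e = (bits >> 23) & 0xFF   # 8 exponent bits (biased)
    t = bits & 0x7FFFFF       # 23 trailing significand bits
    return s, e, t

s, e, t = binary32_fields(9.0)
print(s, format(e, "08b"), format(t, "023b"))
# prints: 0 10000010 00100000000000000000000
```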
Why does the bias exist?
E is stored as an unsigned integer, so its value is in the range 0 to 255. However, the actual exponent can be negative. To cover negative exponents with a roughly symmetric range, the middle value (127) is added to the exponent when it is stored and subtracted again when it is used. The range of E − bias is therefore −127 to 128, of which −126 to 127 is used by normalized numbers.
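A small sketch of how the bias relates to the exponent width w. It assumes the usual IEEE 754 binary convention that bias = emax = 2^(w-1) - 1, which is not stated explicitly in the text above; `exponent_params` is a hypothetical helper name.

```python
def exponent_params(w):
    """Return (bias, emin, emax) for a binary format with a w-bit exponent field."""
    bias = (1 << (w - 1)) - 1   # the "middle" value of the unsigned range
    emax = bias
    emin = 1 - emax             # the relation from section 1
    return bias, emin, emax

print(exponent_params(8))    # binary32 -> (127, -126, 127)
print(exponent_params(11))   # binary64 -> (1023, -1022, 1023)
```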
2). Subnormal Number
When $E_0 \ldots E_{w-1}$ are all 0 (E = 0), the exponent is −126 (emin), not −127, so that subnormal numbers continue from the smallest normalized number without a gap, and the implicit leading bit of the significand changes from 1 to 0.
Because $0 \le T < 2^t$ gives $1 \le m < 2$, a normalized number cannot represent 0 or any magnitude below $b^{emin}$. Therefore, when the absolute value of a number is less than $b^{emin}$, it is represented in the subnormal format.
In this case, the value of the floating-point number is $(-1)^S \times 2^{emin - t} \times T = (-1)^S \times 2^{emin} \times \sum_{i=1}^{t} 2^{-i} d_i$
(0 is represented as:
S 000...000 )
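The two formulas (normalized and subnormal) can be put side by side in a short decoder for binary32 bit patterns. This is a sketch under the definitions above; the function name `decode_finite_binary32` is an assumption, and the sign of zero is lost because the result is an exact `Fraction`.

```python
from fractions import Fraction

BIAS, T_BITS = 127, 23
EMIN = 1 - BIAS  # -126

def decode_finite_binary32(bits):
    """Return the exact value of a finite binary32 bit pattern (32-bit integer)."""
    s = bits >> 31
    E = (bits >> 23) & 0xFF
    T = bits & ((1 << T_BITS) - 1)
    if E == 0xFF:
        raise ValueError("infinity or NaN, not a finite number")
    if E == 0:
        # Subnormal (or zero): leading bit is 0 and the exponent is emin
        m, e = Fraction(T, 1 << T_BITS), EMIN
    else:
        # Normalized: implicit leading bit is 1
        m, e = 1 + Fraction(T, 1 << T_BITS), E - BIAS
    return (-1) ** s * Fraction(2) ** e * m

print(decode_finite_binary32(0x41100000))                          # 9
print(decode_finite_binary32(0x00000001) == Fraction(1, 2**149))   # True
```

The smallest subnormal, with T = 1, comes out as 2^(emin − t) = 2^(−149), matching the formula.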
+∞: S = 0, every bit of E is 1, T = 0
−∞: S = 1, every bit of E is 1, T = 0
NaN: every bit of E is 1 and T ≠ 0
The difference between a quiet NaN and a signaling NaN is a flag bit in the significand field.
A quiet NaN does not raise any additional exceptions (the FPU does not raise a hardware exception), and it propagates through most operations. The exceptions are operations that cannot simply pass a NaN through to the output intact, such as format conversions and some comparison operations.
A signaling NaN is the opposite: using it as an operand signals the invalid operation exception.
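A sketch of how these special values can be told apart from the bit pattern of a binary32. It assumes the common convention (recommended by IEEE 754-2008 for binary formats) that the most significant bit of T is 1 for a quiet NaN and 0 for a signaling NaN; some older hardware uses the opposite meaning. `classify_binary32` is a name chosen for this example.

```python
def classify_binary32(bits):
    """Classify a 32-bit pattern as 'finite', '+inf', '-inf', 'qNaN', or 'sNaN'."""
    E = (bits >> 23) & 0xFF
    T = bits & 0x7FFFFF
    if E != 0xFF:
        return "finite"
    if T == 0:
        return "-inf" if bits >> 31 else "+inf"
    # Assumed convention: top bit of T set means quiet
    return "qNaN" if T >> 22 else "sNaN"

print(classify_binary32(0x7F800000))  # +inf
print(classify_binary32(0x7FC00000))  # qNaN
print(classify_binary32(0x7F800001))  # sNaN
```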
4. Decimal Floating-Point Format Number
Here is a brief look.
The decimal floating-point format has two encoding methods: one is called DPD (Densely Packed Decimal), and the other is called BID (Binary Integer Decimal).
4.0 Value of Decimal Floating-Point Number
$(-1)^S \times T \times 10^{E - \text{bias}}$
The values of T and E are calculated from the combination field and their respective continuation fields.
- S: sign, 1 bit
- Comb: combination field
- E: exponent, w bits, E = e + bias
- T: trailing significand (tail), t bits, where t = p - 1 and $T = d_1 d_2 \cdots d_{p-1}$
4.2 Significand -- Difference between BID & DPD
BID and DPD both encode each part in binary for storage. The difference is in the significand: BID takes the significant digits of the number in scientific notation as one integer and stores that integer in binary, whereas DPD uses a mapping table in which every 3 decimal digits correspond to 10 binary digits.
As for decimal encoding, some hardware supports decimal processing directly, such as IBM POWER. In that case the standard's formats are used directly to store and calculate numbers; otherwise an encoding such as DPD is needed to convert the digits to binary for storage.
In both DPD and BID cases, the most significant 4 bits of the significand (which actually only have 10 possible values [0~9]) are combined with the most significant 2 bits of the exponent (3 possible values) to use 30 of the 32 possible values of a 5-bit field. The remaining combinations encode infinities and NaNs.
| Combination field | Exponent MSBs | Significand MSBs | Other |
|---|---|---|---|
| 11111 | - | - | NaN. Sign bit ignored. First bit of exponent continuation field determines if NaN is signaling. |
The leading 0 or 100 shown under Significand MSBs is NOT stored in the significand field.
It is derived from the Comb field.
The Comb field occupies 5 bits in decimal64, and those 5 bits come from the exponent (E) and the significand (T).
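- G0G1 G2G3G4 (G0G1 is not 11)
G0G1: the two most significant bits of the exponent.
G2G3G4: the three low bits of the leading significand digit, 0G2G3G4 in binary (0 to 7).
- 11 G2G3 G4 (G2G3 is not 11)
G2G3: the two most significant bits of the exponent.
G4: the low bit of the leading significand digit, 100G4 in binary, that is 8 + G4 (8 or 9).
- 1111 G4
Special value: infinity or NaN.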
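The three cases of the 5-bit Comb field can be written as a small decoder. This is a sketch of the rules listed above; the function name `decode_combination` and the returned tuple layout are assumptions for illustration. The split of the last case into infinity (11110) and NaN (11111) follows the bit layouts given at the end of this article.

```python
def decode_combination(g):
    """Decode a 5-bit Comb field G0..G4 (G0 is the most significant bit).

    Returns (kind, exponent_msbs, leading_digit); the last two are None
    for special values.
    """
    if not 0 <= g < 32:
        raise ValueError("the Comb field has exactly 5 bits")
    if g >> 1 == 0b1111:                    # 1111 G4: special value
        return ("NaN" if g & 1 else "Infinity", None, None)
    if g >> 3 == 0b11:                      # 11 G2G3 G4: leading digit 8 or 9
        return ("finite", (g >> 1) & 0b11, 8 + (g & 1))
    return ("finite", g >> 3, g & 0b111)    # G0G1 G2G3G4: leading digit 0 to 7

print(decode_combination(0b01101))  # ('finite', 1, 5)
print(decode_combination(0b11011))  # ('finite', 1, 9)
print(decode_combination(0b11110))  # ('Infinity', None, None)
```

Counting the finite results over all 32 inputs gives 30 combinations, 3 exponent prefixes times 10 leading digits, which matches the "30 of the 32 possible values" mentioned earlier.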
4.3.2 BID (Binary Integer Decimal)
This method encodes the significand in binary directly.
The positions where the exponent and the significand start are not fixed.
Where the exponent starts depends on the two most significant bits of the Comb field.
Therefore, we index the bits of the whole encoding from b0 (the sign bit) to bk-1.
- When b1b2 != 11 (binary), the exponent consists of b1b2 and the following w bits, and the remaining bits are the significand.
- When b1b2 == 11 (binary) and b3b4 != 11 (binary), the exponent consists of b3b4 and the following w bits, and the remaining bits are the significand.
- ∞ and NaN follow the same rules as DPD.
For non-special values in BID, the exponent consists of two Comb bits (00, 01, or 10) and the following w bits, which gives $3 \times 2^w$ possible exponent values.
For DPD, the Comb field has 5 bits and the exponent continuation has w = 8 bits; the full exponent consists of the two exponent bits (00, 01, or 10) hidden in the Comb field and those w bits.
In both cases (BID and DPD), the hidden most significant part of the significand comes from the Comb field, and the remaining bits hold the rest of the significand.
In BID, T is encoded in binary directly.
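The BID rules can be sketched for decimal64 as follows. Several details are assumptions taken from the decimal64 format rather than from the text above: the exponent bias of 398, the 16-digit limit on the significand, and the rule that a significand above that limit is treated as zero. The function name `decode_bid64` is hypothetical.

```python
from decimal import Decimal

BIAS = 398             # assumed decimal64 exponent bias
MAX_T = 10**16 - 1     # assumed largest canonical 16-digit significand

def decode_bid64(bits):
    """Decode a finite BID-encoded decimal64 bit pattern (64-bit integer)."""
    s = bits >> 63                              # b0: sign
    if (bits >> 59) & 0b1111 == 0b1111:
        raise ValueError("infinity or NaN")
    if (bits >> 61) & 0b11 != 0b11:
        # b1b2 != 11: 10 exponent bits, then 53 significand bits (hidden 0)
        E = (bits >> 53) & 0x3FF
        T = bits & ((1 << 53) - 1)
    else:
        # b1b2 == 11: 10 exponent bits start at b3, hidden prefix 100
        E = (bits >> 51) & 0x3FF
        T = (0b100 << 51) | (bits & ((1 << 51) - 1))
    if T > MAX_T:
        T = 0                                   # non-canonical significand
    return Decimal((s, tuple(int(c) for c in str(T)), E - BIAS))

print(decode_bid64((1 << 63) | (397 << 53) | 15))  # -1.5
```

Returning a `decimal.Decimal` built from a (sign, digits, exponent) tuple keeps the decoded value exact.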
BCD encoding uses four bits to encode each digit, which wastes a significant share of the available bit patterns (10 of 16 states used).
DPD instead uses a mapping from decimal digits to binary for the mantissa that is denser than a code like 8421 BCD, which wastes space. For the new code, we want a positive power of two such that the ratio of the closest power of ten below it to that power of two is as close as possible to 1, in order to save space. On the other hand, the power of two should be as small as possible, so that the granularity is small and the code is easier to fit into floating-point formats with few bits.
We use 10 bits (0 to $2^{10} - 1 = 1023$) to represent 3 decimal digits (0 to 999).
(With BCD, 10 bits can hold only 2 full decimal digits, since the remaining 2 bits are not enough for a third.)
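The choice of 10 bits can be illustrated with a short calculation. The sketch below computes, for a k-bit group, how many decimal digits fit and what fraction of the 2^k states they use; the helper name `decimal_packing_efficiency` is made up for this example.

```python
def decimal_packing_efficiency(k):
    """Return (n, ratio): the most decimal digits n that fit in k bits,
    and the fraction 10**n / 2**k of bit patterns actually used."""
    power_of_two = 2 ** k
    n = len(str(power_of_two)) - 1   # largest n with 10**n <= 2**k
    return n, 10 ** n / power_of_two

for k in (4, 10):
    n, ratio = decimal_packing_efficiency(k)
    print(f"{k} bits -> {n} digit(s), {ratio:.3f} of the states used")
# 4 bits -> 1 digit(s), 0.625 of the states used
# 10 bits -> 3 digit(s), 0.977 of the states used
```

Among group sizes from 1 to 32 bits, k = 10 gives the highest ratio, which is why a 10-bit group for 3 digits is both dense and small.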
How are digits represented with the DPD table?
In the table, the left side shows the DPD-encoded value (a 10-bit group, a.k.a. a declet), and the right side shows the three original decimal digits.
For example, let's describe row 3.
c (green background),
i (green background), and
f (purple background) are the same bits on both sides. The only difference between the two sides is their relative positions.
The binary sequence on the left, encoded by DPD,
abcghf101i (10 bits), corresponds to three original digits, each encoded separately in binary (BCD code):
How is this expressed as a formula?
$0abc_2 \times 100 + 100f_2 \times 10 + 0ghi_2 \times 1$
In terms of the bit positions $b_9 \ldots b_0$ of the declet:
$[0\,b_9 b_8 b_7]_2 \times 100 + [1\,0\,0\,b_4]_2 \times 10 + [0\,b_6 b_5 b_0]_2 \times 1$
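Row 3 is one of eight rows. A sketch of a full declet decoder is shown below; the other seven rows are not given in the text above and are filled in here from the standard DPD decoding table, so treat them as an assumption of this example. The function name `decode_declet` is also hypothetical.

```python
def decode_declet(declet):
    """Decode a 10-bit DPD declet (b9 is the most significant bit) to 0..999."""
    if not 0 <= declet < 1024:
        raise ValueError("a declet has exactly 10 bits")
    b = [(declet >> i) & 1 for i in range(10)]   # b[i] is bit b_i
    hi = (b[9] << 1) | b[8]        # b9 b8
    mid = (b[6] << 1) | b[5]       # b6 b5
    lo = (b[2] << 1) | b[1]        # b2 b1
    c, f, i = b[7], b[4], b[0]     # low bit of each digit, always in place
    if b[3] == 0:                  # row 1: three small digits (0 to 7)
        d2, d1, d0 = hi * 2 + c, mid * 2 + f, lo * 2 + i
    elif lo == 0b00:               # row 2: last digit is 8 or 9
        d2, d1, d0 = hi * 2 + c, mid * 2 + f, 8 + i
    elif lo == 0b01:               # row 3: abcghf101i -> 0abc 100f 0ghi
        d2, d1, d0 = hi * 2 + c, 8 + f, mid * 2 + i
    elif lo == 0b10:               # row 4: first digit is 8 or 9
        d2, d1, d0 = 8 + c, mid * 2 + f, hi * 2 + i
    elif mid == 0b00:              # row 5: first and middle digits large
        d2, d1, d0 = 8 + c, 8 + f, hi * 2 + i
    elif mid == 0b01:              # row 6: first and last digits large
        d2, d1, d0 = 8 + c, hi * 2 + f, 8 + i
    elif mid == 0b10:              # row 7: middle and last digits large
        d2, d1, d0 = hi * 2 + c, 8 + f, 8 + i
    else:                          # row 8: all three digits large
        d2, d1, d0 = 8 + c, 8 + f, 8 + i
    return 100 * d2 + 10 * d1 + d0

print(decode_declet(0b0111011011))  # row 3 with a,b,c = 0,1,1; g,h = 1,0; f = 1; i = 1 -> 395
```

For the row 3 input, the result matches the formula: 0011 (3) × 100 + 1001 (9) × 10 + 0101 (5) × 1 = 395.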
BID bit layouts (decimal64):
s 00eeeeeeee (0)ttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 01eeeeeeee (0)ttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 10eeeeeeee (0)ttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 1100eeeeeeee (100)t tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 1101eeeeeeee (100)t tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 1110eeeeeeee (100)t tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 11110 xx...x ±infinity
s 11111 0x...x qNaN
s 11111 1x...x sNaN
DPD bit layouts (decimal64):
s 00 TTT (00)eeeeeeee (0TTT)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 01 TTT (01)eeeeeeee (0TTT)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 10 TTT (10)eeeeeeee (0TTT)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 1100 T (00)eeeeeeee (100T)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 1101 T (01)eeeeeeee (100T)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 1110 T (10)eeeeeeee (100T)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
decimal64 floating-point format - Wikipedia