DEV Community

Zhi Song

Basic Understanding of Decimal Floating-Point Number

English and Chinese are available.

1. Definition of Constants

1. `b`, base (radix), 2 or 10

2. `p`, precision

3. `emax`, maximum value of the exponent

4. `emin`, minimum value of the exponent

For all floating-point formats, `emin = 1 - emax`.

2. Representation of floating-point format

• A floating-point format is specified by its radix and the number of bits in its encoding (e.g. binary64).

• A finite floating-point number is represented as:

$(-1)^s \times b^e \times m$

1. s is 0 or 1.
2. e is an integer with emin <= e <= emax.
3. m is a number represented by a digit string of the form $d_0 . d_1 d_2 \dots d_{p-1}$, where each $d_i$ is an integer digit with $0 \le d_i < b$ (therefore $0 \le m < b$).

Besides the finite numbers, every format also contains:

4. Two infinities, +∞ and −∞.
5. Two NaNs (Not a Number), qNaN (quiet) and sNaN (signaling).

3. Binary Floating-Point Numbers

Binary floating point is the familiar representation; it was first standardized in IEEE 754-1985.

A binary floating-point number is encoded in three fields:

1. S: sign, 1 bit
2. E: biased exponent, w bits, E = e + bias
3. T: trailing significand, t bits, t = p - 1, T = d1d2⋯dp−1 [di: binary digit]
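
| Param | binary16 | binary32 | binary64 | binary128 | binary{k} |
|---|---|---|---|---|---|
| k | 16 | 32 | 64 | 128 | k (multiple of 32) |
| p | 11 | 24 | 53 | 113 | k − round(4·log2(k)) + 13 |
| emax | 15 | 127 | 1023 | 16383 | 2^(k−p−1) − 1 |
| bias | 15 | 127 | 1023 | 16383 | emax |
| sign bit | 1 | 1 | 1 | 1 | 1 |
| w | 5 | 8 | 11 | 15 | round(4·log2(k)) − 13 |
| t | 10 | 23 | 52 | 112 | k − w − 1 |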
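As a quick check of the general binary{k} column, here is a minimal Python sketch that evaluates those formulas. The function name `binary_params` is only illustrative, and the sketch assumes the general formulas are meant for k >= 128 (they do not reproduce the binary16 to binary64 columns).

```python
import math

def binary_params(k):
    """Evaluate the binary{k} column of the table (assumed valid for k >= 128)."""
    w = round(4 * math.log2(k)) - 13   # exponent field width
    t = k - w - 1                      # trailing significand field width
    p = t + 1                          # precision: t stored bits plus the implicit leading bit
    emax = 2 ** (k - p - 1) - 1        # largest exponent
    return {"k": k, "p": p, "emax": emax, "bias": emax, "w": w, "t": t}

print(binary_params(128))
```

For k = 128 this gives the same values as the binary128 column.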

Example of binary32

1). Normalized number

E is an 8-bit unsigned integer in the range 0 to 255, so E − bias ranges from −127 to 128.

For a normalized number, the exponent bits $E_0 \dots E_{w-1}$ are neither all 0 (E = 0, E − bias = −127) nor all 1 (E = 255, E − bias = 128).

The value of the floating-point number is then $(-1)^S \times 2^{E-\text{bias}} \times m$.

$m = 1 + 2^{-t} T = 1 + \sum_{i=1}^{t} 2^{-i} d_i$

E − bias lies in the open interval (−127, 128), that is, −126 <= E − bias <= 127.

m is the significand in scientific notation: it is greater than or equal to 1 and strictly less than 2.

S, E, and T are stored in binary.

Example:

float(binary32): 9.0

-> In binary: 1001.0

-> Representation: $(-1)^0 \times 2^{3} \times 1.001_2$, so the stored exponent is E = 3 + 127 = 130 (bias: 127)

-> Layout:

S(+) E(130) T(001 followed by zeros)

0 10000010 00100000000000000000000
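The layout of 9.0 can be checked with a short Python sketch. It uses the standard `struct` module to get the binary32 bit pattern; the helper name `binary32_fields` is only illustrative.

```python
import struct

def binary32_fields(x):
    """Round x to binary32 and split its bit pattern into the fields (S, E, T)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31              # 1 sign bit
    e = (bits >> 23) & 0xFF     # 8 biased-exponent bits
    t = bits & 0x7FFFFF         # 23 trailing significand bits
    return s, e, t

s, e, t = binary32_fields(9.0)
print(f"{s:01b} {e:08b} {t:023b}")  # 0 10000010 00100000000000000000000
```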

Tip.

Why does the bias exist?

E is stored as an unsigned integer, so its value is in the range 0 to 255. However, a real exponent can be negative. To keep the range symmetric, the middle value (127) is added to the exponent when it is stored and subtracted again when it is used. So the real range of E − bias is −127 to 128.

2). Subnormalized Number

When $E_0 \dots E_{w-1}$ are all 0 (E = 0), the exponent is fixed at emin = −126 (NOT −127, which is a special case), and the implicit leading bit of the significand changes from 1 to 0.

Because $0 \le T < 2^t$, a normalized significand satisfies $1 \le m < 2$, so a normalized number cannot represent 0. Therefore, when the absolute value of a number is smaller than $b^{\mathrm{emin}}$, it is represented in the subnormal format instead.

The value of the floating-point number is then $(-1)^S \times 2^{\mathrm{emin}-t} \times T = (-1)^S \times 2^{\mathrm{emin}} \times \sum_{i=1}^{t} 2^{-i} d_i$

(Zero is represented as `S 000...000`.)
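The normalized and subnormal formulas can be put side by side in a small Python sketch. It is a sketch for finite binary32 values only (E = 255 is not handled), and the name `binary32_value` is only illustrative; `Fraction` is used so the result is exact.

```python
from fractions import Fraction

BIAS, T_BITS, EMIN = 127, 23, -126   # binary32 parameters

def binary32_value(s, e, t):
    """Exact value of a finite binary32 number from its fields S, E, T."""
    if e == 0:
        # Subnormal (or zero): leading bit is 0 and the exponent is emin.
        m = Fraction(t, 2 ** T_BITS)
        exponent = EMIN
    else:
        # Normalized: implicit leading bit is 1.
        m = 1 + Fraction(t, 2 ** T_BITS)
        exponent = e - BIAS
    return (-1) ** s * m * Fraction(2) ** exponent

print(binary32_value(0, 130, 1 << 20))   # 9
print(binary32_value(0, 0, 1))           # smallest positive subnormal, 2**-149
```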

3). Special Values

`+∞`: S = 0, every bit of E is 1, T = 0

`-∞`: S = 1, every bit of E is 1, T = 0

`NaN`: every bit of E is 1 and T is not 0

1. A quiet NaN and a signaling NaN differ in a flag bit of the significand field.

2. A quiet NaN propagates through most operations without raising any additional exception (the FPU does not raise a hardware exception). The exceptions are operations that cannot simply pass a NaN through to the output intact, for example some format conversions and some comparison operations.

A signaling NaN is the opposite: using it as an operand signals an exception.
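The bit patterns above can be turned into a small classifier. This is a minimal sketch for binary32; it assumes the common convention (recommended by IEEE 754-2008) that the most significant bit of T is 1 for a quiet NaN and 0 for a signaling NaN. The name `classify_binary32` is only illustrative.

```python
def classify_binary32(bits):
    """Classify a 32-bit pattern as a special value, zero, subnormal, or normal."""
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    t = bits & 0x7FFFFF
    if e == 0xFF:                       # every bit of E is 1
        if t == 0:
            return "-inf" if s else "+inf"
        # Assumption: top bit of T set means quiet, clear means signaling.
        return "qNaN" if t >> 22 else "sNaN"
    if e == 0:                          # every bit of E is 0
        return "zero" if t == 0 else "subnormal"
    return "normal"

print(classify_binary32(0x7F800000))    # +inf
print(classify_binary32(0x7FC00000))    # qNaN
```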

4. Decimal Floating-Point Numbers

Take a brief look

Decimal floating-point formats have two encoding methods: DPD (Densely Packed Decimal) and BID (Binary Integer Decimal).

4.0 Value of Decimal Floating-Point Number

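$(-1)^S \times T \times 10^{E-\text{bias}}$

Here T is the whole significand taken as a decimal integer and E is the whole biased exponent. Both are computed from the combination field together with their respective continuation fields.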
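The value formula can be tried out with Python's standard `decimal` module. This is only a sketch of the formula itself, not of any encoding; the function name `decimal_value` and the sample numbers are illustrative, with the decimal64 bias of 398 taken from the parameter table in section 4.1.

```python
from decimal import Decimal

def decimal_value(s, t, e, bias):
    """(-1)^S * T * 10^(E - bias), built exactly from sign, digits, exponent."""
    digits = tuple(int(d) for d in str(t))
    return Decimal((s, digits, e - bias))

# decimal64: T = 12345 and E = 396 give 12345 * 10^(396 - 398)
print(decimal_value(0, 12345, 396, 398))   # 123.45
```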

4.1 Organization

1. S: sign, 1 bit
2. Comb: combination field, 5 bits
3. E: exponent continuation, w bits; the biased exponent is E = e + bias
4. T: trailing significand, t bits, encoding the remaining p - 1 digits d1d2⋯dp−1
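
| Param | decimal32 | decimal64 | decimal128 | decimal{k} |
|---|---|---|---|---|
| k | 32 | 64 | 128 | k (multiple of 32) |
| p | 7 | 16 | 34 | 9k/32 − 2 |
| emax | 96 | 384 | 6144 | 3 × 2^(k/16+3) |
| bias | 101 | 398 | 6176 | emax + p − 2 |
| sign bit | 1 | 1 | 1 | 1 |
| w | 6 | 8 | 12 | k/16 + 4 |
| t | 20 | 50 | 110 | 15k/16 − 10 |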
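The decimal{k} formulas can be evaluated with a few lines of Python as a sanity check on the table. The name `decimal_params` is only illustrative; k is assumed to be a multiple of 32 so that every division is exact.

```python
def decimal_params(k):
    """Evaluate the decimal{k} column of the table (k a multiple of 32)."""
    p = 9 * k // 32 - 2                 # precision in decimal digits
    emax = 3 * 2 ** (k // 16 + 3)       # largest exponent
    return {
        "k": k,
        "p": p,
        "emax": emax,
        "bias": emax + p - 2,
        "w": k // 16 + 4,               # exponent continuation width
        "t": 15 * k // 16 - 10,         # trailing significand width
    }

for k in (32, 64, 128):
    print(decimal_params(k))
```

For k = 128 the formula gives bias = 6144 + 34 − 2 = 6176.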

4.2 Significand -- Difference between BID & DPD

BID and DPD both encode each field into binary for storage. They differ in the significand: BID takes the significant digits of the scientific notation and converts them directly into a binary integer, while DPD uses a mapping table in which every 3 decimal digits correspond to 10 bits.

Some hardware, such as IBM POWER, supports decimal arithmetic directly. In that case the format is used directly to store and compute numbers, with DPD as the encoding that packs the decimal digits into binary for storage.

4.3 Comb

In both DPD and BID cases, the most significant 4 bits of the significand (which actually only have 10 possible values [0~9]) are combined with the most significant 2 bits of the exponent (3 possible values) to use 30 of the 32 possible values of a 5-bit field. The remaining combinations encode infinities and NaNs.

| Combination field | Exponent MSBs | Significand MSBs | Other |
|---|---|---|---|
| 00mmm | 00 | 0mmm | |
| 01mmm | 01 | 0mmm | |
| 10mmm | 10 | 0mmm | |
| 1100m | 00 | 100m | |
| 1101m | 01 | 100m | |
| 1110m | 10 | 100m | |
| 11110 | | | ±Infinity |
| 11111 | | | NaN. Sign bit ignored. First bit of exponent continuation field determines if NaN is signaling. |

The leading `0` and `100` shown in the Significand MSBs column are NOT stored in the significand field.

They are derived from the Comb field.
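The combination-field table translates into a few bit tests. The following Python sketch decodes a 5-bit Comb value into the two exponent MSBs and the leading significand digit; the function name `decode_comb` and its return shapes are only illustrative.

```python
def decode_comb(comb):
    """Decode a 5-bit combination field.

    Returns (exponent MSBs, leading digit) for finite numbers,
    or the string "Infinity" / "NaN" for special values.
    """
    if comb >> 1 == 0b1111:             # 1111x: special values
        return "NaN" if comb & 1 else "Infinity"
    if comb >> 3 == 0b11:               # 11eem: leading digit is 100m (8 or 9)
        return (comb >> 1) & 0b11, 8 + (comb & 1)
    return comb >> 3, comb & 0b111      # eemmm: leading digit is 0mmm (0 to 7)

print(decode_comb(0b01101))   # (1, 5)
print(decode_comb(0b11011))   # (1, 9)
```

Of the 32 possible Comb values, 30 decode to a finite (exponent MSBs, digit) pair and 2 to a special value.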

4.3.1 DPD

The Comb field occupies 5 bits in decimal64; these 5 bits come from the exponent (E) and the significand (T).

• G0G1 G2G3G4 (G0G1 is not 11)

G0G1: the two most significant bits of the exponent.

G2G3G4: the three low bits of the leading significand digit, `0G2G3G4` (0 to 7).

• 11 G2G3 G4 (G2G3 is not 11)

G2G3: the two most significant bits of the exponent.

G4: the leading significand digit is 8 + G4, i.e. `100G4` in binary (8 or 9).

• 1111 G4

Special value: infinity or NaN.

4.3.2 BID (Binary Integer Decimal)

BID encodes the significand directly as a binary integer.

The positions where the exponent and the significand start are not fixed.

Where the exponent starts depends on the two most significant bits of the Comb field.

Therefore, we index the bits of the whole floating-point encoding from b0 (the sign bit) to bk-1.

The rules:

• When b1b2 (2) != 11(2), the exponent consists of b1b2 and the following w bits; the remaining bits are the significand.
• When b1b2 (2) == 11(2) and b3b4 (2) != 11(2), the exponent consists of b3b4 and the following w bits; the remaining bits are the significand, prefixed by an implicit 100.
• ∞ and NaN follow the same rules as in DPD.
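The rules above can be sketched in Python for decimal64 (w = 8, bias = 398). This is a minimal sketch, not a complete decoder: special values are only detected, and significands larger than 10^16 − 1 (non-canonical encodings) are not treated specially. Note that Python numbers bits from the least significant end, so the article's b0 (sign) is bit 63 here. The name `decode_bid64` is only illustrative.

```python
from decimal import Decimal

W, BIAS = 8, 398   # decimal64: exponent continuation width and bias

def decode_bid64(bits):
    """Decode a 64-bit BID pattern into a Decimal, or "special" for inf/NaN."""
    s = bits >> 63
    if ((bits >> 59) & 0b1111) == 0b1111:        # b1..b4 == 1111
        return "special"
    if ((bits >> 61) & 0b11) != 0b11:            # b1b2 != 11
        e = (bits >> 53) & 0x3FF                 # b1b2 plus the next w bits
        sig = bits & ((1 << 53) - 1)             # remaining 53 bits
    else:                                        # b1b2 == 11, b3b4 != 11
        e = (bits >> 51) & 0x3FF                 # b3b4 plus the next w bits
        sig = (0b100 << 51) | (bits & ((1 << 51) - 1))   # implicit 100 prefix
    digits = tuple(int(d) for d in str(sig))
    return Decimal((s, digits, e - BIAS))

print(decode_bid64((396 << 53) | 12345))   # 123.45
```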

4.4 Exponent

For a non-special value in BID, the exponent consists of two Comb bits (00/01/10) followed by w more bits, which gives 3 × 2^w possible exponent values.

For DPD (decimal64), the Comb field has 5 bits and the exponent continuation has w = 8 bits; the full exponent is the two exponent bits (00/01/10) taken from the Comb field followed by those w bits.

4.5 Significand

In both encodings (BID and DPD), the leading part of the significand is not stored directly but comes from the Comb field, and the remaining bits hold the rest of the significand.

4.5.1 BID

T will be encoded in binary directly.

4.5.2 DPD

BCD encoding uses four bits for each digit, which wastes a significant share of the encoding space (only 10 of the 16 states are used).

DPD instead uses a mapping from decimal to binary, so that the mantissa stays decimal without the wasted space of a code like 8421 BCD. For the new code, we want a positive power of two such that the closest smaller power of ten is as near to it as possible (a ratio close to 1), in order to save space. On the other hand, the power of two should be as small as possible, so that the granularity is fine and it is easier to fit the code into floating-point formats with few bits.

We use 10 bits (0 to 2^10 − 1 = 1023) to represent 3 decimal digits (0 to 999).

(With BCD, 10 bits can hold only two full digits, with 2 bits left over.)
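The search for a good group size can be reproduced with a short Python sketch: for each bit count it finds the largest power of ten that fits and the fraction of states that power of ten uses. The name `packing_efficiency` is only illustrative.

```python
def packing_efficiency(n_bits):
    """Return (digits, efficiency) for the largest power of ten that fits in n_bits.

    efficiency = 10**digits / 2**n_bits, the share of bit patterns actually used.
    """
    digits = len(str(2 ** n_bits)) - 1     # floor(log10(2**n_bits)), computed exactly
    return digits, 10 ** digits / 2 ** n_bits

for n in (4, 7, 10, 14, 17, 20):
    digits, eff = packing_efficiency(n)
    print(f"{n:2d} bits -> {digits} digits, efficiency {eff:.4f}")
```

Among group sizes of 1 to 20 bits, 10 bits gives the highest efficiency (1000/1024), compared with 10/16 for a single BCD digit.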

How are digits represented by the DPD table?

On the left of the table are the DPD-encoded values (10-bit declets), and on the right are the original three decimal digits.

For example, let's describe row 3.

The letters `a`, `b`, `c` (green background), `g`, `h`, `i` (green background), and `f` (purple background) are the same bits on both sides. The only difference between the two sides is their relative positions.

The DPD-encoded binary sequence on the left, `abcghf101i` (10 bits), corresponds to three original digits, each encoded separately in binary (BCD code): `0abc` (d2), `100f` (d1), `0ghi` (d0).

How is this expressed as a formula?

`0abc`(2) * 100(10) + `100f`(2) * 10(10) + `0ghi`(2) * 1(10)

In terms of the declet bits b9...b0:

$[0\,b_9 b_8 b_7]_2 \times 100 + [1\,0\,0\,b_4]_2 \times 10 + [0\,b_6 b_5 b_0]_2 \times 1$
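The row described above is one of eight cases in the standard DPD decoding table. The following Python sketch decodes a whole declet; only row 3 (`abcghf101i`) is spelled out in the text here, so the other seven cases are filled in from the standard DPD table as an assumption, and the names `decode_declet`, `small`, and `large` are only illustrative.

```python
def decode_declet(declet):
    """Decode one 10-bit DPD declet into an integer from 0 to 999."""
    b = [(declet >> i) & 1 for i in range(10)]    # b[i] is bit b_i

    def small(x, y, z):        # BCD digit 0xyz (0 to 7)
        return 4 * x + 2 * y + z

    def large(z):              # BCD digit 100z (8 or 9)
        return 8 + z

    if b[3] == 0:                                   # all three digits small
        d2, d1, d0 = small(b[9], b[8], b[7]), small(b[6], b[5], b[4]), small(b[2], b[1], b[0])
    elif (b[2], b[1]) == (0, 0):                    # only d0 large
        d2, d1, d0 = small(b[9], b[8], b[7]), small(b[6], b[5], b[4]), large(b[0])
    elif (b[2], b[1]) == (0, 1):                    # row 3: abcghf101i, only d1 large
        d2, d1, d0 = small(b[9], b[8], b[7]), large(b[4]), small(b[6], b[5], b[0])
    elif (b[2], b[1]) == (1, 0):                    # only d2 large
        d2, d1, d0 = large(b[7]), small(b[6], b[5], b[4]), small(b[9], b[8], b[0])
    elif (b[6], b[5]) == (0, 0):                    # d2 and d1 large
        d2, d1, d0 = large(b[7]), large(b[4]), small(b[9], b[8], b[0])
    elif (b[6], b[5]) == (0, 1):                    # d2 and d0 large
        d2, d1, d0 = large(b[7]), small(b[9], b[8], b[4]), large(b[0])
    elif (b[6], b[5]) == (1, 0):                    # d1 and d0 large
        d2, d1, d0 = small(b[9], b[8], b[7]), large(b[4]), large(b[0])
    else:                                           # all three digits large
        d2, d1, d0 = large(b[7]), large(b[4]), large(b[0])
    return 100 * d2 + 10 * d1 + d0

# Row 3 with abc = 010, gh = 01, f = 1, i = 1: digits 2, 9, 3
print(decode_declet(0b0100111011))   # 293
```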

4.6 Model

BID
```
s 00eeeeeeee   (0)ttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 01eeeeeeee   (0)ttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 10eeeeeeee   (0)ttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
```

```
s 1100eeeeeeee (100)t tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 1101eeeeeeee (100)t tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 1110eeeeeeee (100)t tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
```

Special Value.

```
s 11110 xx...x    ±infinity
s 11111 0x...x    qNaN
s 11111 1x...x    sNaN
```

DPD
```
s 00 TTT (00)eeeeeeee (0TTT)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 01 TTT (01)eeeeeeee (0TTT)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 10 TTT (10)eeeeeeee (0TTT)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
```

```
s 1100 T (00)eeeeeeee (100T)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 1101 T (01)eeeeeeee (100T)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 1110 T (10)eeeeeeee (100T)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
```
