Zhi Song

Posted on Apr 7, 2021 • Edited on Apr 14, 2021

Basic Understanding of Decimal Floating-Point Number

#computerscience #swift #cpp

English and Chinese are available.

1. Definition of Constants

b: base (radix), 2 or 10
p: precision (number of digits in the significand)
emax: maximum value of the exponent
emin: minimum value of the exponent

For all floating-point formats, emin = 1 − emax.

2. Representation of floating-point format

A floating-point format is specified by its radix and the number of bits in its encoding (e.g. binary64).
A floating-point number in such a format has the form:

(−1)^s × b^e × m

s: 0 or 1

e: an integer with emin <= e <= emax

m: a number represented by a digit string of the form d₀.d₁d₂...d_{p−1}, where each d_i is an integer digit with 0 <= d_i < b (therefore 0 <= m < b)

Besides these finite numbers, every format also contains:

Two infinities, +∞ and −∞.

Two NaNs (Not a Number): qNaN (quiet) and sNaN (signaling).

3. Binary Floating-Point Number

Binary floating point is the familiar representation; it was already standardized in IEEE 754-1985.

A binary floating-point number is stored in three fields:

S: sign, 1 bit
E: biased exponent, w bits, E = e + bias
T: trailing significand, t bits, t = p − 1, T = d₁d₂⋯d_{p−1} (each d_i is a binary digit)

Param	binary16	binary32	binary64	binary128	binary{k}
k	16	32	64	128	multiple of 32 (k ≥ 128)
p	11	24	53	113	k-round(4log2 k)+13
emax	15	127	1023	16383	2^{k-p-1}-1
bias	15	127	1023	16383	emax
sign bit	1	1	1	1	1
w	5	8	11	15	round(4log2 k)−13
t	10	23	52	112	k-w-1
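The last column of the table can be checked with a few lines of code. The following is only a sketch: Python is used for illustration, the function name `binary_params` is hypothetical, and the general formulas are assumed to apply to k ≥ 128 with k a multiple of 32.

```python
import math

def binary_params(k):
    """Evaluate the binary{k} column (assumes k >= 128 and k % 32 == 0)."""
    if k < 128 or k % 32 != 0:
        raise ValueError("the general formulas assume k >= 128, k a multiple of 32")
    w = round(4 * math.log2(k)) - 13   # exponent field width
    p = k - w                          # precision: k - round(4 log2 k) + 13
    t = k - w - 1                      # trailing significand width
    emax = 2 ** (k - p - 1) - 1        # maximum exponent
    return {"k": k, "p": p, "emax": emax, "bias": emax, "w": w, "t": t}

print(binary_params(128))
```

For k = 128 this reproduces the binary128 column: p = 113, emax = bias = 16383, w = 15, t = 112.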

Example of binary32

1). Normalized number

E is an 8-bit unsigned integer in the range 0 to 255, so E − bias ranges from −127 to 128.

For a normalized number, the bits E₀ ~ E_{w−1} are neither all 0 (E = 0, E − bias = −127) nor all 1 (E = 255, E − bias = 128).

The value of a normalized number is (−1)^S × 2^(E − bias) × m, where

m = 1 + 2^(−t) × T = 1 + ∑(i = 1 ~ t) 2^(−i) × d_i

E − bias: −126 to 127, i.e. the open interval (−127, 128)

m is greater than or equal to 1 and less than 2, like the leading factor in scientific notation. Its leading 1 is implicit and is not stored.

S, E, and T are stored in binary.

Example:

float (binary32): 9.0

-> Convert to binary: 1001.0 = 1.001 × 2³

-> Representation: (−1)⁰ × 2^(130 − 127) × 1.001₂, so E = 3 + 127 = 130 (bias: 127)

-> Layout:

S(+) E(130) T(001 followed by 20 zeros)

0 10000010 00100000000000000000000
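The layout of 9.0 can be inspected directly. This is a minimal sketch using Python's standard `struct` module; the helper name `binary32_fields` is made up for this example.

```python
import struct

def binary32_fields(x):
    """Split a float, rounded to binary32, into its S, E and T fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31            # 1 sign bit
    e = (bits >> 23) & 0xFF   # 8 bits of biased exponent
    t = bits & 0x7FFFFF       # 23 bits of trailing significand
    return s, e, t

s, e, t = binary32_fields(9.0)
print(s, format(e, "08b"), format(t, "023b"))
# prints: 0 10000010 00100000000000000000000
```

The stored exponent is 130 = 3 + 127, and the trailing significand holds only the digits after the implicit leading 1.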

Tip.

Why does the bias exist?

E is stored as an unsigned integer, so its value lies in the range 0 to 255. The actual exponent, however, can be negative. To keep the range roughly symmetric around zero, the middle value (127) is added to the exponent when it is stored and subtracted again when it is used. The actual exponent E − bias therefore spans −127 to 128, of which −126 to 127 are available to normalized numbers.

2). Subnormal Number

When E₀ ~ E_{w−1} are all 0 (E = 0), the exponent is taken to be emin = −126 (NOT −127, so that subnormal numbers continue directly below the smallest normalized number), and the implicit leading bit changes from 1 to 0.

Since 0 <= T < 2^t gives 1 <= m < 2, a normalized number cannot represent 0. Therefore, when the absolute value of a number is less than b^emin, it is stored in the subnormal format instead.

The value of a subnormal number is (−1)^S × 2^(emin − t) × T = (−1)^S × 2^emin × ∑(i = 1 ~ t) 2^(−i) × d_i

(0 is represented as S 000...000, i.e. E = 0 and T = 0.)
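The normalized and subnormal formulas can be put side by side in code. This is a sketch for binary32 only (bias 127, emin = −126, t = 23); the function name `binary32_value` is hypothetical, and infinities and NaNs are deliberately rejected.

```python
def binary32_value(bits):
    """Evaluate a finite binary32 bit pattern with the two formulas."""
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    t = bits & 0x7FFFFF
    if e == 0xFF:
        raise ValueError("infinity or NaN, not a finite number")
    sign = -1.0 if s else 1.0
    if e == 0:
        # Subnormal (or zero): (-1)^S * 2^(emin - t) * T
        return sign * 2.0 ** (-126 - 23) * t
    # Normalized: (-1)^S * 2^(E - bias) * (1 + 2^-t * T)
    return sign * 2.0 ** (e - 127) * (1 + t * 2.0 ** -23)

print(binary32_value(0x41100000))   # the 9.0 example
print(binary32_value(0x00000001))   # smallest positive subnormal, 2^-149
```

Python floats are binary64, so every finite binary32 value is represented exactly here.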

3). Special Values

+∞: S = 0, every bit of E is 1, T = 0

−∞: S = 1, every bit of E is 1, T = 0

NaN: every bit of E is 1 and T ≠ 0

A quiet NaN and a signaling NaN differ in a flag bit of the trailing significand field (on most platforms, its most significant bit).

A quiet NaN does not raise any additional exception (the FPU does not raise a hardware exception) as it propagates through most operations. The exceptions are operations that cannot simply pass a NaN through to the output unchanged, such as format conversions and some comparison operations.

A signaling NaN is the opposite: using it as an operand raises the invalid-operation exception.
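The special values can be told apart from the bit pattern alone. The sketch below classifies a binary32 pattern; the name `classify_binary32` is hypothetical, and it assumes the common convention that a set most significant bit of T marks a quiet NaN (some older hardware uses the opposite convention).

```python
def classify_binary32(bits):
    """Classify a binary32 bit pattern as finite, infinite, qNaN or sNaN."""
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    t = bits & 0x7FFFFF
    if e != 0xFF:
        return "finite"
    if t == 0:
        return "-inf" if s else "+inf"
    # Assumption: top bit of T set means quiet, clear means signaling.
    return "qNaN" if t >> 22 else "sNaN"

for pattern in (0x7F800000, 0xFF800000, 0x7FC00000, 0x7FA00000):
    print(hex(pattern), classify_binary32(pattern))
```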

4. Decimal Floating-Point Number

A brief overview.

Decimal floating-point numbers have two encoding methods: DPD (Densely Packed Decimal) and BID (Binary Integer Decimal).

4.0 Value of Decimal Floating-Point Number

(-1)^S * T * 10^{E - bias}

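Here T is the whole significand read as an integer, and E is the biased exponent. Both are assembled from the combination field and their respective continuation fields (the exponent continuation and the significand continuation).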

4.1 Organization

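S: sign, 1 bit
Comb: combination field, 5 bits
E: exponent continuation, w bits (the biased exponent is e + bias)
T: trailing significand (significand continuation), t bits, holding the rest of the p-digit significand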

Param	decimal32	decimal64	decimal128	decimal{k}
k	32	64	128	multiple of 32
p	7	16	34	9k/32-2
emax	96	384	6144	3*2^(k/16+3)
bias	101	398	6176	emax+p-2
sign bit	1	1	1	1
w	6	8	12	k/16+4
t	20	50	110	15*k/16-10
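The decimal{k} formulas in the last column can be evaluated the same way. This is a sketch with a hypothetical function name `decimal_params`; it assumes k is a multiple of 32.

```python
def decimal_params(k):
    """Evaluate the decimal{k} column (assumes k >= 32 and k % 32 == 0)."""
    if k < 32 or k % 32 != 0:
        raise ValueError("k must be a multiple of 32")
    p = 9 * k // 32 - 2             # precision in decimal digits
    emax = 3 * 2 ** (k // 16 + 3)   # maximum exponent
    bias = emax + p - 2             # exponent bias
    w = k // 16 + 4                 # exponent continuation width
    t = 15 * k // 16 - 10           # trailing significand width
    return {"k": k, "p": p, "emax": emax, "bias": bias, "w": w, "t": t}

for k in (32, 64, 128):
    params = decimal_params(k)
    # sign bit + 5-bit combination field + w + t must fill the format
    assert 1 + 5 + params["w"] + params["t"] == k
    print(params)
```

For k = 128 the formula gives bias = 6144 + 34 − 2 = 6176.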

4.2 Significand -- Difference between BID & DPD

BID and DPD both encode each part of the number into binary for storage. They differ in the significand: BID takes the significand (the significant digits from scientific notation) as a whole integer and stores it directly in binary, while DPD uses a mapping table in which every 3 decimal digits correspond to 10 binary digits.

Some hardware, such as IBM POWER, supports decimal arithmetic directly and can store and compute on numbers in the decimal format as-is. Otherwise, the decimal digits have to be converted into binary for storage, for example with DPD encoding.

4.3 Comb

In both DPD and BID cases, the most significant 4 bits of the significand (which actually only have 10 possible values [0~9]) are combined with the most significant 2 bits of the exponent (3 possible values) to use 30 of the 32 possible values of a 5-bit field. The remaining combinations encode infinities and NaNs.

Combination field	Exponent Msbits	Significand Msbits	Other
00mmm	00	0mmm	—
01mmm	01	0mmm	—
10mmm	10	0mmm	—
1100m	00	100m	—
1101m	01	100m	—
1110m	10	100m	—
11110	—	—	±Infinity
11111	—	—	NaN. Sign bit ignored. First bit of exponent continuation field determines if NaN is signaling.

The leading 0 and 100 shown in the Significand Msbits column are NOT stored in the significand field.

They are implied by the combination field.

4.3.1 DPD

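The Comb part occupies 5 bits (in decimal64 as in the other formats), and those 5 bits come from the exponent (E) and the significand (T).

Case 1: G₀G₁ G₂G₃G₄, with G₀G₁ ≠ 11

G₀G₁: the two most significant bits of the exponent.

G₂G₃G₄: the three low bits of the most significant digit of the significand, i.e. the digit is 0G₂G₃G₄₍₂₎ (0 ~ 7).

Case 2: 11 G₂G₃ G₄, with G₂G₃ ≠ 11

G₂G₃: the two most significant bits of the exponent.

G₄: the most significant digit of the significand is 100G₄₍₂₎ = 8₍₁₀₎ + G₄ (8 or 9).

Case 3: 1111 G₄

Special value: infinity (G₄ = 0) or NaN (G₄ = 1).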
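The three cases of the combination field translate into a short decoder. This is a sketch for the DPD interpretation; the function name `decode_comb` and its return shape are made up for this example.

```python
def decode_comb(g):
    """Decode the 5-bit combination field G0..G4 (G0 is the top bit of g).

    Returns (kind, exponent_msbs, leading_digit).
    """
    if g >> 3 != 0b11:
        # G0G1 != 11: exponent bits G0G1, leading digit 0G2G3G4 (0..7)
        return ("finite", g >> 3, g & 0b111)
    if (g >> 1) & 0b11 != 0b11:
        # 11 G2G3 G4: exponent bits G2G3, leading digit 100G4 (8 or 9)
        return ("finite", (g >> 1) & 0b11, 8 + (g & 1))
    # 1111 G4: special value
    return ("infinity" if g & 1 == 0 else "NaN", None, None)

for g in (0b01101, 0b11011, 0b11110, 0b11111):
    print(format(g, "05b"), decode_comb(g))
```

Of the 32 possible field values, 30 decode to a finite (exponent bits, digit) pair, matching 3 exponent prefixes times 10 leading digits.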

4.3.2 BID (Binary Integer Decimal)

BID encodes the whole significand as a binary integer directly.

As a result, the positions where the exponent bits and the significand bits start are not fixed.

Where the exponent starts depends on the two most significant bits of the Comb.

Therefore, we index the bits of the whole floating-point encoding from b₀ (the sign bit) to b_{k−1}.

The rules:

When b₁b₂ ₍₂₎ != 11₍₂₎, the exponent consists of b₁b₂ and the following w bits; the remaining bits, prefixed with an implicit 0, are the significand.
When b₁b₂ ₍₂₎ == 11₍₂₎ and b₃b₄ ₍₂₎ != 11₍₂₎, the exponent consists of b₃b₄ and the following w bits; the remaining bits, prefixed with an implicit 100, are the significand.
∞ and NaN follow the same rules as in DPD.
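The BID rules can be sketched for decimal64 (k = 64, w = 8, bias = 398). The function names `decode_bid64` and `bid64_value` are hypothetical. The sketch does not handle infinities, NaNs, or the check for coefficients larger than 16 digits.

```python
from decimal import Decimal

BIAS = 398  # decimal64 bias

def decode_bid64(bits):
    """Return (sign, biased exponent, coefficient) of a finite BID decimal64."""
    sign = bits >> 63                               # b0
    if (bits >> 61) & 0b11 != 0b11:
        exponent = (bits >> 53) & 0x3FF             # b1..b10
        coefficient = bits & ((1 << 53) - 1)        # implicit 0 + 53 bits
    elif (bits >> 59) & 0b11 != 0b11:
        exponent = (bits >> 51) & 0x3FF             # b3..b12
        coefficient = (0b100 << 51) | (bits & ((1 << 51) - 1))  # implicit 100
    else:
        raise ValueError("infinity or NaN")
    return sign, exponent, coefficient

def bid64_value(bits):
    """Evaluate (-1)^S * T * 10^(E - bias) exactly with Decimal."""
    sign, exponent, coefficient = decode_bid64(bits)
    value = Decimal(coefficient).scaleb(exponent - BIAS)
    return -value if sign else value

print(decode_bid64(0x31C0000000000001))   # (0, 398, 1)
print(bid64_value(0x31C0000000000001))    # 1
```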

4.4 Exponent

For a non-special value in BID, the exponent consists of the two Comb bits (00/01/10) followed by the next w bits. Number of possible exponent values: 3 × 2^w

In DPD, the Comb has 5 bits and the exponent continuation has w bits (8 for decimal64); the full exponent consists of the two exponent bits (00/01/10) hidden in the Comb followed by those w bits.

4.5 Significand

In both cases (BID and DPD), the most significant part of the significand is hidden in the Comb part, and the remaining bits hold the rest of the significand.

4.5.1 BID

T will be encoded in binary directly.

4.5.2 DPD

BCD encoding uses four bits for each digit, which wastes a significant share of the available bit patterns (only 10 of the 16 states are used).

DPD instead uses a denser mapping from decimal digits to bits for the digits of the mantissa, rather than a code like 8421 BCD, which wastes space. For the new code, we look for a positive power of two such that the ratio of the largest power of ten not exceeding it to that power of two is as close as possible to 1, in order to save space. At the same time, the power of two should be as small as possible, so that the granularity stays small and the groups are easy to fit into floating-point formats with few bits.

We use 10 bits (0 ~ 2¹⁰ − 1 = 1023) to represent 3 decimal digits (0 ~ 999).

(With BCD, 10 bits can hold only 2 full digits; 3 digits need 12 bits.)
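The choice of 10 bits can be checked numerically. The sketch below computes, for each group width n, the largest number of decimal digits d that fits and the ratio 10^d / 2^n; the helper name `packing_ratio` is made up for this example.

```python
from fractions import Fraction

def packing_ratio(n):
    """Return (d, ratio): d decimal digits fit in n bits, ratio = 10^d / 2^n."""
    d = len(str(2 ** n)) - 1            # largest d with 10**d <= 2**n
    return d, Fraction(10 ** d, 2 ** n)

for n in range(1, 17):
    d, ratio = packing_ratio(n)
    print(n, d, round(float(ratio), 3))

best = max(range(1, 17), key=lambda n: packing_ratio(n)[1])
print("best width up to 16 bits:", best)
```

Among widths up to 16 bits, n = 10 uses 1000 of 1024 states, while 4-bit BCD uses 10 of 16.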

How to represent digits with the DPD table?

In the DPD table, the left column holds the DPD-encoded values (each 10-bit group is called a declet), and the right column holds the three original decimal digits.

For example, let's describe row 3 of the table.

The letters a, b, c, g, h, i and f stand for the same bits on both sides (the original table highlights them with colored backgrounds). The only difference between the two sides is their relative positions.

The DPD-encoded bit sequence on the left, abcghf101i (10 bits), corresponds to three original digits that are each encoded in binary (BCD code): 0abc (d₂), 100f (d₁), 0ghi (d₀).

How to express this as a formula?

0abc₍₂₎ * 100₍₁₀₎ + 100f₍₂₎ * 10₍₁₀₎ + 0ghi₍₂₎ * 1₍₁₀₎

a.k.a.

[0b₉b₈b₇]₍₂₎ * 100₍₁₀₎ + [100b₄]₍₂₎ * 10₍₁₀₎ + [0b₆b₅b₀]₍₂₎ * 1₍₁₀₎

Here b₉ is the most significant bit of the declet and b₀ the least significant.
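The row-3 formula generalizes to the whole table. The following sketch decodes any declet. It follows the standard eight-row DPD decoding table as I understand it, and the function name `dpd_decode` is hypothetical.

```python
def dpd_decode(declet):
    """Decode one 10-bit DPD declet into an integer 0..999."""
    b = [(declet >> i) & 1 for i in range(10)]   # b[0] is the least significant bit
    hi = 4 * b[9] + 2 * b[8] + b[7]              # bits b9 b8 b7
    mid = 4 * b[6] + 2 * b[5] + b[4]             # bits b6 b5 b4
    lo = 4 * b[2] + 2 * b[1] + b[0]              # bits b2 b1 b0
    if b[3] == 0:                                # row 1: all digits 0..7
        d2, d1, d0 = hi, mid, lo
    elif (b[2], b[1]) == (0, 0):                 # row 2: d0 is 8 or 9
        d2, d1, d0 = hi, mid, 8 + b[0]
    elif (b[2], b[1]) == (0, 1):                 # row 3: d1 is 8 or 9
        d2, d1, d0 = hi, 8 + b[4], 4 * b[6] + 2 * b[5] + b[0]
    elif (b[2], b[1]) == (1, 0):                 # row 4: d2 is 8 or 9
        d2, d1, d0 = 8 + b[7], mid, 4 * b[9] + 2 * b[8] + b[0]
    elif (b[6], b[5]) == (0, 0):                 # row 5: d2 and d1 are 8 or 9
        d2, d1, d0 = 8 + b[7], 8 + b[4], 4 * b[9] + 2 * b[8] + b[0]
    elif (b[6], b[5]) == (0, 1):                 # row 6: d2 and d0 are 8 or 9
        d2, d1, d0 = 8 + b[7], 4 * b[9] + 2 * b[8] + b[4], 8 + b[0]
    elif (b[6], b[5]) == (1, 0):                 # row 7: d1 and d0 are 8 or 9
        d2, d1, d0 = hi, 8 + b[4], 8 + b[0]
    else:                                        # row 8: all digits are 8 or 9
        d2, d1, d0 = 8 + b[7], 8 + b[4], 8 + b[0]
    return 100 * d2 + 10 * d1 + d0

# Row 3 with abc = 001, gh = 01, f = 0, i = 0: digits 1, 8, 2
print(dpd_decode(0b0010101010))   # 182
```

Decoding all 1024 declets yields each value from 0 to 999; the 24 extra patterns are redundant encodings from row 8.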

4.6 Model

BID

s 00eeeeeeee   (0)ttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 01eeeeeeee   (0)ttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 10eeeeeeee   (0)ttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt

s 1100eeeeeeee (100)t tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 1101eeeeeeee (100)t tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 1110eeeeeeee (100)t tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt

Special Value.

s 11110 xx...x    ±infinity
s 11111 0x...x    qNaN
s 11111 1x...x    sNaN

DPD

s 00 TTT (00)eeeeeeee (0TTT)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 01 TTT (01)eeeeeeee (0TTT)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 10 TTT (10)eeeeeeee (0TTT)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]

s 1100 T (00)eeeeeeee (100T)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 1101 T (01)eeeeeeee (100T)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 1110 T (10)eeeeeeee (100T)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
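The DPD model above can be read back into a sign, exponent and coefficient. This is a sketch for decimal64 only, with hypothetical function names; it assumes the five declets are stored most significant first and rejects infinities and NaNs.

```python
def dpd_declet(declet):
    """Decode one 10-bit declet into 0..999 (standard DPD table)."""
    b = [(declet >> i) & 1 for i in range(10)]
    hi, mid = 4 * b[9] + 2 * b[8] + b[7], 4 * b[6] + 2 * b[5] + b[4]
    swapped = 4 * b[9] + 2 * b[8]                # b9 b8 reused as digit bits
    if b[3] == 0:
        digits = (hi, mid, 4 * b[2] + 2 * b[1] + b[0])
    elif (b[2], b[1]) == (0, 0):
        digits = (hi, mid, 8 + b[0])
    elif (b[2], b[1]) == (0, 1):
        digits = (hi, 8 + b[4], 4 * b[6] + 2 * b[5] + b[0])
    elif (b[2], b[1]) == (1, 0):
        digits = (8 + b[7], mid, swapped + b[0])
    elif (b[6], b[5]) == (0, 0):
        digits = (8 + b[7], 8 + b[4], swapped + b[0])
    elif (b[6], b[5]) == (0, 1):
        digits = (8 + b[7], swapped + b[4], 8 + b[0])
    elif (b[6], b[5]) == (1, 0):
        digits = (hi, 8 + b[4], 8 + b[0])
    else:
        digits = (8 + b[7], 8 + b[4], 8 + b[0])
    return 100 * digits[0] + 10 * digits[1] + digits[2]

def decode_dpd64(bits):
    """Return (sign, biased exponent, coefficient) of a finite DPD decimal64."""
    sign = bits >> 63
    comb = (bits >> 58) & 0x1F                   # 5-bit combination field
    cont = (bits >> 50) & 0xFF                   # 8-bit exponent continuation
    if comb >> 3 != 0b11:
        exp_msbs, lead = comb >> 3, comb & 0b111
    elif (comb >> 1) & 0b11 != 0b11:
        exp_msbs, lead = (comb >> 1) & 0b11, 8 + (comb & 1)
    else:
        raise ValueError("infinity or NaN")
    coefficient = lead
    for shift in (40, 30, 20, 10, 0):            # five declets, 3 digits each
        coefficient = coefficient * 1000 + dpd_declet((bits >> shift) & 0x3FF)
    return sign, (exp_msbs << 8) | cont, coefficient

print(decode_dpd64(0x2238000000000001))   # (0, 398, 1)
```

With bias 398, the pattern 0x2238000000000001 is the number 1 × 10⁰.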

Reference

decimal64 floating-point format - Wikipedia

Decimal floating point - Wikipedia
