prepared by: Yongbin Wang prepared for: Dan Ross
Floating-point arithmetic is a complex subject in computer systems. Almost every
programming language has a floating-point data type. This paper explains the basic
floating-point arithmetic that is implemented in software and hardware.
Arithmetic on floating-point numbers is not easy for computers. People started to
use binary floating-point numbers in the 1950s. In the 1980s, a standard
representation of floating-point numbers was established by a standards committee
formed by the Institute of Electrical and Electronics Engineers (IEEE). According
to the IEEE, a binary floating-point number is represented as a bit string
characterized by three components: a sign, a signed exponent, and a significand.
This paper does not intend to present good and efficient algorithm designs for
binary floating-point arithmetic, only the basic arithmetic operations.
IEEE 754 has two basic formats: the 32-bit single format and the 64-bit double
format. Numbers in both formats are composed of three fields: a 1-bit sign s, a
biased exponent e = E + bias (8 bits in single, 11 bits in double), and a fraction
f = b1 b2 … bp-1 (23 bits in single, 52 bits in double). The fraction has an
implicit leading 1 in the most significant position of the significand. Figure 1
and Figure 2 show a 32-bit single format number and a 64-bit double format number.
Figure 1: Single
Figure 2: Double
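The field layout described above can be illustrated with a minimal sketch. It assumes Python's standard struct module is used to obtain the stored bit pattern; the helper names split_single and split_double are hypothetical and not part of the original paper.

```python
import struct

def split_single(x):
    """Return (sign, biased exponent, fraction) of x stored in single format."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # 1 sign bit, 8 exponent bits, 23 fraction bits
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def split_double(x):
    """Return (sign, biased exponent, fraction) of x stored in double format."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    # 1 sign bit, 11 exponent bits, 52 fraction bits
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

for name, fields in (("single", split_single(1.0)), ("double", split_double(1.0))):
    print(name, fields)
```

For 1.0 the biased exponent equals the bias itself, which is 127 in single format and 1023 in double format.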
A number's representation in IEEE 754 format is interpreted as:
X = (-1)^sign x 2^(exponent - bias) x 1.fraction.
For example, 42.34375 in binary is 101010.01011, or 1.0101001011 x 2^5. The sign
bit is 0 because the number is positive. The exponent bias of a single-format
floating-point number is 127, so (e - 127) = 5, or e = 132 = 10000100 in binary.
The mantissa field holds the fraction bits 0101001011, padded with zeros to 23 bits.
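As a quick check of the interpretation formula, the following sketch evaluates it for the 42.34375 example. The function name single_value and the choice to pass the fraction as a bit string are assumptions made for illustration; only normalized numbers are handled.

```python
def single_value(sign, e, fraction_bits):
    """Evaluate (-1)^sign * 2^(e - 127) * 1.fraction for a normalized single."""
    # Pad the fraction on the right with zeros to the full 23 bits.
    fraction = int(fraction_bits.ljust(23, "0"), 2)
    significand = 1 + fraction / 2**23      # restore the implicit leading 1
    return (-1) ** sign * 2.0 ** (e - 127) * significand

print(single_value(0, 0b10000100, "0101001011"))  # 42.34375
```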
Let me use one example to show how to multiply two floating-point numbers. Let
X = -9.0 and Y = 20.25; their binary representations are X = -1001.0 and
Y = +10100.01, respectively. In scientific notation, X = -1.001 x 2^3 and
Y = +1.010001 x 2^4. These two floating-point numbers in IEEE 754 representation
are X = 1 10000010 00100000… and Y = 0 10000011 01000100….
The following steps show how to perform a multiplication of two floating-point
numbers.
1) We first find the mantissa of the product. The hidden one is restored as the
most significant bit of each 23-bit mantissa, which gives the 24-bit values
10010000… and 10100010…. Multiplying the two 24-bit mantissas gives a 48-bit
product with two bits before the binary point, here 01.011011001000…, or
1.011011001 x 2^0. Because the most significant bit is zero, the product is
already in the form 1.f and only needs a one-bit left shift to drop the leading
zero; had it been one, we would shift right by one and add one to the exponent.
The exponent contributed by the new mantissa is therefore zero, EM = 0.
2) Secondly, we find the exponent of the product. The biased exponent is
EX + EY + EM - 127. Adding -127 as the 8-bit two's complement value 10000001 and
discarding the carry out of the eighth bit gives
10000010 + 10000011 + 0 + 10000001 = 10000110, which is 134.
3) The sign of the result is SignX XOR SignY = 1 XOR 0 = 1.
4) Finally, the floating-point multiplication result is
Z = 1 10000110 011011001…, or Z = (-1)^1 x 1.011011001 x 2^(134-127).
Converted to decimal, Z = -182.25 = -9.0 x 20.25.
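The four steps can be sketched in code as follows. This is a minimal illustration rather than a full implementation: it assumes both inputs are normalized, ignores overflow, underflow, zeros, infinities and NaN, and truncates the product instead of rounding. The names to_bits, from_bits and multiply_single are hypothetical helpers.

```python
import struct

BIAS = 127

def to_bits(x):
    """Bit pattern of x stored in single format, as an integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def from_bits(bits):
    """Value of a 32-bit single-format bit pattern."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

def multiply_single(x_bits, y_bits):
    """Multiply two normalized singles; truncates instead of rounding."""
    sign = (x_bits >> 31) ^ (y_bits >> 31)        # step 3: XOR of the signs
    ex = (x_bits >> 23) & 0xFF
    ey = (y_bits >> 23) & 0xFF
    mx = (x_bits & 0x7FFFFF) | 0x800000           # restore the hidden one
    my = (y_bits & 0x7FFFFF) | 0x800000
    product = mx * my                             # step 1: 48-bit product
    em = product >> 47                            # 1 only if the product is >= 2.0
    fraction = (product >> (23 + em)) & 0x7FFFFF  # drop the leading 1, keep 23 bits
    exponent = ex + ey + em - BIAS                # step 2: biased exponent
    return (sign << 31) | (exponent << 23) | fraction  # step 4: pack the fields

z = multiply_single(to_bits(-9.0), to_bits(20.25))
print(format(z, "032b"), from_bits(z))            # the value printed last is -182.25
```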
Calculation of division in binary floating point is similar to multiplication.
Instead of computing a/b directly, we can compute a x 1/b: the reciprocal of b is
taken, and it becomes the multiplier. However, it is not always possible to
represent the reciprocal exactly. For example, 1/3 = 0.33333…, but with any finite
number of digits 3 x 0.33333 ≠ 1. So we need to take care to round correctly. In
IEEE 754 floating-point division, the sign bit of the result is still
Sign a XOR Sign b. The exponent is found by E = (Ea - Eb) + bias, and the mantissa
of a is divided by the mantissa of b. Moreover, if the divisor b is zero and a is
not, the result is set to infinity. If a and b are both zero, the result is set to
NaN.
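The division rules above can be sketched as follows. Note the assumptions: this sketch divides the mantissas directly with integer division rather than multiplying by a reciprocal, it truncates instead of rounding, and it handles only zeros and normalized inputs. The helper names are hypothetical.

```python
import math
import struct

BIAS = 127

def to_bits(x):
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def from_bits(bits):
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

def divide_single(a_bits, b_bits):
    """Divide two singles (zeros or normalized); truncates instead of rounding."""
    sign = (a_bits >> 31) ^ (b_bits >> 31)       # sign is a XOR b
    a_zero = (a_bits & 0x7FFFFFFF) == 0
    b_zero = (b_bits & 0x7FFFFFFF) == 0
    if b_zero:
        if a_zero:
            return 0x7FC00000                    # 0/0: a quiet NaN bit pattern
        return (sign << 31) | 0x7F800000         # a/0: signed infinity
    if a_zero:
        return sign << 31                        # 0/b: signed zero
    ea = (a_bits >> 23) & 0xFF
    eb = (b_bits >> 23) & 0xFF
    ma = (a_bits & 0x7FFFFF) | 0x800000          # restore the hidden ones
    mb = (b_bits & 0x7FFFFF) | 0x800000
    q = (ma << 25) // mb                         # mantissa ratio scaled by 2^25
    exponent = ea - eb + BIAS
    if q >> 25:                                  # ratio in [1, 2): already normalized
        fraction = (q >> 2) & 0x7FFFFF
    else:                                        # ratio in (0.5, 1): shift left by one
        exponent -= 1
        fraction = (q >> 1) & 0x7FFFFF
    return (sign << 31) | (exponent << 23) | fraction

print(from_bits(divide_single(to_bits(-182.25), to_bits(20.25))))   # -9.0
print(hex(divide_single(to_bits(1.0), to_bits(3.0))), hex(to_bits(1.0 / 3.0)))
```

The second line prints two bit patterns that differ in the last bit: the truncated quotient and the correctly rounded single-format value of 1/3. That gap is the rounding concern raised in the text.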
Addition & Subtraction
Although fixed-point addition is simpler than fixed-point multiplication,
floating-point addition is more difficult than floating-point multiplication.
Floating-point numbers can only be added if their exponents are the same. Let’s
try to add X = 1.25 and Y = 120.0625. These two floating-point numbers in IEEE
32-bit representation are X = 0 01111111 (1)010000… and
Y = 0 10000101 (1)11100000010…. The bit inside the parentheses is the hidden
leading bit, which becomes 0 once a mantissa has been shifted right. We need to
align the binary points, so the smaller exponent, which is 01111111 from X, is
increased by 6 and its mantissa is shifted right by 6, so that the exponents are
equal. Thus,
        S  E          M
X       0  1000 0101  (0)000 0010 1000 0000 0000 0000
Y       0  1000 0101  (1)111 0000 0010 0000 0000 0000
X + Y   0  1000 0101  (1)111 0010 1010 0000 0000 0000
The result is 1.1110010101 x 2^6, which is 121.3125 = X + Y.
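The alignment and addition steps can be sketched as follows. This is a simplified illustration that assumes both operands are positive and normalized, and it truncates the bits shifted out during alignment instead of rounding them. The helper names are hypothetical.

```python
import struct

def to_bits(x):
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def from_bits(bits):
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

def add_single(x_bits, y_bits):
    """Add two positive normalized singles; truncates instead of rounding."""
    ex = (x_bits >> 23) & 0xFF
    ey = (y_bits >> 23) & 0xFF
    mx = (x_bits & 0x7FFFFF) | 0x800000    # restore the hidden ones
    my = (y_bits & 0x7FFFFF) | 0x800000
    if ex < ey:                            # make x the operand with the larger exponent
        ex, ey, mx, my = ey, ex, my, mx
    my >>= ex - ey                         # align: shift the smaller operand right
    total = mx + my
    if total >> 24:                        # carry out: renormalize to 1.f
        total >>= 1
        ex += 1
    return (ex << 23) | (total & 0x7FFFFF)

z = add_single(to_bits(1.25), to_bits(120.0625))
print(format(z, "032b"), from_bits(z))     # the value printed last is 121.3125
```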
Subtraction performs the same process to get the result, but we first need to
convert the negative number to two’s complement before the calculation. After the
floating-point addition is finished, we convert the result back to
sign-magnitude form.
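A sketch of this signed addition is shown below. The 32-bit working width for the two's complement sum is an assumption chosen for illustration, as are the helper names. Inputs are assumed to be normalized, and bits shifted out during alignment are truncated.

```python
import struct

WIDTH = 32                      # assumed working width for the two's complement sum
MASK = (1 << WIDTH) - 1

def to_bits(x):
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def from_bits(bits):
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

def add_signed(x_bits, y_bits):
    """Add two normalized singles of either sign; truncates instead of rounding."""
    sx, sy = x_bits >> 31, y_bits >> 31
    ex = (x_bits >> 23) & 0xFF
    ey = (y_bits >> 23) & 0xFF
    mx = (x_bits & 0x7FFFFF) | 0x800000
    my = (y_bits & 0x7FFFFF) | 0x800000
    if ex < ey:                                  # align to the larger exponent
        mx >>= ey - ex
        e = ey
    else:
        my >>= ex - ey
        e = ex
    tx = (-mx if sx else mx) & MASK              # negative operands in two's complement
    ty = (-my if sy else my) & MASK
    total = (tx + ty) & MASK
    sign = total >> (WIDTH - 1)
    magnitude = (-total) & MASK if sign else total   # back to sign-magnitude
    if magnitude == 0:
        return 0
    while magnitude >> 24:                       # carry out: shift right
        magnitude >>= 1
        e += 1
    while not (magnitude >> 23):                 # cancellation: shift left
        magnitude <<= 1
        e -= 1
    return (sign << 31) | (e << 23) | (magnitude & 0x7FFFFF)

print(from_bits(add_signed(to_bits(120.0625), to_bits(-1.25))))   # 118.8125
```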
Floating-point arithmetic involves rounding, overflow, and underflow in real
hardware. If we want more accurate results, we need to use a more precise
representation. According to the IEEE Standard for Binary Floating-Point
Arithmetic, rounding takes a number regarded as infinitely precise and, if
necessary, modifies it to fit in the destination's format. The rounding modes may
affect the signs of zero sums and the thresholds beyond which overflow and
underflow may be signaled. Division by zero should be signaled and give an
infinite result. Overflow should be signaled if the magnitude of the final
floating-point result is larger than the largest finite number of the destination
format. Underflow occurs if the nonzero result of an operation is smaller in
magnitude than the smallest normalized number of the format. All these problems
have small effects in addition and subtraction, but they have a critical effect in
multiplication and division. Programmers must clearly understand what the
operation and operands are to deal effectively with these problems.
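These effects are easy to observe in practice. The sketch below assumes a platform where Python's float is an IEEE 754 double, which is the common case. Note that Python itself raises ZeroDivisionError for division by zero rather than returning infinity, so that case is not shown.

```python
import math
import sys

largest = sys.float_info.max       # largest finite double
smallest = 5e-324                  # smallest positive subnormal double

rounding_visible = (0.1 + 0.2) != 0.3   # neither operand nor the sum is exact
overflowed = largest * 2                # exceeds the largest finite number
underflowed = smallest / 2              # too small to represent, rounds to zero

print(rounding_visible)   # True
print(overflowed)         # inf
print(underflowed)        # 0.0
```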
A good system should support both the single and double basic floating-point
formats. Precise computation is an important topic in the computer science
curriculum, and computer arithmetic is a large field. Moreover, a standard format
for floating-point representation is very important. According to “What Every
Computer Scientist Should Know About Floating-Point Arithmetic” by David Goldberg,
the increasing acceptance of the IEEE floating-point standard means that codes
that utilize features of the standard are becoming ever more portable.