Floating Point Arithmetic

CISC 310

12/1/2017

Prepared by: Yongbin Wang (student)
Prepared for: Dan Ross (professor)

Abstract

Floating-point arithmetic is a complex topic in computer systems. Almost every programming language has a floating-point data type. This paper explains the basic floating-point arithmetic that is implemented in software and hardware.

Introduction

Computing with floating-point numbers is not easy for computers. People started to use floating-point binary numbers in the 1950s. In the 1980s, the standard representation of floating-point numbers was defined by a standards committee formed by the Institute of Electrical and Electronics Engineers (IEEE). According to the IEEE, a binary floating-point number is represented as a bit string characterized by three components: a sign, a signed exponent, and a significand. This paper does not intend to present efficient algorithms for binary floating-point arithmetic, but rather the basic arithmetic operations.

Binary Floating-Point Numbers

Floating-point numbers have two basic format sizes: the 32-bit single format and the 64-bit double format. Numbers in single and double formats are composed of three fields: a 1-bit sign s, a biased exponent e = E + bias (8 bits in single format), and a fraction f = b1 b2 … b(p-1) (23 bits in single format). The fraction has an implicit leading 1 for the significand field. Figure 1 and Figure 2 show a 32-bit single-format number X and a 64-bit double-format number X, respectively.

Figure 1: Single Format

Figure 2: Double Format

The representation in IEEE 754 format is interpreted as X = (-1)^sign * 2^(exponent - bias) * 1.fraction. For example, 42.34375 in IEEE 754 32-bit binary floating-point format is 101010.01011, or 1.0101001011 x 2^5. The sign bit is 0 because the number is positive. The exponent bias of a single-precision floating-point number is 127, so E = e - 127 = 5, or e = 132 = 10000100 in binary. The mantissa is 01010010110000000000000.
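The field layout above can be checked directly. The following sketch (assuming Python's standard `struct` module, which can pack a value as a 32-bit IEEE 754 single) extracts the three fields of 42.34375:

```python
import struct

# Pack 42.34375 as an IEEE 754 single-precision (32-bit) float
# and pull out the three fields described above.
bits = struct.unpack(">I", struct.pack(">f", 42.34375))[0]

sign = bits >> 31                  # 1 bit
exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
fraction = bits & 0x7FFFFF         # 23 bits, implicit leading 1 not stored

print(sign)                        # 0 (positive)
print(format(exponent, "08b"))     # 10000100 (132 = 5 + 127)
print(format(fraction, "023b"))    # 01010010110000000000000
```

The printed fields match the sign, exponent, and mantissa worked out by hand above.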

Multiplication & Division

Let us use one example to show how to multiply two floating-point numbers. Let X = -9.0 and Y = 20.25; their binary representations are X = -1001.0 and Y = +10100.01, respectively. Their scientific notations are X = -1.001 x 2^3 and Y = +1.010001 x 2^4. These two floating-point numbers in IEEE 754 representation are X = 1 10000010 00100000… and Y = 0 10000011 01000100….

The following steps show how to perform a multiplication of two floating-point numbers.

1) We first find the mantissa of the product. A hidden one is restored in front of each 23-bit mantissa, which gives 10010000… and 10100010…. Multiplying the 24-bit mantissas gives a 48-bit product, 01.011011001000…, or 1.011011001 x 2^0. The most significant bit of the 48-bit product is zero here, so we shift the product left by one position to normalize it, and the exponent adjustment from the new mantissa is EM = 0 (had that bit been one, no shift would be needed and EM would be 1).

2) Secondly, we find the exponent of the product: EX + EY + EM - 127 = 10000010 + 10000011 + 0 + 10000001 (the 8-bit two's complement of 127) = 10000110, which is 134.

3) The sign of the result is SignX XOR SignY = 1 XOR 0 = 1.

4) Finally, the floating-point multiplication result is Z = 1 10000110 011011001…, or Z = (-1)^1 * 1.011011001 * 2^(134-127). If we convert this to decimal, Z = -182.25 = -9.0 * 20.25.
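The four steps above can be sketched in code. This is a minimal illustration, not a production multiplier: it handles normal single-precision numbers only, truncates instead of rounding, and the names `fields` and `fp_mul` are invented for this example:

```python
import struct

def fields(x):
    """Decompose a float into IEEE 754 single-precision fields."""
    b = struct.unpack(">I", struct.pack(">f", x))[0]
    return b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF

def fp_mul(x, y):
    """Multiply two floats by operating on their IEEE 754 fields
    (normal numbers only; rounding and overflow handling omitted)."""
    sx, ex, fx = fields(x)
    sy, ey, fy = fields(y)
    # Step 1: restore the hidden leading 1 to get 24-bit mantissas,
    # then multiply them into a 48-bit product.
    m = ((1 << 23) | fx) * ((1 << 23) | fy)
    # Step 2: add the biased exponents and subtract one bias.
    e = ex + ey - 127
    # Normalize: the product of two values in [1, 2) lies in [1, 4).
    if m >> 47:            # product >= 2.0: shift right, bump exponent
        m >>= 1
        e += 1
    # Step 3: the sign is the XOR of the signs.
    s = sx ^ sy
    # Step 4: reassemble, dropping the hidden bit and the low 23 bits.
    frac = (m >> 23) & 0x7FFFFF
    bits = (s << 31) | (e << 23) | frac
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(fp_mul(-9.0, 20.25))    # -182.25
```

Running this on the worked example reproduces Z = -182.25.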

Calculation of division in binary floating point is similar to multiplication. Instead of computing a/b, we can perform a x 1/b: the reciprocal of b becomes the multiplier. However, it is not always easy to get a perfectly precise reciprocal. For example, 1/3 = 0.33333…, but 3 x 0.33333… ≠ 1, so we need to be careful to round correctly. If we use IEEE 754 standard floating-point division, the sign bit of the result is still Sign_a XOR Sign_b. The exponent is found by E = (Ea - Eb) + bias, and we divide the mantissas of a and b. Moreover, if the divisor b is zero, the result is set to infinity. If a and b are both zero, the result is set to NaN.
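A division sketch mirroring the multiplier can show the same field-wise rules, including the two zero special cases. Again this is illustrative only (the name `fp_div` is invented; normal numbers plus the zero cases; truncating instead of rounding):

```python
import struct

def fp_div(x, y):
    """Divide two floats by operating on their IEEE 754 single-precision
    fields. A sketch: normal numbers and the zero special cases only."""
    bx = struct.unpack(">I", struct.pack(">f", x))[0]
    by = struct.unpack(">I", struct.pack(">f", y))[0]
    s = (bx >> 31) ^ (by >> 31)          # sign: XOR of the sign bits
    if y == 0.0:
        if x == 0.0:
            return float("nan")          # 0/0 -> NaN
        return float("-inf") if s else float("inf")  # x/0 -> infinity
    ex, ey = (bx >> 23) & 0xFF, (by >> 23) & 0xFF
    mx = (1 << 23) | (bx & 0x7FFFFF)     # mantissas with hidden 1
    my = (1 << 23) | (by & 0x7FFFFF)
    e = ex - ey + 127                    # subtract exponents, re-add bias
    m = (mx << 24) // my                 # divide mantissas (truncated)
    if m >> 24 == 0:                     # quotient < 1.0: renormalize
        m <<= 1
        e -= 1
    frac = (m >> 1) & 0x7FFFFF
    bits = (s << 31) | (e << 23) | frac
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(fp_div(-182.25, 20.25))            # -9.0
```

Dividing the multiplication result -182.25 by 20.25 recovers -9.0, and the zero cases return infinity and NaN as the standard requires.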

Addition & Subtraction

Unlike fixed-point arithmetic, where addition is simpler than multiplication, floating-point addition is more difficult than multiplication. Floating-point numbers can only be added if their exponents are the same. Let's try to add X = 1.25 and Y = 120.0625. These two floating-point numbers in IEEE 32-bit representation are X = 0 01111111 (1)0100000… and Y = 0 10000101 (1)1110000001…. The ones inside the parentheses are the hidden leading bits. We need to align the binary points, so the smaller exponent, 01111111 from X, is increased by 6 and X's mantissa is shifted right by 6, making the exponents equal. Thus,

    S   E          M
    0   1000 0101  (0)000 0010 1000 0000 0000 0000
  + 0   1000 0101  (1)111 0000 0010 0000 0000 0000
  ------------------------------------------------
    0   1000 0101  (1)111 0010 1010 0000 0000 0000

The result is 1.1110010101 x 2^6, which is 121.3125 = X + Y.
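The align-and-add procedure above can be sketched as follows (the name `fp_add` is invented; positive normal single-precision numbers only, with truncating rounding):

```python
import struct

def fp_add(x, y):
    """Add two positive floats by aligning their IEEE 754 single-precision
    fields. A sketch: positive normal numbers only, truncating rounding."""
    def fields(v):
        b = struct.unpack(">I", struct.pack(">f", v))[0]
        return (b >> 23) & 0xFF, (1 << 23) | (b & 0x7FFFFF)
    ex, mx = fields(x)
    ey, my = fields(y)
    # Align the binary points: shift the mantissa with the smaller
    # exponent right until the exponents match.
    if ex < ey:
        mx >>= ey - ex
        ex = ey
    else:
        my >>= ex - ey
    m = mx + my
    if m >> 24:            # carry out of the hidden-bit position:
        m >>= 1            # renormalize and bump the exponent
        ex += 1
    bits = (ex << 23) | (m & 0x7FFFFF)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(fp_add(1.25, 120.0625))   # 121.3125
```

The example reproduces the hand calculation: X's mantissa is shifted right by 6 before the mantissas are summed.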

Subtraction performs the same process to get its result. But we first need to convert the negative number to two's complement before the calculation. After the floating-point addition is finished, we convert the result back to sign-magnitude form.

Exceptions

Floating-point arithmetic involves rounding, overflow, and underflow in real hardware. If we want more accurate results, we need to use a more precise representation. According to the IEEE Standard for Binary Floating-Point Arithmetic, rounding takes a number regarded as infinitely precise and fits it into the destination format. The rounding modes may affect the signs of zero sums and the thresholds beyond which overflow and underflow are signaled. Division by zero should be signaled and give a result of infinity. Overflow should be signaled if the magnitude of the final floating-point result exceeds the largest finite number representable in the format. Underflow occurs if the result is smaller in magnitude than the smallest normal number of the format. These problems usually have small effects in addition and subtraction, but they can have a critical effect in multiplication and division. Programmers must clearly understand the operations and operands involved to deal effectively with floating-point arithmetic.
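These exceptional cases are easy to observe from any language with IEEE floats. A few examples in Python (whose `float` is an IEEE 754 double, so the thresholds differ from the 32-bit single format discussed above):

```python
# Overflow: exceeding the largest finite double rounds to infinity.
big = 1.0e308
print(big * 10.0)          # inf

# Underflow: results below the smallest subnormal flush to zero.
tiny = 5e-324              # smallest positive subnormal double
print(tiny / 2.0)          # 0.0

# Invalid operations produce NaN, which never compares equal to itself.
nan = float("inf") - float("inf")
print(nan == nan)          # False

# Rounding: the reciprocal of 3 is not exact, yet correct rounding
# makes the round trip come out exactly 1.0 here.
print(3.0 * (1.0 / 3.0) == 1.0)   # True
```

The last line echoes the 1/3 example from the division section: the stored reciprocal is inexact, but correctly rounded arithmetic still recovers 1.0.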

Conclusion

A good system should support both the single and double basic floating-point formats. Precise computation is an important topic in the computer science curriculum, and computer arithmetic is a large field. Moreover, a standard format for floating-point representation is very important. According to "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by David Goldberg, the increasing acceptance of the IEEE floating-point standard means that code that utilizes features of the standard is becoming ever more portable.