Finite Arithmetic

This course is still very much in the rough!

Summary

This course belongs to the track Numerical Algorithms in the Department of Scientific Computing in the School of Computer Science.

In this course, students will learn how numbers are represented on modern-day computers, where the limits and problems of such numbers are and how they may be overcome.

Representation

On binary computers, whole numbers (Integers) are stored in bit arrays of fixed size. These arrays contain the binary representation of the decimal number.

The decimal number 37, for instance, can be converted to binary and stored in an 8-bit variable

37_{10} \rightarrow 100101_2 \rightarrow

The range of numbers that can be stored in n bits is 0 \dots 2^n-1.

This trivial scheme works rather well, since binary arrays are integer by nature. However, mathematics is usually not restricted to integer arithmetic, so a different representation has to be found for real-valued numbers, such as 0.1, \sqrt{2}, \pi and the like.

The most trivial approach for real-valued numbers would be to pack them into two integers, one representing the integer part, the other the fraction.

3.7_{10} \rightarrow a.b \rightarrow 11_2 . 10110011_2 \rightarrow

Notice that the fractional part is not 1101_2 = 7_{10}, yet 0.10110011_2 = 0.7_{10}. The kth digit to the right of the "." in binary is 2^{-k}

0.1_b 0.5_{10}
0.01_b 0.25_{10}
0.001_b 0.125_{10}
0.0001_b 0.0625_{10}
0.\underbrace{00\dots0}_{\times k}1 2^{-(k+1)}

Using such a representation -- generally called a fixed-point representation -- results in very simple operations (addition, multiplication, division, etc...) yet it's range would be limited from

2^{-B_b} \dots 2^{B_a}-2^{-B_b}

where B_a and B_b are the number of bits in the integer and fractional parts respectively. This is not a very wide range.

Furthermore, for large numbers, we are carrying around a lot of lower-order bits which we probably don't need since, as we will see later, we are usually interested in relative accuracy (i.e. the first few digits of a number) as opposed to the absolute accuracy of the last digits of the fractional part. Likewise, for small numbers of the order of 2^{-B_b}, we are only precise up to a few digits although we are carrying around a bunch of zeros we don't really need.

Instead of using two number additively, as in the example above, we could represent our real numbers as the product of a fixed precision number and a factor of 2

 a \times 2^b

This is equivalent to shifting a by b bits. The "." in a can therefore be set arbitrarily. for all practical purposes, we set it after the first digit.

3.7_{10} \rightarrow 1.110110011_2 \times 2^{1_b} = 11.10110011_2

This is the concept of modern floating-point numbers as described in IEEE 754. The range of numbers that can be represented is from 1 \times 2^{-2^{B_b}} to (2 - 2^{-(B_a-1)}) \times 2^{2^{B_b}}.

TODO: Show how sum, multiplication and division are computed.

Exercises


                 4+3+4+2=3 (mod 5)

IEEE 754 Floating Point Numbers

Show representation for single and double precision, sizes of mantissa and exponent

First bit of mantissa is always 1, normalized numbers, exponent is biased for easier arithmetic.

Explain different rounding modes.

Useful/Important Values

Concept of \varepsilon. Definition as \min_\varepsilon 1 + \varepsilon > 1 and spacing between numbers 1.0 and 2.0.

Realmin and Realmax

Show how these values can be computed in Matlab, C and Pascal

Note on problem with extra bits (90 bits) on Intel Pentiums.

Truncation

Basic explanation of the problem with an example, preferrably with a figure.

Explain how it can be avoided by sorting or with Kahan summation.

Cancellation

Give an example illustrating the problem.

No trick, problems need to be re-formulated to avoid subtracting numbers of similar magnitude.

Show some examples how this can be done.

Exercises


This article is issued from Wikiversity - version of the Monday, February 22, 2010. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.