
Bitwise numbers

Two's complement

The most convenient and most common way to represent negative numbers on a computer is two's complement: to negate a number, invert all its bits and add one.

// equivalent to y = -x
y = ~x + 1;

For simplicity I illustrate with $8$ bit numbers:

For an $8$ bit number, $0x00..0x7F$ are the numbers $0..127$, and $0x81..0xFF$ are the numbers $-127..-1$. Note that $0x80$, which is $-128$, is its own two's complement, since $+128$ cannot be represented with $8$ bits. The same is true for any signed integer type: there is no positive counterpart to the smallest negative number. This causes (minor) problems when taking absolute values, as the absolute value of the smallest negative number will be negative, not positive. In other words, if abs() were defined on $8$ bit numbers, abs(-128) would compute to -128.

So why do we use two's complement? The main reason is that the logic for adding negative numbers is the same as the logic for adding positive numbers. Consider the example $$(-5)+7=2$$ We take the two's complement of $5$ and do the usual unsigned addition $$0b11111011+0b00000111=0b00000010$$ which is the correct answer (the carry out of the top bit is simply discarded).
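Here is a minimal sketch of this computation in C++ (the variable names are my own); the negation uses the two's complement trick from above, and the addition is ordinary unsigned addition that wraps modulo $256$:

#include <cstdint>
#include <cstdio>

int main() {
    std::uint8_t five = 5;
    std::uint8_t minusFive = ~five + 1;    // two's complement: 0b11111011
    std::uint8_t seven = 7;                // 0b00000111
    std::uint8_t sum = minusFive + seven;  // plain unsigned addition, wraps modulo 256
    std::printf("%d\n", sum);              // prints 2
    return 0;
}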

If you right shift a negative number, you will be padding with $1$'s instead of $0$'s on the left (at least on my compiler; this arithmetic shift is what most compilers do, but the behaviour is implementation-defined). So for a $16$ bit number,

-32768 >> 15 == -1;

evaluates to true.

Let us conclude by looking at an example of the computation of the absolute value of a signed integer using bitwise arithmetic. We assume our numbers are $16$ bits.

// computing abs(x) - an arithmetic right shift fills negative numbers with ones
m = x >> 15;
absx = (x ^ m) - m;

If x is non-negative, then m will be zero, and absx = (x ^ 0) - 0, which is x. If x is negative, then m will be 0xFFFF, which is $-1$. In other words, (x ^ m) - m = ~x + 1, which is the two's complement. (The parentheses are needed, since - binds tighter than ^ in C++.) So absx is now the absolute value of x, provided x is not the smallest possible negative number -32768.
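Putting the pieces together, here is a small sketch (the function name bitwiseAbs is my own) that computes the absolute value of a $16$ bit integer without branching:

#include <cstdint>
#include <cstdio>

// branch-free absolute value of a 16 bit integer
std::int16_t bitwiseAbs(std::int16_t x) {
    std::int16_t m = x >> 15;    // 0 if x >= 0, -1 (0xFFFF) if x < 0
    return (x ^ m) - m;          // x if m == 0, ~x + 1 if m == -1
}

int main() {
    std::printf("%d %d %d\n", bitwiseAbs(5), bitwiseAbs(-5), bitwiseAbs(-32768));
    // prints 5 5 -32768 -- the smallest negative number is "its own absolute value"
    return 0;
}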

Floating point numbers

In scientific and engineering applications we use floating point numbers to approximate the number line $\mathbb{R}$ on a computer. They are designed so that there are roughly the same number of representable values in each interval $(10^n, 10^{n+1})$ of the number line, with some restrictions on the size of $n$. Let's look at how the bits of the data type float can look in C++.

It often consists of $32$ bits $b_{31}b_{30}\cdots b_{0}$ giving us the binary number $$(-1)^{b_{31}}1.b_{22}b_{21}\cdots b_0 \cdot 2^{b_{30}b_{29}\cdots b_{23}-127}$$

We have $23$ bits to represent the fractional part of the number, which corresponds to about $7$ decimal digits of precision.

Let's look more carefully at the bit structure of a floating point number. In order to read off the bits we need to cast to an unsigned int:

float n = -1400;
unsigned int& i = *(unsigned int *) &n;

The sign can be read off as

unsigned int sign = (i & (1u << 31)) >> 31;

which is $1$ in this example. The fractional part can be read off as

unsigned int fraction = i & 0x7FFFFF;

In this case, we see the bits 0101111 (followed by zeros), which represents the binary number $1.0101111$. This makes sense, since $$1400 = 2^{10}+2^8+2^6+2^5+2^4+2^3$$ or $$1400 = (2^{0}+2^{-2}+2^{-4}+2^{-5}+2^{-6}+2^{-7})\cdot 2^{10}$$ Note that floats do not use two's complement to represent negative numbers and that the bit for $2^0$ is implicit.

The (biased) exponent can be found by

// the bit structure of the exponent
unsigned int exponent;
// clear the sign bit
exponent = i & (~(1u << 31));
// shift away the fraction bits
exponent = exponent >> 23;

In this case the exponent field is $137$; subtracting the bias $127$ gives the exponent $10$.
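The complete decomposition can be collected in one sketch. Here I copy the bits with std::memcpy instead of the pointer cast above (which sidesteps aliasing issues), assuming float and unsigned int are both $32$ bits:

#include <cstdio>
#include <cstring>

int main() {
    float n = -1400.0f;
    unsigned int i;
    std::memcpy(&i, &n, sizeof i);               // reinterpret the 32 bits of the float

    unsigned int sign     = (i >> 31) & 1;       // 1 bit
    unsigned int exponent = (i >> 23) & 0xFF;    // 8 bits, biased by 127
    unsigned int fraction = i & 0x7FFFFF;        // 23 bits, with an implicit leading 1

    std::printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
                sign, exponent, (int)exponent - 127, fraction);
    // prints sign=1 exponent=137 (unbiased 10) fraction=0x2F0000
    return 0;
}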

Some issues with floating point numbers

First of all, a NaN (Not a Number) is any float whose exponent field is 0xFF and whose fraction is nonzero (an exponent field of 0xFF with a zero fraction represents infinity). NaNs appear when we try to compute $\sqrt{-1}$ or anything else which is mathematically undefined. Allowing NaNs can slow down floating point calculations, and support for them can be turned off in the compiler.
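Using the bit fields from above, here is a hedged sketch of a NaN test (the function name isNaNBits is my own; in practice one would simply call std::isnan from <cmath>):

#include <cmath>
#include <cstring>

// true if f is a NaN: all ones in the exponent field and a nonzero fraction
bool isNaNBits(float f) {
    unsigned int i;
    std::memcpy(&i, &f, sizeof i);
    unsigned int exponent = (i >> 23) & 0xFF;
    unsigned int fraction = i & 0x7FFFFF;
    return exponent == 0xFF && fraction != 0;
}

// usage: isNaNBits(std::sqrt(-1.0f)) is true, and std::isnan agrees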

We look at an example illustrating the problem with lack of precision in floats. Say you earn one coin every decisecond, and you want to find out how much money you have after one year. You could do something silly like

float myFortune = 0;
float earnings = 1;
for(int i = 0; i < 10*60*60*24*365; i++) {
    myFortune += earnings;
}

After the computation myFortune is $16777216$ or about $16$ million. But a simple calculation by hand shows that myFortune should be $315360000$ or about $315$ million. The problem here is lack of precision: $16777216=2^{24}$, and a float carries only $24$ bits of precision, so once myFortune reaches this value, adding $1$ no longer changes it. After the loop has been running for a while, all that you are doing is

myFortune += 0.0;
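One can check this directly: once a float reaches $2^{24}$, adding one has no effect.

#include <cstdio>

int main() {
    float big = 16777216.0f;                   // 2^24
    std::printf("%d\n", big + 1.0f == big);    // prints 1: adding 1 no longer changes the value
    return 0;
}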

The example also shows the risks of using floating point numbers (float, double, ...) to represent money. A fixed point number would be better.
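As a hedged illustration of the fixed point idea (the variable names are my own and this is not a library facility), one can store money as an integer count of the smallest unit:

#include <cstdint>
#include <cstdio>

int main() {
    // fixed point: store the fortune as a 64 bit count of coins instead of a float
    std::int64_t myFortune = 0;
    std::int64_t earnings = 1;
    for (long i = 0; i < 10L * 60 * 60 * 24 * 365; i++) {
        myFortune += earnings;
    }
    std::printf("%lld\n", (long long)myFortune);    // prints 315360000, as expected
    return 0;
}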

Solving quadratic equations in C++

So adding a small number to a big number is problematic with floating point numbers. Subtracting numbers of roughly the same size also causes loss of precision. We look at how to resolve this issue in the familiar computation of the roots of a quadratic equation.

We all remember the formula $$x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}$$ for solving quadratic equations.

There are several potential problems when you use this formula with floating point arithmetic. Let's discuss one of the problems and how to improve the situation.

Consider the part of the formula $-b\pm\sqrt{b^2-4ac}$. When $-b$ and $\pm\sqrt{b^2-4ac}$ have opposite signs we can have serious loss of precision. More precisely, if the magnitudes of the two numbers agree in their first $k$ bits, we will lose $k$ bits of precision in the subtraction.

This can be avoided by first computing the root without cancellation, e.g. $x_1=\frac{-b-\sqrt{b^2-4ac}}{2a}$ when $b>0$, and then finding the other root from the product of the roots, $x_1x_2=c/a$, which gives $$x_2=\frac{2c}{-b-\sqrt{b^2-4ac}}.$$ When $b<0$, the roots are $$x_1=\frac{-b+\sqrt{b^2-4ac}}{2a}\text{ and }x_2=\frac{2c}{-b+\sqrt{b^2-4ac}}.$$
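Here is a sketch of a solver along these lines (the function name solveQuadratic is my own, and the degenerate cases $a=0$ and $b^2-4ac<0$ are not handled):

#include <cmath>
#include <cstdio>
#include <utility>

// roots of a*x^2 + b*x + c = 0, assuming a != 0 and b*b - 4*a*c >= 0
std::pair<double, double> solveQuadratic(double a, double b, double c) {
    double d = std::sqrt(b * b - 4.0 * a * c);
    // choose the sign that adds terms of the same sign, avoiding cancellation
    double q = (b >= 0.0) ? -b - d : -b + d;
    double x1 = q / (2.0 * a);
    double x2 = (2.0 * c) / q;    // the other root, from x1 * x2 = c / a
    return {x1, x2};
}

int main() {
    auto [x1, x2] = solveQuadratic(1.0, -1e8, 1.0);
    std::printf("%.17g %.17g\n", x1, x2);    // roughly 1e8 and 1e-8
    return 0;
}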

Computing the derivative

We have also seen the formula $$\lim_{h\rightarrow 0}\frac{f(x+h)-f(x)}{h}$$ whose difference quotient, for a small $h$, can be used as an approximation of the derivative. We might also have been told that smaller values of $h$ give better approximations. Unfortunately this is not the case when using floating point numbers. Too small values of $h$ lead to loss of precision in the difference $f(x+h)-f(x)$, while too large values of $h$ make the approximation itself inaccurate. The best value depends both on the function that you are trying to differentiate and on the size of the input value x.

As a guideline, values around $$h=2^{-12}\cdot x$$ are a good place to start looking for the optimal choice of $h$ when working with the datatype float.
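Here is a sketch of the forward difference with that choice of $h$ (the function names are my own, and I assume $x$ is not too close to zero, where a small fixed $h$ would be needed instead):

#include <cmath>
#include <cstdio>

// forward difference approximation of f'(x) with h proportional to x
float derivative(float (*f)(float), float x) {
    float h = std::ldexp(x, -12);    // h = x * 2^-12, an exact scaling by a power of two
    return (f(x + h) - f(x)) / h;
}

float square(float x) { return x * x; }

int main() {
    std::printf("%g\n", derivative(square, 3.0f));    // roughly 6, the derivative of x^2 at 3
    return 0;
}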

Floating point conditionals

Consider the following code in C++

for(float i = 100.0f; i != 0.0f; i -= 0.01) { }

We would expect the loop to run $100/0.01=10000$ times. But it does not; it runs forever, since i never becomes exactly $0.0$ (the value $0.01$ is not exactly representable in binary). In general it is a bad idea to base conditionals on equality of floats. The code

for(float i = 100.0f; std::fabs(i) > epsilon; i -= 0.01) { }

would be better, where epsilon is a small positive number which approximates zero. The size of epsilon depends on the context. Note however that

for(float i = 100.0f; i != 0.0f; i -= 0.015625) { }

does stop. The example is well behaved since $0.015625=2^{-6}$ is a power of two, so every intermediate value of i is exactly representable and we do not lose precision as we decrease i. Note also that, due to the presence of NaNs, a conditional like

float x;
// working with x
if(x == x) {
}

does not always evaluate to true: if x is a NaN, then x == x is false. Also,

float x;
float y;
// working with x and y
if(x + y == y + x) {
}

can evaluate to false, due to NaN issues (both sides become NaN, and NaN compares unequal to everything, including itself). Finally,

float x;
float y;
float z;
// working with x,y and z
if(x + (y + z) == (x + y) + z) {
}

can evaluate to false even without NaNs, since floating point addition is not associative.
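A concrete instance of the last point, with no NaNs involved:

#include <cstdio>

int main() {
    float x = 1.0e20f, y = -1.0e20f, z = 1.0f;
    std::printf("%g %g\n", (x + y) + z, x + (y + z));
    // prints 1 and 0: x + y is exactly 0, but y + z rounds back to y
    return 0;
}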
