Fixed Point Arithmetic and Tricks

Gabriel Ivancescu, 2007-12-21Revision: 1.0

I see many people get confused at fixed point and believe it's some kind of "hack" because they think floating point are "real" numbers. Real numbers are abstract mathematical representation – computers are not abstract, they do not employ normal mathematics (for example, infinite numbers). It only tries to "emulate" the abstract math (bits being volts, holes, magnetic charges, etc). People who use floating point to represent fractional parts definitely don't know what float is.

This article explains fixed point arithmetic, how it differs from floating point and some "general-purpose" operations to get you started. It is by no means a comprehensive guide – fixed point has very many tricks and I cannot simply explain them all in one article. I do show three examples, however.

Fixed point is relatively easy, which also means quite fast. There are certain advantages and disadvantages over floating point, though I'll only present them briefly here.

The Difference: Float vs. Fixed

In floating point, there is a much higher dynamic range ("float" has 24 bits of accuracy, but can have values up to 2ˆ127). If you use a "general-purpose" format, then the loss of precision in floating point will most probably be much smaller than of a general-purpose fixed point format.

However, depending on your implementation, fixed point can have a much greater precision, but less accuracy for greater dynamic ranges (notice the difference between precision and accuracy). Since I don't work with very big or very small numbers (where floats can just "shift the number" to the left/right with the help of the exponent, even though this yields a very high loss of precision in the number, computations are still done correctly on 24 bits unless you want to cast it into an integer), I prefer fixed point for most of my programs. Too bad there isn't a fixed point capability in hardware; probably because it can be "emulated" much faster than floating point, and it is not as "general-purpose" as float.

Floating Point vs. Real Numbers

Just remember that many people confuse "floating point" with "real" numbers – hence the ability to represent fractional parts. By all means, that is a capability of floating point, but they are not mainly for that purpose – floats are used to represent a large dynamic range by the means of an exponent and a quite good accuracy by the use of a mantissa (24 bits), but when you cast it into an integer or start to use it's value (instead of just "operating" on it), the precision loss will be huge with large exponents (you only have 24 bits of accuracy, and can have up to 2ˆ127 values, so imagine those lots of 0s on the trail if the exponent gets larger than 24…).

Fractions are simply "small exponents" in floats. In truth, float does not distinguish between fractions and integers – it only recognizes a mantissa and an exponent (how much to shift the number left/right). Imagine having a number (mantissa) and moving the point to the left or right (adding 0s as necessary). In fact, if you have a 24 bit mantissa, then exponents larger than +24 will be unable to represent fractional parts anymore, though computations are still done correctly on 24 bits of accuracy. Even larger exponents will be further unable to represent all the integer values (i. e., they skip over them). But you still operate correctly on 24 bits of accuracy unless you start to use the value (like some index) or begin to actually use it's definition instead of just using it for calculations. Casting into an integer gives the same loss in precision.

Another property of floating point arithmetic is that smaller numbers get better precision, because it will use more of the mantissa's bits to represent fractions. But no matter what, you always get 24 bits of accuracy. Of course I'm only referring to 32-bit float format here, other formats such as 64-bit have more mantissa bits but eat more memory.

On the other hand, fixed point does not use exponents. It simply stores the fractional part in a specific number of bits, which you define as you desire. For example 8.16 means 8 bits for integer part, and 16 for fractional.

Fixed Point vs. Integer Numbers

The good thing about fixed point is that it's actually an integer (i. e., good ol' byte, word, dword… whatever), only that you "define" it to be the way you want. You use the same "encoders" as an integer since it is an integer, and you use the same operations (i. e., in x86 you use mul, not fmul) as with standard integers. You can leave it like that, or apply some additional operations to make it the way you want. Remember, the implementation gives life to fixed point. The actual difference between a normal integer and fixed point is the "implementation" and how you use these instructions – for example, putting a multiply and then a shift actually yields the fixed point multiply. The real neat thing about fixed point is that they are designed and implemented exactly for their special purpose (for example you can use all the bits for the fractional part if you know your number will always be between 0 and 1), or a lot of other tricks like these. In fixed point, the design is the most important key – otherwise they are just integers.

It's important to analyze the way stuff works before implementing fixed point variables. For example, in a scenario with RGB colors being scalars between 0 and 1, you realize they are unsigned. Ok, you save 1 bit. Next, you realize you only have fractional parts, no integer parts – thus in floats the exponent is always ≤ 0, so floats will "waste" bits here which could've been used for mantissa precision. In fixed point, you can allocate all the bits you want to the fractional part – no sign, no integer part. This is specific for the implementation; like I said, the implementation brings fixed point to life.

It can be observed from the above notes that floats are not "real" numbers, nor are fixed point numbers. As such, floating point differs from fixed point, they don't "replace" one another since it depends how you use them and what you wish to accomplish – using floats just for representing fractional numbers is not a strong reason to start to use them. You should understand the strengths and weaknesses of both float and fixed representations to actually make a choice. Whoever says floats are for "real" numbers does not know what floats are, obviously. If using exponents is a better solution to your problem (don't think of "real" numbers when using float, instead think of the exponent and mantissa), then by all means floating point is the answer. However when I need to represent fractional parts I usually go for fixed point numbers as they are much more "special-purpose" than floats.

If you find out some faster solutions than what I present here, please let me know and I'll add them. Also please remember that the "operators" I present here are just "general-purpose", so don't take them as the final touch, as they're not even optimal for specific design. It's only to get you started, fixed point hides many tricks, just see for yourself. Testing and analyzing is the key to success here (also a bit of math might help).

Notes for The Fractional Part

In fixed point the fractional part has a fixed number of bits. Also remember that it does not represent values the same as a standard integer would, in decimal. Without scaring you off with "negative" powers I'll just present you an example and then explain how you can get a fractional representation from a decimal one. Let's assume a 8.8 fixed point type:


means "two and half", and is encoded as

   '2'     'half'

So don't be tempted to write

00000010.00000101               <- INCORRECT!

To simply find the fractional part taken from a decimal number (i. e., 2.5) in a calculator or anything else, take 1 and shift it left by the number of fractional bits. In our example, 1<<8 means 256. Next multiply this by the fractional value you wish to find out (0.5 in our case), and voila. Note though that if the result yields a fractional part, it means the respective value cannot be encoded in those number of bits, hence we have a "precision loss". Try increasing the number of fractional bits and see if it reduces the loss. For such "approximated" results I suggest you round it up as it's the best solution. Don't be scared of rounding because such small precision loss won't be too bad.

Basic Arithmetic Operations


It's like normal addition: i. e.,

r = x + y

x, y being fixed point variables of the same format.


Same as normal subtraction: i. e.,

r = x - y

Again, x and y should have the same format (if they don't, try converting).


This one is a bit more involved, but quite easy.

r = x * y

However, x and y don't need to have the same format. The result will have a format containing the sum of both integer part's bits and fractional part's bits in x and y. For example if x is of the format 4.3 (4 bits for integer and 3 for fractional) and y is of the format 5.7, then the result will be of the format 9.10 (9 bits for integer part and 10 for fractional part).

Now we gotta think logically – if we, for example, have two numbers x and y in 8.8 format, then multiplying them will yield a 16.16 result. So, if we want a 8.8 result, we need to shift it right 8 bits. For the integer part, just ignore the upper bits, or do the same as if it overflowed (since you had a 16.16 format and now you want 8.8). Here's an example:

// multiply fixed point number x with y, both are 8.8 format
// we want the result to be of 8.8 format too, so we need to shift right by 8
r = (x * y) >> 8

However, when we do the shifting right, we actually "truncate" the lower bits, meaning we have a precision loss. To get better results we could use rounding, unless you really don't care that much about precision, since truncation is faster :)

Here's the rounding upwards method for 8.8 formats, of course:

temp = (x * y)
r = temp + ((temp & 1<<(8-1))<<1)

This is a general-purpose method. I believe there are faster ways of doing this even in C. In x86 assembly this can be written a lot faster (note that I assume we have a 8.8 fractional number which cannot overflow a 32-bit register; otherwise you'd have to use the edx register too)

 ; eax = r
 ; ebx = x
 ; ecx = y
 mov eax, ebx
 mul ecx
 shr eax, 8
 adc eax, 0

The last instruction does the "magic" job of rounding. This is cute since we only need a single extra instruction for rounding, compared to C's 3 operations.

Furthermore, good tricks like multiplying by a power of 2 still apply. Let's see with the rules (note however that these "rules" are abstract, they do not exist actually, it's just to get you started) – we have a fixed point number of the format (let's say) 8.8, again… Now we want to multiply it by 4. When we shift it left, we multiply it with an integer: 4. This integer has 0 bits for the fractional part, thus the result will have the same number of fractional bits as the number we are multiplying by 4.

// the following code multiplies fixed point x of format 8.8 by 4
// result will be of format 8.8 (at least the fractional format will remain unchanged)
r = x << 2

// the following code multiplies x by "integer" y
// result has same format as x, since y has 0 bits for fractional part
r = x * y

As you would probably guess, multiplication is more flexible than addition since it can multiply all sorts of formats together. This is not actually true. In addition and subtraction, you need to "convert" the format before you do the operation. In multiplication/division, you need to convert the result (i. e., afterwards) to the appropriate format you wish if it doesn't fit directly by the multiply. As explained above, the getting-started rule is easy: the result will contain the sum of the bits in the operands' formats.

Of course these "formats" are just imaginary – truth be told nothing stops you to limit yourself to these "rules". How I actually understood these things was by looking and analyzing the numbers. The math behind this is obvious (shifting left is just a form of multiplying by powers of 2, but much faster). Let's see, if we have a number with "A" fractional bits, then the value "1" will be encoded as 1<<A. So, when we multiply a number x by this "fixed point" number, we need to arrive at the same base number x. However, multiplying x with 1<<A will shift x left by A bits. What will we do so our x will remain unchanged? I'll leave it to you as an exercise. Don't be limited to these rules or the formats – and try to think in binary. I see a lot of people getting confused by fixed point because they think in decimal.


Division doesn't differ much from multiplication, it's actually the opposite. Of course a simple basic rule exists to get you started, but it is not comprehensive. You can do all kind of things with fixed point numbers.

Here's the rule: The result's format will contain the difference of the bits in the operands' format. Thus if you have a number x in 16.16 format and divide this with a number y in 7.10 format, then the result's format will be 9.6 format (the difference between 16.16 and 7.10).

Also because it's a difference, it can happen that negative values occur at the result's format. For example, if we divide x of format 10.10 with y of format 6.16, then the result will have 4.-6 (that's right, the fractional part has -6 bits… oh). That's a bad thing since it will start to "eat" from the integer part. Be careful how you order these operands.

With the above info in mind, let's look at two examples. Again, we assume both operands are in 8.8 format for simplicity. The first example completely truncates the fractional part, hence the result will be an integer. The second one first shifts left the first operand, modifying it's format (actually it isn't based on the rules, but it's a trick… like I said at the last paragraph in multiplication, the rules are not "rules" you must follow, instead they are simple to get you started). After that it divides.

// first example: divides x by y, truncating the fractional value
//                because the result is an integer with 0 bits for the fractional part
r = x / y

// second example: shifts and then divides
//                 this trick keeps the same format in the result as x or y
//                 of course x and y are in 8.8 format in this example
//                 so the result will be in 8.8 format too
r = (x << 8) / y

How does the second trick work? Let's analyze the number (without the rules). When we divide a number x by 1 we need to arrive at the same number x. Of course, 1 is expressed as 1<<8 (or 256) in y because it has 8 bits for the fractional part. This means that when we divide a number x with a fixed point number y with the value 1, we will arrive at the same number x, but shifted right by 8!

To get around this, we first shift x left by 8, so the shifts will cancel out one another. Thus now when we divide x<<8 by 1 in 8.8 format (meaning 1<<8, in fact) we arrive at x which is the desired result in this example. I hope it's clear. The reason I want you to understand this is because fixed point is never understood only by example – you need to understand the inner workings to design it efficiently and make use of all sorts of tricks. Again, the "." is imaginary, just as is fixed point – you only have 1s and 0s to trouble with.

Tips & Tricks

The real beauty of fixed point math is in the formats, because you design them for your needs. You are not limited to only 3 formats like in floats – you can have signed and unsigned fixed point numbers, different fractional number of bits across numbers, etc. Of course all of these depend on the implementation you wish. I haven't discussed signed arithmetic on fixed point numbers, but it's easy to experiment yourself to see the stuff.

You might be thinking that integers can't really have a high dynamic range. Well it depends: if you know the number's range, say it's between xˆ74 and xˆ77, then you can simply use this information without needing exponents (use shifting). On the other hand, if this "range" varies here, consider floating point. Note that if you know the range of your variables, and they have different "ranges", it's possible to use integers to represent the same thing (think of the integer being the mantissa, since the exponent is known at this code implementation, you don't need to store it). For example, given our variable being between xˆ74 and xˆ77, you only need 77-74=3 extra bits to account for this range, without any exponent at all. You can then use shifts and other operators to make it work for other "known" ranges. This is, obviously, "special-purpose", unlike the more "general-purpose" of the floats.

Also remember that fractional parts are useless in addition and subtraction (i. e., same operations as normal integers). The only part of the fractional parts that's important in multiplication/division are the numbers between 0 and 1 (e. g., 0.5, 0.2). When you design a fractional format, think of multiplication by 1, which must result in the same number. Then think of "dividing" the number when you multiply with numbers "under" 1.

At last we come to some examples, I know you've been waiting for this. Be sure to understand the stuff I explained until now before proceeding. But who am I to stop you? :)

Example: Multiplying an Integer by the Reciprocal of a Constant

Haven't you always wanted to multiply integers by the reciprocal of a constant instead of dividing them? Let's assume you want to divide x by 5, and don't want to use divide since you know it's a constant, why not multiply by 0.2? In truth, you are using integers so it might not be possible, but fixed point comes to mind. This is actually a trick, represent the number 0.2 as a fixed point type of (for example) 32 bits, all of them belonging to the fractional part!

Remember the tip at the beginning of this article (the one about finding the fraction from a decimal representation)? Let's do it now :)

We have 32 bits representing our fractional part, so let's start: 1<<32 equals 4294967296. Now multiply this in a calculator or something with 0.2, and the result will be the fraction we'll use:


Oops, we got a fractional part in our result, which means that we cannot represent 0.2 with 32 bits correctly, though. No matter, we don't care about such a small precision loss right now.

Now if we multiply our x which is an integer (32.0 fixed point format :), with our "0.2" in 0.32 fixed point format:

r = (x * 858993459) >> 32

and we just divided x by 5, or multiplied it by 0.2, whatever…

Example: Rotation Matrix

Here's a typical scenario where I prefer fixed points. Suppose you have a rotation matrix, and you want to multiply it with a vector. Ok, a rotation matrix has values between -1 and 1, so why waste bits with the integer part or some exponent? We can simply use all the bits we have or want to the fractional part – all except one, since we need a sign. You can use two's complement or whatever other numbering system, though I recommend you use your processor's native negation format. Remember, the "point" in fixed point is imaginary – they are integers too like any other integer. However, I believe that you will need a "signed" right-shift if you use two's complement. In x86 asm, that is the sar instruction.

The vector can have whatever values you want – note, you don't need a fractional part since you usually don't multiply or divide vectors between them, because it's impossible (mathematically speaking; of course nothing stops you to make weird operations :).

So, you don't need to have the same format for both of the values – the matrix can have 0.32 format and the vector 32.0. It depends on your specific needs. You may say that "normals" are definetely invalid if we use this vector format, but truth be told, fixed point is "specific-purpose"! You don't need to use the same format for all vectors. A standard vector can have a 32.0 format and a "normal" to a surface a 0.32 format (since it is between 0 and 1). Don't be limited to use the same formats over and over again. Specific formats are the most powerful thing of fixed point over floating point.

I'm afraid I have a reason for not showing you how to do this, since doing it yourself makes you understand much more, trust me. And if you understood this article, it's a piece of cake! Consider it an exercise :)

Example: Blending Two RGB Colors with MULTIPLY Mode

A common representation that people use in RGB colors besides the 24-bit one is with floating point scalars. In this case you have Red, Green and Blue stored in 32-bit float each, up to a total of 96 bits per pixel. The numbers themselves are between 0 and 1. If you understood fixed point then you would most probably see how many bytes are wasted here, per color! In my opinion, this floating point stuff is really silly. Even if you want better precision than 24-bit RGB, use 48-bit RGB or 72-bit RGB. The 72-bit one has the same precision as the floating point one, yet it is smaller (this is because floats waste 1 byte per color with the exponent and sign, because they are useless in this situation).

Some guys keep telling me how are you going to do blending like multiply mode if the numbers RGB are between 0 and 255?. That's silly. Floating point is not magical, it's only the exponent that does the job. If we were to emulate this in float (that is without using the hardware FPU), then it would have been very obvious that it's a silly thing to do here. Float doesn't suck, it simply is inappropiate for this purpose. With a little math we can get this to work with "integers". Enough babble… let's get to the example.

In MULTIPLY blending mode, two colors are multiplied together. Since they are between 0 and 1, we expect a darker result unless one of them is 1. How are we going to do this if we have our numbers between 0 and 255? Very simple if we use our imagination and consider them fixed point numbers!

We can assume that each component is of the format 0.8, meaning 0 bits for integer and 8 bits for fractional (0...255). To keep the original format in the result (since we want the result to be between 0...255 too) we use the shift right, and that's it. Look how simple this is, only an extra shift after the multiplication!

r = (color1 * color2) >> 8

However, with this you are unable to represent the white value (255,255,255) in the multiplication since this operation requires it to be 256, not 255. We can quickly fix this with an extra increment:

r = (color1 * (color2+1)) >> 8

Of course you would need to do this for each component if you want to do it right. I only presented the pseudo-code example. I hope this enlightens you a bit. And unlike float, you can do a lot of other blending modes because the format is much simpler (without exponents, etc.) and you can also use the logical operators to make things even trickier. And it can be even trickier in assembly.


These are one of the many things where fixed point really kicks those "floats are for real numbers" freaks, but remember it's still 1s and 0s too, the point is imaginary. I'm not saying floating point sucks, I'm only saying that whoever says floating point are real numbers is wrong, since floating point means "a moving point", i. e., with the exponent, not "real" numbers. Of course it has certain advantages over fixed point, but that doesn't mean it's for "real" numbers, nor is fixed point. They are both just representations and formats in 1s and 0s. Just test and analyze stuff yourself, and discover cool tricks you can do – stuff that "high-level math" leaves blind :)


Continue to discussion board.

You can contact the author using e-mail grey_predator at yahoo com.


2007-12-211.0First public versionGabriel Ivancescu

(dates format correspond to ISO 8601)