## Floating type for dsp

- KVRian
- 965 posts since 17 Apr, 2005

As best I recall, there were one or more papers years ago on the common Motorola 56K fixed-point DSP chip.

The computational word size could be either 24-bit or 48-bit fixed point, and the conclusion was that IIR filters perform much better at 48 bits, especially at low frequencies where some coefficients can be very small.

Personally, I try to do as many computations in double as possible, and keep temp results in doubles whenever it is convenient. But even if that is the proper decision, I'm quickly falling further behind. I have a bunch of objects and functions written in FPU asm for speed, and if I ever bother to compile 64-bit apps, I will most likely have to either go back to high-level code or rewrite all that crap in SSE, which as best I understand has some differences in container size and such.

- KVRAF
- 4191 posts since 11 Feb, 2006, from Helsinki, Finland

It's worth noting that a numerically poor algorithm (direct forms anyone?) in double precision can be worse than a good algorithm in single precision.

- KVRian
- 736 posts since 1 Dec, 2004

Imho, both float and double are appropriate. I think it's not very common to see cases where float has insufficient precision - float is perfectly appropriate, except as an accumulator for sample position and a few other similar "time accumulator" cases. As a format for audio stream from effect to effect, I think it's perfect.
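To make the "time accumulator" caveat concrete, here is a minimal Python sketch. It emulates single precision by rounding through `struct` after every addition; the helper names `f32` and `count_samples_f32` are made up for this illustration and are not from any library. It shows a float sample counter stalling at 2^24, which is only about 6.3 minutes of audio at 44.1 kHz:

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def count_samples_f32(start, steps):
    """Advance a single-precision sample counter by 1.0 per step."""
    pos = f32(start)
    for _ in range(steps):
        # Each addition is rounded back to single precision, as a float
        # accumulator would be.
        pos = f32(pos + 1.0)
    return pos


if __name__ == "__main__":
    limit = 2.0 ** 24
    print(count_samples_f32(limit - 3.0, 3))  # 16777216.0, still exact
    print(count_samples_f32(limit, 10))       # 16777216.0, the counter is stuck
    print(limit / 44100.0)                    # seconds until that happens at 44.1 kHz
```

Past 2^24 the spacing between adjacent floats is 2, so adding 1.0 rounds straight back to the old value. A double (or an integer) counter does not hit this for any realistic session length.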

It's also not very common to see an appreciable speed gain when switching from double to float: you get to process 4 floats at the same time instead of 2 doubles if you use SSE, but using SSE isn't very common. Afaik the most important speed gain is that it takes half as much space (so it eats up half as much memory bandwidth and it uses half as much cache).

- KVRian
- 965 posts since 17 Apr, 2005

abique wrote: Is SSE something that can be generated for you by the compiler, or do you need to write asm yourself?

Thanks.

I'd like to know that as well. Well, something like it.

The limited reading I've done gives the impression that with some compilers, the default is usually SSE when compiling for 64 bit. In some compilers, the default appears to be FPU when compiling for 32 bit. Then it seems that some compilers give you a choice of either FPU or SSE for 32 bit.

Additionally, it seems that some allow a switch to compile for FPU in 64 bit mode. I haven't studied it much, and don't even know whether the FPU will work or not in 64 bit mode.

I guess I'd have to rework my asm to be 64-bit compatible in addressing modes and such, but if it is indeed possible to use the FPU in 64 bit mode, it ought to save me a lot of work if I ever build for 64 bit.

- KVRAF
- 4191 posts since 11 Feb, 2006, from Helsinki, Finland

Well, if you tell your compiler that it's allowed to use SSE (or SSE2 in the case of doubles), then basically all floating point math will automatically result in SSE code (in the sense that the SSE unit will be used instead of the FPU). In 64-bit this is the default, since SSE2 is part of the baseline x86-64 instruction set.

But what MadBrain was referring to with 2-way vs. 4-way is the SIMD operations that work on multiple values at a time, and for the most part that involves you writing those manually. There's no need to use asm; you can use compiler intrinsics and still let the compiler deal with things like register allocation for you.

Now, some compilers do have some sort of auto-vectorization that is supposed to do some of this automatically, at least in the cases where it's reasonably straightforward, turning regular floating point code into SIMD code. YMMV.

- KVRist
- 228 posts since 15 Apr, 2012, from Toronto, ON

mystran wrote: There's no need to use asm, you can use compiler intrinsics and let the compiler still deal with things like register allocation for you.

I believe you still need to watch out for register spilling, however. If memory serves, there are 8 registers for SSE and 16 for AVX, and if your intrinsics use more than that, the compiler will have to spill temporary values to the stack and reload them, which will basically wipe out the performance gains you get from SIMD operations.

- KVRAF
- 4191 posts since 11 Feb, 2006, from Helsinki, Finland

LemonLime wrote: mystran wrote: There's no need to use asm, you can use compiler intrinsics and let the compiler still deal with things like register allocation for you.

I believe you still need to watch out for register spilling, however. If memory serves, there are 8 registers for SSE and 16 for AVX, and if your intrinsics use more than that, the compiler will have to push and pop temporary values onto the stack which will basically wipe out the performance gains you get from SIMD operations.

It's no different from any other code you write; if you have too many live variables (the actual number of variables is usually irrelevant) you end up with some of them spilled. That's why compilers do register allocation in the first place: its purpose is basically to find a way to minimize the spill costs. Typical heuristics will usually spill something long-lived though, since one can then typically allocate the same registers to a large number of short-lived temporaries.

Anyway, you have 8 registers for SSE in 32-bit, and 16 in 64-bit. These are the same registers that the compiler will use for floating point code too, when it's not using x87.

- KVRAF
- 4645 posts since 16 Feb, 2005

mystran wrote: It's worth noting that a numerically poor algorithm (direct forms anyone?) in double precision can be worse than a good algorithm in single precision.

Of course! Using fewer operations to do the same thing is always the edge to less noise.

I have been researching it for a little while, and just now I have been looking at rounding error as a function of the input domain. If you have 6-bit integer inputs (converted to float with no error), you can multiply up to 4 of them together and every intermediate result stays under 2^24, so there is 0!! error from the mantissa.

The same arrangement in double could use a 13-bit integer domain (4 x 13 = 52 bits, which fits in the 53-bit significand).
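A small Python sketch of that claim, for anyone who wants to poke at it. Single precision is emulated by rounding through `struct` after each multiply, and Python's own floats stand in for double. The function names are made up for this illustration:

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def product_is_exact_f32(factors):
    """Multiply integers left to right in emulated single precision and
    report whether the result equals the exact integer product."""
    acc, exact = 1.0, 1
    for v in factors:
        acc = f32(acc * float(v))  # rounded to binary32 after every multiply
        exact *= v                 # Python ints are exact
    return int(acc) == exact


def product_is_exact_f64(factors):
    """Same check, using Python's native double-precision floats."""
    acc, exact = 1.0, 1
    for v in factors:
        acc *= float(v)
        exact *= v
    return int(acc) == exact


if __name__ == "__main__":
    print(product_is_exact_f32([63, 63, 63, 63]))      # True: 63**4 < 2**24
    print(product_is_exact_f32([127, 127, 127, 127]))  # False: 7-bit inputs overflow the mantissa
    print(product_is_exact_f64([8191] * 4))            # True: 8191**4 < 2**53
    print(product_is_exact_f64([16383] * 4))           # False: 14-bit inputs are too wide
```

The worst case is the largest value in the domain, so checking 63 and 8191 covers the whole 6-bit and 13-bit ranges: any smaller product is an integer below the same limit and therefore also exactly representable.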

- KVRAF
- 4191 posts since 11 Feb, 2006, from Helsinki, Finland

camsr wrote: mystran wrote: It's worth noting that a numerically poor algorithm (direct forms anyone?) in double precision can be worse than a good algorithm in single precision.

Of course! Using fewer operations to do the same thing is always the edge to less noise.

Actually it's not quite that simple. It's the magnitudes of values that are typically important, not so much the number of operations, and often you can improve things by actually doing a bit more calculations!

For example, the basic 2D rotation algorithm that can be used as a generator for exponential sinusoids (x is the cosine, y is the sine, obviously):

```
newX = oldX * cos(p) + oldY * sin(p)
newY = oldY * cos(p) - oldX * sin(p)
```

This is an example of an algorithm that suffers from loss of precision with small angles, as the cosine term is very close to 1 and the sine term is very close to 0. As a result, you'll end up with pretty poor rounding behavior.

The same thing can be written instead as:

```
tmpX = oldX * (-2*sin(p/2)^2) + oldY * sin(p)
tmpY = oldY * (-2*sin(p/2)^2) - oldX * sin(p)

newX = oldX + tmpX
newY = oldY + tmpY
```

This is the same thing using cos(p) = 1 - 2*sin(p/2)^2, and you get the original formula by optimizing the identity out, but in the "less optimal" version the temporaries are calculated with two coefficients that are very close to 0, and the error accumulation will be dramatically slower.

It's basically the same problem with cosines that causes direct forms to perform so poorly at low frequencies, and by using something like the modified coupled form (which can be extended to a general filter) or a state variable filter, you get rid of the cosines and the whole low-frequency problem basically vanishes... but in terms of raw number of operations, those will be worse than your typical direct form.

In these cases, the extra cost is fairly negligible though. In other cases, you might end up with much uglier trade-offs (where the cost of keeping precision might be much larger).
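For anyone who wants to try the two rotation forms side by side, here is a rough Python sketch. It emulates single precision by rounding every intermediate through `struct`, runs each recursion from (1, 0), and reports how far the amplitude has wandered from 1. The function names, the angle of 0.001 rad per sample and the step count are arbitrary choices for this illustration, and the second form takes tmpY as oldY*k - oldX*sin(p) with k = -2*sin(p/2)^2, so that it is algebraically the same rotation as the first:

```python
import math
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def amplitude_drift(p, steps, improved):
    """Run the rotation recursion in emulated single precision and return
    |sqrt(x^2 + y^2) - 1| after the given number of steps."""
    s = f32(math.sin(p))
    c = f32(math.cos(p))                    # close to 1 for small p
    k = f32(-2.0 * math.sin(p / 2.0) ** 2)  # equals cos(p) - 1, close to 0
    x, y = 1.0, 0.0
    for _ in range(steps):
        if improved:
            tx = f32(f32(x * k) + f32(y * s))
            ty = f32(f32(y * k) - f32(x * s))
            x, y = f32(x + tx), f32(y + ty)
        else:
            nx = f32(f32(x * c) + f32(y * s))
            ny = f32(f32(y * c) - f32(x * s))
            x, y = nx, ny
    return abs(math.hypot(x, y) - 1.0)


if __name__ == "__main__":
    p, steps = 0.001, 100_000
    print("cos form :", amplitude_drift(p, steps, improved=False))
    print("1+k form :", amplitude_drift(p, steps, improved=True))
```

With these settings one would expect the cos form to drift visibly (the rounded cosine makes each step a slight gain rather than a pure rotation), while the 1+k form should stay much closer to unit amplitude. The exact figures depend on the angle and the number of steps.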

- KVRAF
- 4645 posts since 16 Feb, 2005

mystran wrote: camsr wrote: mystran wrote: It's worth noting that a numerically poor algorithm (direct forms anyone?) in double precision can be worse than a good algorithm in single precision.

Of course! Using fewer operations to do the same thing is always the edge to less noise.

Actually it's not quite that simple. It's the magnitudes of values that are typically important, not so much the number of operations, and often you can improve things by actually doing a bit more calculations!

Well, I was speaking strictly in terms of multiplication. Addition is an entirely different problem that suffers when addends are further from 0.

Since addition's resolution is limited by the domain AND the mantissa, it stands to reason that values more true to the number line are more accurate. Multiplication is different in that we can use the exponent to do 100% accurate ops over the entire domain; for example, by adding to or subtracting from the exponent, a perfect multiplication by a power of 2 can be had. The mantissa once again limits how many values will actually fall on a "perfect float". What I am checking on is how much the range is shrunk by multiplying sequentially, and how far a float can go with a*b*c*d*... before ANY quantization is introduced.
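A tiny Python sketch of the two behaviours described here, with single precision emulated by rounding through `struct` (the helper names are invented for this example). Scaling by a power of 2 leaves the mantissa untouched as long as the result stays in the normal range, while an addend that is too small relative to the other one simply disappears:

```python
import math
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def same_mantissa_after_scaling(x, power):
    """True if multiplying by 2**power in single precision only changes
    the exponent (math.frexp returns the mantissa as its first element)."""
    a = f32(x)
    b = f32(a * 2.0 ** power)
    return math.frexp(a)[0] == math.frexp(b)[0]


def small_addend_is_lost(big, small):
    """True if adding `small` to `big` in single precision changes nothing."""
    return f32(f32(big) + f32(small)) == f32(big)


if __name__ == "__main__":
    print(same_mantissa_after_scaling(0.1, 10))        # True
    print(same_mantissa_after_scaling(0.1, -20))       # True
    print(small_addend_is_lost(1.0, 2.0 ** -25))       # True: below half an ulp of 1.0
    print(small_addend_is_lost(2.0 ** -20, 2.0 ** -25))  # False: both fit in one mantissa
```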

- KVRAF
- 4191 posts since 11 Feb, 2006, from Helsinki, Finland

camsr wrote: Since addition's resolution is limited by the domain AND the mantissa, it only stands that values more true to the number line are more accurate. Multiplication is different in the fact that we can use the exponent to do 100% accurate ops over the entire domain, example being by adding or subtracting the exponent a perfect multiplication by a power of 2 can be had.

Oh right... but even with multiplies, one would typically like any sensitive coefficients to be small, so they can be represented accurately [edit: well, meaning a small deviation in values of large scale should not result in a large deviation in the results]. Returning to my previous example, simply storing a (fixed) cosine coefficient for a small angle (so it's close to 1, forcing a "large" exponent) will result in more deviation than storing something like -2*sin(p/2)^2, which is close to zero all the way. So in this case you can lose precision even before any "runtime" calculation is actually done.
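To put a number on the coefficient storage point, here is a short Python sketch. It rounds each coefficient to single precision via `struct` and then recovers, in double precision, the angle that the stored value actually encodes. The function names and the test angle of 0.001 rad per sample (roughly 7 Hz at 44.1 kHz) are assumptions made for the example:

```python
import math
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def angle_from_stored_cos(p):
    """Angle encoded by cos(p) after storing it in single precision."""
    return math.acos(f32(math.cos(p)))


def angle_from_stored_k(p):
    """Angle encoded by k = -2*sin(p/2)^2 after storing it in single precision."""
    k = f32(-2.0 * math.sin(p / 2.0) ** 2)
    return 2.0 * math.asin(math.sqrt(-k / 2.0))


if __name__ == "__main__":
    p = 0.001
    print(f32(math.cos(p)) == 1.0 - 2.0 ** -21)       # True: only 8 float steps below 1.0
    print(abs(angle_from_stored_cos(p) - p) / p)      # relative error of a couple of percent
    print(abs(angle_from_stored_k(p) - p) / p)        # relative error many orders smaller
```

Near 1.0 the single-precision grid has a spacing of 2^-24, and cos(0.001) sits only about 8.4 grid steps below 1, so rounding it to the nearest step changes the encoded angle by roughly 2%. The near-zero coefficient keeps its full 24 bits of relative precision.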
