KVR Audio

sonigen · Post by **sonigen** » Tue Aug 06, 2013 11:05 pm

Alright, I've been messing about with saturation curves again and have found a TANH approximation that is pretty damn accurate considering how cheap it is...

Code:


// Algorithm...

x = x / 3.4;
x = clip(x, -1, +1);
x = (abs(x)-2)*x;
x = (abs(x)-2)*x;

// ABS constant

__declspec(align(16)) struct U32Quad
{
    uint32 a,b,c,d;
};

const U32Quad MASK_FOR_ABS = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF};

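// SSE code (scalar; the literal operands stand in for memory constants)

// XMM0 = x
MULSS   XMM0,0.29411f       // (1/3.4)
MINSS   XMM0,1.0f           // clip to +1
MAXSS   XMM0,-1.0f          // clip to -1
MOVSS   XMM1,XMM0           // keep a signed copy of x
ANDPS   XMM0,MASK_FOR_ABS   // abs(x)
SUBSS   XMM0,2.0f
MULSS   XMM0,XMM1           // (abs(x)-2)*x
MOVSS   XMM1,XMM0
ANDPS   XMM0,MASK_FOR_ABS
SUBSS   XMM0,2.0f
MULSS   XMM0,XMM1           // second stage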

Clocks in at approximately 20 cycles on my cpu.

It looks like this...

red is tanh, yellow is the imposter.
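
For anyone who wants to check the curve without the plot, here is a minimal scalar sketch in C++. The names fast_tanh_ref and max_abs_error are made up for illustration, and std::tanh is used as the reference; it is only a straightforward transcription of the four algorithm lines above, not a tuned implementation.

```cpp
#include <algorithm>
#include <cmath>

// Scalar reference for the two-stage curve described above.
// fast_tanh_ref is a hypothetical name, not from the original post.
inline float fast_tanh_ref(float x)
{
    x *= 1.0f / 3.4f;                        // pre-scale so that +-3.4 maps to +-1
    x = std::min(1.0f, std::max(-1.0f, x));  // clip(x, -1, +1)
    x = (std::fabs(x) - 2.0f) * x;           // first stage (flips the sign)
    x = (std::fabs(x) - 2.0f) * x;           // second stage (flips it back)
    return x;
}

// Largest absolute deviation from std::tanh over [-range, +range],
// sampled at steps + 1 evenly spaced points.
inline float max_abs_error(float range, int steps)
{
    float worst = 0.0f;
    for (int i = 0; i <= steps; ++i) {
        const float x = -range + 2.0f * range * i / steps;
        worst = std::max(worst, std::fabs(fast_tanh_ref(x) - std::tanh(x)));
    }
    return worst;
}
```

Calling max_abs_error(5.0f, 1000) gives a rough figure for how far the yellow curve strays from the red one over [-5, +5].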

mkdr · Post by **mkdr** » Wed Aug 07, 2013 6:55 am

sonigen wrote: Alright, I've been messing about with saturation curves again and have found a TANH approximation that is pretty damn accurate considering how cheap it is...

Looks very good. Got to try this on some of my algos that use tanh.

You've been doing SSE for a while haven't you? Avoiding DIVPS was so P4 time

Btw. post the definition of MASK_FOR_ABS too. It might cause confusion.
It's something like 2147483647, right?

sonigen · Post by **sonigen** » Wed Aug 07, 2013 8:02 am

Ok added that now.

mkdr wrote: Avoiding DIVPS was so P4 time

Oh no, I still avoid them like the plague. I just didn't bother in the C code as I figured people know to do that themselves.

mkdr · Post by **mkdr** » Wed Aug 07, 2013 8:39 am

Your MASK_FOR_ABS definition reminded me: a cool thing to do is to use it on 4 parallel audio streams, at no extra CPU cost. Just use the PS instructions instead of SS. You just need to get x (the actual audio) into the four lanes of an XMM register, either with MOVSS and SHUFPS or, if you can get your compiler to lay out a 4-element table, with a single MOVAPS (the way MASK_FOR_ABS is done).

Code:

// SSE code

// XMM0 = x
MULPS   XMM0,0.29411f; //  (1/3.4)
MINPS   XMM0,1.0f
MAXPS   XMM0,-1.0f
MOVAPS  XMM1,XMM0
ANDPS   XMM0,MASK_FOR_ABS
SUBPS   XMM0,2.0f
MULPS   XMM0,XMM1
MOVAPS  XMM1,XMM0
ANDPS   XMM0,MASK_FOR_ABS
SUBPS   XMM0,2.0f
MULPS   XMM0,XMM1

Clocks in at approximately 20 cycles on my cpu.

With 4 tanh's calculated at once, it's now approximately 5 cycles per one tanh.
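
The same four-lane idea can also be written with compiler intrinsics instead of assembly. This is only a sketch: fast_tanh_ps and fast_tanh_4 are hypothetical names, the constants are built with _mm_set1_ps rather than loaded from an aligned table, and nothing here has been timed against the assembly version.

```cpp
#include <xmmintrin.h> // SSE: _mm_mul_ps, _mm_min_ps, _mm_max_ps, ...
#include <emmintrin.h> // SSE2: _mm_set1_epi32, _mm_castsi128_ps

// Four-lane version of the curve; each lane holds one audio stream.
// fast_tanh_ps is a hypothetical name, not from the thread.
inline __m128 fast_tanh_ps(__m128 x)
{
    const __m128 abs_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
    const __m128 two      = _mm_set1_ps(2.0f);

    x = _mm_mul_ps(x, _mm_set1_ps(1.0f / 3.4f));          // MULPS
    x = _mm_min_ps(x, _mm_set1_ps(1.0f));                 // MINPS
    x = _mm_max_ps(x, _mm_set1_ps(-1.0f));                // MAXPS

    __m128 t = _mm_sub_ps(_mm_and_ps(x, abs_mask), two);  // ANDPS, SUBPS
    x = _mm_mul_ps(t, x);                                 // MULPS (first stage)

    t = _mm_sub_ps(_mm_and_ps(x, abs_mask), two);
    return _mm_mul_ps(t, x);                              // second stage
}

// Convenience wrapper: process four samples held in a plain array.
inline void fast_tanh_4(const float in[4], float out[4])
{
    _mm_storeu_ps(out, fast_tanh_ps(_mm_loadu_ps(in)));
}
```

With intrinsics the compiler does the register allocation, so the function can be inlined into a larger loop without the load/store to the stack that an inline asm block forces.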

FastTriggerFish · Post by **FastTriggerFish** » Wed Aug 07, 2013 12:45 pm

Nice one, thanks for posting !
I've got a general question about writing assembly / SSE by hand :
is it really worth it ?
I'm an experienced developer of scientific algos but I've never bothered with that kind of low level optimisation because the dev time would not be viable in my industry, especially since we need to support a wide array of platforms.

But if you do have the time to do this, can you really do better than a good compiler - say LLVM ?
If you write optimisation friendly code ( declaring as const what is const, making loops easily unroll-able etc ) and max out the compiler optimisation flags and then compare with your hand written assembly or SSE, is the difference going to be really significant ?
I'm quite curious about this.

mystran · Post by **mystran** » Wed Aug 07, 2013 2:50 pm

What do you guys think about this:

http://txt.arboreus.com/2013/03/29/fast-sigmoid.html

sonigen · Post by **sonigen** » Wed Aug 07, 2013 7:13 pm

mystran wrote: What do you guys think about this:

http://txt.arboreus.com/2013/03/29/fast-sigmoid.html

There's no way tanh() is that fast. Tanh is about 14 times slower than the fastest sigmoid on my cpu. My guess is either compiler optimization buggering things up, or maybe the fact he's feeding in the raw output of rand(), maybe that shortcuts those functions when they have daft input.

What's the result of tanh(rand()) 99.99% of the time, +-1??
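
A quick way to check that guess, as a sketch: rand() returns integers from 0 to RAND_MAX, and in double precision tanh() rounds to exactly 1.0 once its argument gets past roughly 20. The helper below (count_unsaturated is a made-up name) counts how many integer inputs below a limit are not yet saturated.

```cpp
#include <cmath>

// Counts how many integer inputs in [0, limit) give a tanh result that is
// not yet exactly 1.0 in double precision. Hypothetical helper for illustration.
inline int count_unsaturated(int limit)
{
    int count = 0;
    for (int i = 0; i < limit; ++i)
        if (std::tanh(static_cast<double>(i)) != 1.0)
            ++count;
    return count;
}
```

With a RAND_MAX of 32767, the smallest value the C standard allows, count_unsaturated(32768) leaves only about twenty inputs out of 32768 that exercise the function at all; with a larger RAND_MAX the fraction is smaller still.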

He's on GCC and I'm on MSVC, so I can't investigate and find out for sure.

Anyway, I did some tests. I coded it all in SSE, because that was the only way to get it all in one instruction set.

y = x / (abs(x)+1)

==> 6.6ns

x = x / 3.4;
x = clip(x, -1, +1);
x = (abs(x)-2)*x;
x = (abs(x)-2)*x;

==> 4.2ns

x = clip(x, -1, +1);
x = (abs(x)-2)*x;

==> 3.0ns
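
For reference, the first and last of the three candidates look like this in plain C++. This is a sketch with hypothetical function names. One assumption is flagged in the comments: taken literally, a single pass of (abs(x)-2)*x is negative for positive x, so the sketch writes it as (2-abs(x))*x to keep the output in phase with the input.

```cpp
#include <algorithm>
#include <cmath>

// y = x / (abs(x) + 1): the division-based sigmoid from the first timing.
inline float sigmoid_div(float x)
{
    return x / (std::fabs(x) + 1.0f);
}

// Single-stage variant from the last timing. As literally written,
// (abs(x) - 2) * x is negative for positive x, so this sketch uses
// (2 - abs(x)) * x instead (an assumption, not the posted code).
inline float sigmoid_one_stage(float x)
{
    x = std::min(1.0f, std::max(-1.0f, x));  // clip(x, -1, +1)
    return (2.0f - std::fabs(x)) * x;
}
```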

Code:

__declspec(align(16)) struct U32Quad 
{ 
    unsigned int a,b,c,d; 
}; 

const U32Quad MASK_FOR_ABS = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF}; 
const float coef134 = 0.29411f;
const float pos1 = 1.0f;
const float neg1 = -1.0f;
const float two = 2.0f;

float withFABS(float x)
{
    __asm
    {
    MOVSS   XMM0,x
    MOVSS   XMM1,XMM0 
    ANDPS   XMM1,MASK_FOR_ABS 
    ADDSS   XMM1,pos1
    DIVSS   XMM0,XMM1 
    MOVSS   x,XMM0
    }
    return x;
}

float ABSXM2X(float x)
{
    __asm
    {
    MOVSS   XMM0,x
    MULSS   XMM0,coef134
    MINSS   XMM0,pos1 
    MAXSS   XMM0,neg1 
    MOVSS   XMM1,XMM0 
    ANDPS   XMM0,MASK_FOR_ABS 
    SUBSS   XMM0,two
    MULSS   XMM0,XMM1 
    MOVSS   XMM1,XMM0 
    ANDPS   XMM0,MASK_FOR_ABS 
    SUBSS   XMM0,two
    MULSS   XMM0,XMM1   // (abs(x)-2)*x, second stage
    MOVSS   x,XMM0
    }
    return x;
}

float ABSXM2XFAST(float x)
{
    __asm
    {
    MOVSS   XMM0,x
    MULSS   XMM0,coef134
    MINSS   XMM0,pos1 
    MAXSS   XMM0,neg1 
    MOVSS   XMM1,XMM0 
    ANDPS   XMM0,MASK_FOR_ABS 
    SUBSS   XMM0,two
    MULSS   XMM0,XMM1 
    MOVSS   x,XMM0
    }
    return x;
}

arakula · Post by **arakula** » Wed Aug 07, 2013 7:22 pm

I just tried it, totally unoptimized, just to compare its output to the one I currently use, which is based on the algorithm presented in http://www.musicdsp.org/showone.php?id=238 - so I tested with

Code:

    x = x / 3.4f;
    if (x < -1.f)
      x = -1.f;
    else if (x > 1.f)
      x = 1.f;
    else
      {
      x = (fabsf(x) - 2.f) * x;
      x = (fabsf(x) - 2.f) * x;
      }

Looking at what gnuplot displays for [-5..+5], your approximation seems to be a bit closer to tanh than the musicdsp one. Except for approx. [-0.5..+0.5] - in this area, it "overshoots" a bit.

mystran · Post by **mystran** » Wed Aug 07, 2013 7:39 pm

sonigen wrote:
mystran wrote: What do you guys think about this:

http://txt.arboreus.com/2013/03/29/fast-sigmoid.html

There's no way tanh() and exp() are that fast. Tanh is about 14 times slower than the fastest sigmoid on my cpu. My guess is either compiler optimization buggering things up, or maybe the fact he's feeding in the raw output of rand(), maybe that shortcuts those functions when they have daft input.

Well, they are compiler intrinsics (or library functions) so it's probably a case of GCC having a better implementation.

Btw, have you compared the performance of MSVC SSE intrinsics vs inline assembly functions? With GCC you can let the compiler do register allocation for small assembly blocks, but with MSVC that doesn't work, so I wonder if you could avoid some shuffling/spilling overhead by letting MSVC generate the code for you.

sonigen · Post by **sonigen** » Wed Aug 07, 2013 8:27 pm

mystran wrote: Well, they are compiler intrinsics (or library functions) so it's probably a case of GCC having a better implementation.

There's no way to calculate tanh that quickly. Seriously, test y = x / (abs(x)+1) vs tanh() with some sensible inputs. It'll be 10x slower, not 10% slower.

mystran wrote: Btw, have you compared the performance of MSVC SSE intrinsics vs inline assembly functions? With GCC you can let the compiler do register allocation for small assembly blocks, but with MSVC that doesn't work, so I wonder if you could avoid some shuffling/spilling overhead by letting MSVC generate the code for you.

I checked the disassembly, they were all inlined, the only redundant code is that one extra load/store to the stack. And it's the same for each function. Obviously it'd be different if there was more going on in the surrounding code but it's just a loop with an inline function.

sonigen · Post by **sonigen** » Wed Aug 07, 2013 8:57 pm

FastTriggerFish wrote: Nice one, thanks for posting !
I've got a general question about writing assembly / SSE by hand :
is it really worth it ?

In some circumstances. For example, in x86 asm you can do a 32-bit unsigned integer multiply and get a 64-bit result spread across 2 registers. That's really useful for lookup tables. For 3 cycles you get the integer part in one register and the fractional part in another. You can't get the compiler to do that for you.
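
To make the lookup-table trick concrete, here is a sketch of the same split expressed in C++ through a 64-bit product. TablePos, table_position and the table_size parameter are hypothetical names, and whether a given compiler turns this into a single 32-bit MUL is something to check in the disassembly rather than assume.

```cpp
#include <cstdint>

// Result of splitting a 32-bit phase into a table index and a fraction.
struct TablePos {
    std::uint32_t index; // high 32 bits of the product (what MUL leaves in EDX)
    std::uint32_t frac;  // low 32 bits of the product (what MUL leaves in EAX)
};

// phase covers one full cycle over 0 .. 2^32; table_size is the number of entries.
inline TablePos table_position(std::uint32_t phase, std::uint32_t table_size)
{
    const std::uint64_t product = static_cast<std::uint64_t>(phase) * table_size;
    return { static_cast<std::uint32_t>(product >> 32),
             static_cast<std::uint32_t>(product) };
}
```

For example, a phase of 0x80000000 (half a cycle) with a 256-entry table lands on index 128 with a zero fraction.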

Another useful thing is that you can branch based on the CPU flags after they are set by an arithmetic op. For example, you can add the phase step to an accumulator and branch if it overflows. You don't actually need to do a compare. In C++ you'd have to do "if (accum < step)" or something like that. (I'll eat my own elbow if there's a compiler that will see that it could skip the compare and just do a JC)
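
The C++ form of that accumulator, spelled out as a small sketch (phase_advance is a hypothetical name). It relies only on the fact that unsigned arithmetic wraps modulo 2^32, so after the add the sum is smaller than the step exactly when a carry occurred.

```cpp
#include <cstdint>

// Advances a 32-bit phase accumulator and reports whether it wrapped.
// In assembly this would be ADD followed by JC; here the wrap is detected
// with the explicit compare mentioned above.
inline bool phase_advance(std::uint32_t &accum, std::uint32_t step)
{
    accum += step;        // unsigned arithmetic wraps modulo 2^32
    return accum < step;  // true exactly when the addition carried out
}
```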

So there are things you can do in asm that you can't get from the compiler, because there's no way to provide enough information for the compiler to generate that code.

That said, the opportunities for those kinds of optimizations are limited.

For SSE I don't know of any compiler that can parallelize your code for you. At least not in the sense that it could automatically parallelize 4 biquads. You have to do that yourself, either with intrinsics or in asm. At least AFAIK.

FastTriggerFish · Post by **FastTriggerFish** » Wed Aug 07, 2013 9:40 pm

Interesting, these are indeed some nice tricks.
Disclaimer : I haven't had the opportunity to look at that stuff closely for a long time and I don't know about MSVC, but GCC, intel compiler and LLVM are all capable of auto-vectorization to some extent, see e.g
http://llvm.org/devmtg/2012-04-12/Slides/Hal_Finkel.pdf

A few years back I tried my hand at some SSE intrinsics and I remember gcc auto vectorisation smoked it, but I didn't go as far as looking at the disassembly ( I've always disliked x86 assembly so yes I have a bias ) so I wouldn't be able to say why.
For something like 4 biquads I agree clearly no compiler will auto parallelise it for you.

But if you wrote some pre-optimized code where everything is flattened out in loops of fixed size 4 where the iterations are independent, then I would have thought the compiler would do as well, or nearly as well, as hand optimised assembly.
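
As a sketch of what such "flattened" code could look like for the approximation in this thread: a loop with a fixed trip count of 4 and no dependency between iterations. The function name fast_tanh_block4 is hypothetical, and whether a particular compiler actually vectorizes it depends on the compiler and its flags.

```cpp
#include <cmath>

// Processes four independent samples per call. The fixed trip count and the
// absence of any dependency between iterations are what an auto-vectorizer
// looks for.
inline void fast_tanh_block4(const float *in, float *out)
{
    for (int i = 0; i < 4; ++i) {
        float x = in[i] * (1.0f / 3.4f);
        x = std::fmin(1.0f, std::fmax(-1.0f, x));  // clip(x, -1, +1)
        x = (std::fabs(x) - 2.0f) * x;             // first stage
        x = (std::fabs(x) - 2.0f) * x;             // second stage
        out[i] = x;
    }
}
```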

Richard_Synapse · Post by **Richard_Synapse** » Thu Aug 08, 2013 11:03 am

FastTriggerFish wrote: Interesting, these are indeed some nice tricks.
Disclaimer : I haven't had the opportunity to look at that stuff closely for a long time and I don't know about MSVC, but GCC, intel compiler and LLVM are all capable of auto-vectorization to some extent, see e.g
http://llvm.org/devmtg/2012-04-12/Slides/Hal_Finkel.pdf

A few years back I tried my hand at some SSE intrinsics and I remember gcc auto vectorisation smoked it, but I didn't go as far as looking at the disassembly ( I've always disliked x86 assembly so yes I have a bias ) so I wouldn't be able to say why.
For something like 4 biquads I agree clearly no compiler will auto parallelise it for you

True, but that's not needed. What's needed is just a basic data type e.g. a quad vector for SSE (why this isn't standard in every programming language is beyond me, surely it can be made abstract enough to work for more than just SSE). You can use wrapper classes like F32vec4 in C++ but there's no guarantee they will work as expected.

Richard

Ichad.c · Post by **Ichad.c** » Thu Aug 08, 2013 5:17 pm

FastTriggerFish wrote: Interesting, these are indeed some nice tricks.
Disclaimer : I haven't had the opportunity to look at that stuff closely for a long time and I don't know about MSVC, but GCC, intel compiler and LLVM are all capable of auto-vectorization to some extent, see e.g
http://llvm.org/devmtg/2012-04-12/Slides/Hal_Finkel.pdf

Yeah, gcc's auto-vectorizer does rock - but only when it works, which is rarely. Anything with 'memory' it won't auto-vectorize, so even with a simple filter like a biquad - it will protest. Haven't tried the new graphite vs. the old gimple method - might help because of the flattening. Has anybody here tried the new graphite method in GCC?
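
To illustrate the 'memory' problem, here is a minimal direct form I biquad as a sketch (the struct and function names are hypothetical). Each output sample depends on the two previous outputs, so iteration i cannot start until iterations i-1 and i-2 have finished, and the loop cannot simply be split into four independent lanes.

```cpp
// Direct form I biquad. Coefficients are assumed normalized so that a0 = 1.
struct Biquad {
    float b0, b1, b2, a1, a2;  // filter coefficients
    float x1, x2, y1, y2;      // previous two inputs and outputs (the "memory")
};

inline void biquad_process(Biquad &f, const float *in, float *out, int n)
{
    for (int i = 0; i < n; ++i) {
        const float x = in[i];
        // y depends on f.y1 and f.y2, which were written by earlier iterations.
        const float y = f.b0 * x + f.b1 * f.x1 + f.b2 * f.x2
                      - f.a1 * f.y1 - f.a2 * f.y2;
        f.x2 = f.x1; f.x1 = x;
        f.y2 = f.y1; f.y1 = y;
        out[i] = y;
    }
}
```

Running four separate filters side by side, one per SSE lane, avoids the dependency because the lanes never read each other's state.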

Andrew

Aleksey Vaneev · Post by **Aleksey Vaneev** » Fri Aug 09, 2013 1:16 pm

Here's a fast tanh function I've come up with using Eureqa. It is very precise (around 2.7% peak error, near 0.0), calculates very fast, and is public domain:

Code:

inline double vox_fasttanh( const double x )
{
	const double ax = fabs( x );
	const double x2 = x * x;
	const double z = x * ( 1.0 + ax +
		( 1.05622909486427 + 0.215166815390934 * x2 * ax ) * x2 );

	return( z / ( 1.02718982441289 + fabs( z )));
}

On my i7-3770K computer, using the latest Intel C++ Compiler, here are the times:
math.h tanh 6.91ns
tanh by original poster 3.78ns
vox_fasttanh 1.89ns

These benchmark times include the overhead of a for() loop arranged so that the compiler cannot apply any serious optimization other than loop unrolling:

Code:

const int RepCount = 1000000000;
const double vp = 2000.0 / RepCount;
double s = 0.0;
double v = -1000.0;
int i;
for( i = 0; i < RepCount; i++ )
{
	s += vox_fasttanh( v );
	v += vp;
}

The overhead of this loop is about 0.51 ns (checked by "s += v" instead of tanh function).

Fast TANH approximation