Random cheap sigmoid

DSP, Plugin and Host development discussion.

Post

martinvicanek,

The integro-differential method is one of the older aliasing-limiting techniques. It's stunning that they published it as new stuff.

Anyway, it works like a charm with x/sqrt(x^2+1) specifically, since its antiderivative is simply sqrt(x^2+1).

I've implemented this scheme on the Axoloti, and it is very efficient. On the little ARM STM32F4, the float 1/x and sqrt instructions run in parallel with integer operations, so interleaving fixed-point calculations leads to something efficient.

Code: Select all

int32_t outGain = param_OutGain<<3;
const int32_t *pIn = inlet_in;
int32_t *pOut = outlet_out - 1;
while(pOut < outlet_out + BUFSIZE - 1){
  float r;
  float y1 = y0;
  y0 = arm::vsqrtf(arm::q_to_float(nextSq1, 22));
  nextX0 = ___SMMUL(param_InGain, *pIn)<<7;     //execs in parallel
  nextSq1 = ___SMMLA(nextX0, nextX0, 1<<22);
  int32_t x1_x0 = x1 - x0;  // we avoid division by small values
  if(abs(x1_x0) > (1<<17)){  //the divider is not too small
    r = y1 - y0;           // go diff !
    r /= arm::q_to_float(x1_x0, 27);
  } else {  // it's too small use the direct value 
    r = arm::q_to_float(x0, 27) / y0;
  }
  x1 = x0;
  x0 = nextX0;
  pIn++;
  pOut++;
  *pOut = ___SMMUL(outGain, arm::float_to_q(r, 30));
}
See you here and there... Youtube, Google Play, SoundCloud...

Post

You don't even need to worry about division if you rewrite
(sqrt(1 + x^2) - sqrt(1+ x1^2))/(x - x1) = (x + x1)/(sqrt(1 + x^2) + sqrt(1 + x1^2))
:wink:
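The identity above removes the small-divisor branch entirely, since the rewritten denominator is always at least 2. A minimal first-order ADAA sketch built on it (struct and method names are mine, not from the thread):

```cpp
#include <cmath>

// Hypothetical sketch of first-order antiderivative antialiasing (ADAA)
// for the sigmoid x/sqrt(1 + x^2), whose antiderivative is sqrt(1 + x^2).
// The rationalized difference (x0 + x1)/(sqrt(1+x0^2) + sqrt(1+x1^2))
// equals (F(x0) - F(x1))/(x0 - x1), and its denominator is >= 2,
// so no "divider too small" guard is needed.
struct AdaaSigmoid {
    float x1 = 0.0f; // previous input sample

    float process(float x0) {
        float y = (x0 + x1) /
                  (std::sqrt(1.0f + x0 * x0) + std::sqrt(1.0f + x1 * x1));
        x1 = x0;
        return y;
    }
};
```

For x0 == x1 the expression reduces to the direct value x/sqrt(1 + x^2), so the repeated-sample case is handled for free as well.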

Post

Wow, that's a kind of magic!

Post

Old thread, but here's a tanh() approximation, though with not much control over the steepness of the curve (try the variable s on the Desmos sheet).

I'm not too familiar with intrinsics, so the SSE version may look awkward to some... :D

Code: Select all

#include <emmintrin.h>
#include <cmath>

float simple_tanh(float x)
{
    float x2 = x * x;
    if (std::abs(x) <= 9.01f) // float implementation saturates to ±1 at ~9.01
        return ((x * (2.0f * x2 + 1.0f)) /
                (std::abs(x * (2.0f * x2 + 1.0f)) + 1.0f));
    else
        return ((!std::signbit(x)) << 1) - 1; // return -1 or 1
}

__m128 simple_tanh_sse(__m128 x)
{
    static const __m128 sign_mask = _mm_set1_ps(-0.0f);
    const __m128 c0 = _mm_set1_ps(1.0f);
    const __m128 c1 = _mm_set1_ps(2.0f);

    __m128 x2 = _mm_mul_ps(x, x);

    __m128  p = _mm_mul_ps(x, _mm_add_ps(_mm_mul_ps(c1, x2), c0)); // x*(2x^2 + 1)
    __m128  q = _mm_add_ps(_mm_andnot_ps(sign_mask, p), c0);       // |p| + 1

    return _mm_div_ps(p, q);
}
Speed test (by 2DaT's routines):

Code: Select all

Approx. rdtsc/val... (datasize: 524288)

std::tanh(x)   simple_tanh    simple_tanh_sse
-------------------------------------------------
13.58519         3.34891            2.23675
https://www.desmos.com/calculator/spcb8ev1cm
https://godbolt.org/z/Y3h5xvabd
Last edited by juha_p on Tue May 04, 2021 8:58 am, edited 5 times in total.

Post

juha_p wrote: Mon May 03, 2021 9:34 pm

Code: Select all

    if(std::abs(x[0]) <= 9.01f) // float implementation seems to result ±1 at ~9.01f
        return _mm_div_ps(p, q);
    else
        return _mm_set1_ps(((!signbit(x[0])) << 1) - 1); // return -1 or 1
}
This code does not look reasonable. Did you check that it actually works (eg. with random numbers)?

You're testing one element and using it to branch and set sign for the whole vector? This type of branch will also predict very poorly if some of the samples exceed the threshold while others don't. Also, the original point of the thread was to use rsqrt instead of a slower division.
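That original rsqrt idea can be sketched for the thread's x/sqrt(1 + x^2) sigmoid roughly as follows (function name is mine; the estimate plus one Newton-Raphson step is a common refinement, not something prescribed in the thread):

```cpp
#include <xmmintrin.h>

// Hypothetical sketch: division-free sigmoid x/sqrt(1 + x^2)
// using _mm_rsqrt_ps plus one Newton-Raphson refinement step.
__m128 sigmoid_rsqrt(__m128 x)
{
    const __m128 one        = _mm_set1_ps(1.0f);
    const __m128 half       = _mm_set1_ps(0.5f);
    const __m128 three_half = _mm_set1_ps(1.5f);

    __m128 a = _mm_add_ps(_mm_mul_ps(x, x), one);   // 1 + x^2 (>= 1, so safe)
    __m128 r = _mm_rsqrt_ps(a);                     // ~12-bit estimate of 1/sqrt(a)
    // one Newton step: r = r * (1.5 - 0.5 * a * r * r)
    __m128 ar2 = _mm_mul_ps(_mm_mul_ps(a, r), r);
    r = _mm_mul_ps(r, _mm_sub_ps(three_half, _mm_mul_ps(half, ar2)));
    return _mm_mul_ps(x, r);                        // x / sqrt(1 + x^2)
}
```

Since 1 + x^2 never gets near zero, the rsqrt estimate is well behaved for all inputs, and there is no division anywhere in the pipeline.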

Post

mystran wrote: Mon May 03, 2021 10:34 pm
This code does not look reasonable. Did you check that it actually works (eg. with random numbers)?

You're testing one element and using it to branch and set sign for the whole vector? This type of branch will also predict very poorly if some of the samples exceed the threshold while others don't. Also, the original point of the thread was to use rsqrt instead of a slower division.
Sorry for hijacking your thread.

Actually, that (branching) code is not necessary if you don't need the output exactly ranged to [-1, 1] like tanh(), since [-1, 1] is only exceeded somewhere around x = ±203 ... so, forget that code part.
Last edited by juha_p on Thu May 06, 2021 9:00 pm, edited 1 time in total.

Post

juha_p wrote: Tue May 04, 2021 3:41 am Here's a result sample from my test run (with the branch code):
I was asking because I don't understand how the branch is supposed to work.

Does it still produce the correct answer if the SIMD vector is something like [12., -.1, -42, 1]?

Post

mystran wrote: Tue May 04, 2021 9:51 am ...
Does it still produce the correct answer if the SIMD vector is something [12., -.1,-42,1]?
Not the one you quoted in your previous reply. That was just for element p[0]. Here's the code, which should work for the whole vector:

Code: Select all

    for(int i = 0; i < 4; i++)
    {
            if (std::abs(x[i]) >= 9.01f){       // float implementation seems to result ±1 at around 9.01f
                p[i] = ((!signbit(p[i])) << 1) - 1;
            }
    }
    return p;
}
Last edited by juha_p on Thu May 06, 2021 9:02 pm, edited 2 times in total.

Post

juha_p wrote: Tue May 04, 2021 12:28 pm
mystran wrote: Tue May 04, 2021 9:51 am ...
Does it still produce the correct answer if the SIMD vector is something [12., -.1,-42,1]?
Not that one you quoted in your previous reply. That was just for p[0]. Here's the code which should work for whole vector:
But now you're potentially taking 4 branch prediction misses!

Also keep in mind that accessing individual elements of a vector (other than the first) is not free as it always involves some type of shuffling.

edit: Whenever testing the performance of an optimized routine (especially one meant for audio), I highly suggest testing with white noise, to maximize the probability of triggering issues like poor branch prediction. That gives a realistic idea of the worst-case performance, and it's ultimately the worst case that matters for real-time programming, even if a faster average case is certainly nice to have where possible.

Post

mystran wrote: Tue May 04, 2021 12:58 pm ...
Also keep in mind that accessing individual elements of a vector (other than the first) is not free as it always involves some type of shuffling.
Yes, that mostly unnecessary loop would decrease performance from ~2.2 to ~4.3 rdtsc/val (by 2DaT's test routine) ... still 3x faster than std::tanh() (by the same test).

Post

juha_p wrote: Tue May 04, 2021 12:28 pm

Code: Select all

    for(int i = 0; i < 4; i++)
    {
            if (std::abs(x[i]) >= 9.01f){       // float implementation seems to result ±1 at around 9.01f
                p[i] = ((!signbit(p[i])) << 1) - 1;
            }
    }
This defeats the purpose of SIMD in the first place (which is to process multiple data with a single instruction), and it compiles to horrible code. Indexing is not supported for __m128 types, so you must never use it (and where it does work, it's more expensive than working with actual arrays, especially on writes).

I wouldn't bother with "cheap" approximation methods such as Padé, etc. Check out this version:
viewtopic.php?p=7503081#p7503081

Division is expensive compared to add-muls, might as well use a better approximation.

Post

2DaT wrote: Tue May 04, 2021 8:02 pm Division is expensive compared to add-muls, might as well use a better approximation.
Right.. and the original goal of this particular thread was to not use divisions in the first place. Anyone doing divisions is in the wrong thread. :)

Post

2DaT wrote: Tue May 04, 2021 8:02 pm ...
This defeats the purpose of SIMD in the first place (which is - to process multiple data with single instruction). And this compiles to horrible code. Indexing is not supported for __m128 types, you must never use it (and if it works, it's more expensive than working with actual arrays, esp. on writes).
...
As I stated earlier, I'm not too good with intrinsics, so that is the result I ended up with (I first tried to search for other solutions, of course...). If you know how to do this properly, then why not open it up a bit ... or point to some example source code?

Post

juha_p wrote: Tue May 04, 2021 9:37 pm As I stated earlier I'm not too good with intrinsics so, that is the result I ended up (first tried to search other solutions of course...). If you know how to do this properly then why not open it up a bit ... or wink some example source codes ?
Use the _mm_min_ps and _mm_max_ps intrinsics. They do what you want without all that funky code.
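A hypothetical sketch of what that could look like (function name is mine): the same rational approximation as before, with a branch-free min/max clamp replacing the per-element saturation loop.

```cpp
#include <xmmintrin.h>

// Hypothetical sketch: p/(|p|+1) with p = x*(2x^2 + 1), clamped to
// [-1, 1] using _mm_min_ps/_mm_max_ps instead of per-element branching.
__m128 simple_tanh_sse_clamped(__m128 x)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    const __m128 one       = _mm_set1_ps(1.0f);
    const __m128 neg_one   = _mm_set1_ps(-1.0f);
    const __m128 two       = _mm_set1_ps(2.0f);

    __m128 x2 = _mm_mul_ps(x, x);
    __m128 p  = _mm_mul_ps(x, _mm_add_ps(_mm_mul_ps(two, x2), one)); // x*(2x^2 + 1)
    __m128 q  = _mm_add_ps(_mm_andnot_ps(sign_mask, p), one);        // |p| + 1
    __m128 t  = _mm_div_ps(p, q);

    // branch-free clamp: max(-1, min(1, t)), applied to all four lanes at once
    return _mm_max_ps(neg_one, _mm_min_ps(one, t));
}
```

Since |p|/(|p| + 1) can never exceed 1, the clamp is effectively a safety net against rounding; the point is that it costs two cheap instructions regardless of the input, with no branch prediction to worry about.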

Post

2DaT wrote: Wed May 05, 2021 6:46 am
juha_p wrote: Tue May 04, 2021 9:37 pm As I stated earlier I'm not too good with intrinsics so, that is the result I ended up (first tried to search other solutions of course...). If you know how to do this properly then why not open it up a bit ... or wink some example source codes ?
Using _mm_min_ and _mm_max_ intrinsics. It does what you want without all that funky code.
While min/max is a no-brainer here, it got me wondering: is AVX the minimum to actually do generalized branches?

Basically, with AVX one can do _mm_cmp_ps (or the wider equivalents) to get a mask for conditionally selecting values, then follow up with PTEST (_mm_test_all_zeros, etc.; SSE4.1) to set CF or ZF so the branches are only computed where at least one value is actually needed. Turing-equivalent computation is thus possible without ever unpacking the vectors, but is there some trick to do this with just SSE2 with a sensible amount of shuffling?
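The mask-and-select half of this needs no shuffling even pre-AVX; a hypothetical sketch (function names are mine), using the classic and/andnot/or blend, with _mm_movemask_ps (SSE1) standing in for PTEST as the scalar early-out:

```cpp
#include <xmmintrin.h>

// Hypothetical sketch of branchless per-lane select with SSE1/SSE2 only.
// mask lanes are all-ones or all-zeros (as produced by _mm_cmp*_ps).
static __m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    // per-lane: mask ? a : b
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

static __m128 clamp_above(__m128 x, __m128 limit)
{
    __m128 mask = _mm_cmpgt_ps(x, limit);  // lanes where x > limit
    if (_mm_movemask_ps(mask) == 0)        // sign bits of all lanes -> scalar
        return x;                          // early out: no lane exceeded limit
    return select_ps(mask, limit, x);      // blend without unpacking
}
```

Since the compare result is all-ones or all-zeros per lane, _mm_movemask_ps collapses it to a 4-bit scalar mask in one instruction, which covers the "is any lane set" test; what it cannot do is PTEST's direct flag-setting, so there is one extra GPR compare on the scalar side.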
