# Random cheap sigmoid

DSP, Plug-in and Host development discussion.
KVRist
142 posts since 10 Oct, 2014
martinvicanek,

The integro differential method is one of the older aliasing limiting technique. It's stunning that they published that as new stuff.

Anyway it works specifically like a charm with x/sqrt(x^2+1) as its integrand is simply sqrt(x^2+1).

I've implemented this scheme on the Axoloti and it is very efficient. On the little ARM STM32F4, the 1/x and sqrt float run in parallel with integer operations. Thus interleaving fixed point calculation leads to something efficient.

Code: Select all

``````int32_t outGain = param_OutGain<<3;
const int32_t *pIn = inlet_in;
int32_t *pOut = outlet_out - 1;
while(pOut < outlet_out + BUFSIZE - 1){
float r;
float y1 = y0;
y0 = arm::vsqrtf(arm::q_to_float(nextSq1, 22));
nextX0 = ___SMMUL(param_InGain, *pIn)<<7;     //execs in parallel
nextSq1 = ___SMMLA(nextX0, nextX0, 1<<22);
int32_t x1_x0 = x1 - x0;  // we avoid division by small values
if(abs(x1_x0) > (1<<17)){  //the divider is not too small
r = y1 - y0;           // go diff !
r /= arm::q_to_float(x1_x0, 27);
} else {  // it's too small use the direct value
r = arm::q_to_float(x0, 27) / y0;
}
x1 = x0;
x0 = nextX0;
pIn++;
pOut++;
*pOut = ___SMMUL(outGain, arm::float_to_q(r, 30));
}``````

KVRer
29 posts since 16 Mar, 2014
You dont even need to worry about division if you rewrite
(sqrt(1 + x^2) - sqrt(1+ x1^2))/(x - x1) = (x + x1)/(sqrt(1 + x^2) + sqrt(1 + x1^2))

KVRist
142 posts since 10 Oct, 2014
Wow, that's a kind of magic !

KVRian
655 posts since 21 Feb, 2006 from FI
Old thread but here's tanh() approximation, but, with not much control steepness of the curve (try variable s on Desmos sheet).

I'm not too familiar with intrinsics so, SSE version may look awkward for someones ... .

Code: Select all

``````#include <emmintrin.h>
#include <math.h>

float simple_tanh(float x)
{
float x2 = x * x;
if (std::abs(x) <= 9.01f) // float implementation seems to result ±1 at ~9.01f
return ((x * (2.0f * x2 + 1.0f))/
(std::abs(x * (2.0f * x2 + 1.0f)) + 1.0f)); // return -1 or 1
else
return ((!signbit(x)) << 1) - 1;
}

__m128 simple_tanh_sse(__m128 x)
{
static const __m128 sign_mask = _mm_set1_ps(-0.0f);
const __m128 c0 = _mm_set1_ps(1.0f);
const __m128 c1 = _mm_set1_ps(2.0f);

__m128 x2 = _mm_mul_ps(x, x);

__m128  p = _mm_mul_ps(x, _mm_add_ps(_mm_mul_ps(c1, x2), c0));

return _mm_div_ps(p, q);
}
``````
Speed test (by 2DaT's routines):

Code: Select all

``````Approx. rdtsc/val... (datasize: 524288)

std::tanh(x)   simple_tanh    simple_tanh_sse
-------------------------------------------------
13.58519         3.34891            2.23675
``````
https://www.desmos.com/calculator/spcb8ev1cm
https://godbolt.org/z/Y3h5xvabd
Last edited by juha_p on Tue May 04, 2021 12:58 am, edited 5 times in total.

KVRAF

Topic Starter

6321 posts since 12 Feb, 2006 from Helsinki, Finland
juha_p wrote:
Mon May 03, 2021 1:34 pm

Code: Select all

``````    if(std::abs(x[0]) <= 9.01f) // float implementation seems to result ±1 at ~9.01f
return _mm_div_ps(p, q);
else
return _mm_set1_ps(((!signbit(x[0])) << 1) - 1); // return -1 or 1
}
``````
This code does not look reasonable. Did you check that it actually works (eg. with random numbers)?

You're testing one element and using it to branch and set sign for the whole vector? This type of branch will also predict very poorly if some of the samples exceed the threshold while others don't. Also, the original point of the thread was to use rsqrt instead of a slower division.
Preferred pronouns would be "it/it" because according to this country, I'm a piece of human trash.

KVRian
655 posts since 21 Feb, 2006 from FI
mystran wrote:
Mon May 03, 2021 2:34 pm

This code does not look reasonable. Did you check that it actually works (eg. with random numbers)?

You're testing one element and using it to branch and set sign for the whole vector? This type of branch will also predict very poorly if some of the samples exceed the threshold while others don't. Also, the original point of the thread was to use rsqrt instead of a slower division.

Actually, that (branching) code is not necessary if you don't need to have it exactly ranged [-1,1] by following the tanh() output since, [-1,1} is exceeded somewhere around x=-203, 203] ... so, forget that code part.
Last edited by juha_p on Thu May 06, 2021 1:00 pm, edited 1 time in total.

KVRAF

Topic Starter

6321 posts since 12 Feb, 2006 from Helsinki, Finland
juha_p wrote:
Mon May 03, 2021 7:41 pm
Here's a result sample from my test run (with the branch code):
I was asking because I don't understand how the branch is supposed to work.

Does it still produce the correct answer if the SIMD vector is something [12., -.1,-42,1]?
Preferred pronouns would be "it/it" because according to this country, I'm a piece of human trash.

KVRian
655 posts since 21 Feb, 2006 from FI
mystran wrote:
Tue May 04, 2021 1:51 am
...
Does it still produce the correct answer if the SIMD vector is something [12., -.1,-42,1]?
Not that one you quoted in your previous reply. That was just for element p[0]. Here's the code which should work for whole vector:

Code: Select all

``````    for(int i = 0; i < 4; i++)
{
if (std::abs(x[i]) >= 9.01f){       // float implementation seems to result ±1 at around 9.01f
p[i] = ((!signbit(p[i])) << 1) - 1;
}
}
return p;
}``````
Last edited by juha_p on Thu May 06, 2021 1:02 pm, edited 2 times in total.

KVRAF

Topic Starter

6321 posts since 12 Feb, 2006 from Helsinki, Finland
juha_p wrote:
Tue May 04, 2021 4:28 am
mystran wrote:
Tue May 04, 2021 1:51 am
...
Does it still produce the correct answer if the SIMD vector is something [12., -.1,-42,1]?
Not that one you quoted in your previous reply. That was just for p[0]. Here's the code which should work for whole vector:
But now you're potentially taking 4 branch prediction misses!

Also keep in mind that accessing individual elements of a vector (other than the first) is not free as it always involves some type of shuffling.

edit: Whenever testing the performance of any optimized routine (especially if it's supposed to be used with audio), I highly suggest testing with white noise to make sure you're maximing the probability of any issues like poor branch prediction, in order to get a realistic idea of what the performance looks like in the worst-case, because it's ultimately the worst-case that matters for real-time programming, even if it's certainly nice to have a faster average case where possible.
Preferred pronouns would be "it/it" because according to this country, I'm a piece of human trash.

KVRian
655 posts since 21 Feb, 2006 from FI
mystran wrote:
Tue May 04, 2021 4:58 am
...
Also keep in mind that accessing individual elements of a vector (other than the first) is not free as it always involves some type of shuffling.
Yes, that, mostly unnecessary loop would decrease performance from ~2.2 to ~4.3 (by 2DaT's test routine) ... still 3x faster than tanh() (by same test).

KVRist
425 posts since 21 Jun, 2013
juha_p wrote:
Tue May 04, 2021 4:28 am

Code: Select all

``````    for(int i = 0; i < 4; i++)
{
if (std::abs(x[i]) >= 9.01f){       // float implementation seems to result ±1 at around 9.01f
p[i] = ((!signbit(p[i])) << 1) - 1;
}
}
``````
This defeats the purpose of SIMD in the first place (which is - to process multiple data with single instruction). And this compiles to horrible code. Indexing is not supported for __m128 types, you must never use it (and if it works, it's more expensive than working with actual arrays, esp. on writes).

I wouldn't bother with "cheap" approximation methods such as Pade, etc. Check out this version:
viewtopic.php?p=7503081#p7503081

Division is expensive compared to add-muls, might as well use a better approximation.

KVRAF

Topic Starter

6321 posts since 12 Feb, 2006 from Helsinki, Finland
2DaT wrote:
Tue May 04, 2021 12:02 pm
Division is expensive compared to add-muls, might as well use a better approximation.
Right.. and the original goal of this particular thread was to not use divisions in the first place. Anyone doing divisions is in the wrong thread.
Preferred pronouns would be "it/it" because according to this country, I'm a piece of human trash.

KVRian
655 posts since 21 Feb, 2006 from FI
2DaT wrote:
Tue May 04, 2021 12:02 pm
...
This defeats the purpose of SIMD in the first place (which is - to process multiple data with single instruction). And this compiles to horrible code. Indexing is not supported for __m128 types, you must never use it (and if it works, it's more expensive than working with actual arrays, esp. on writes).
...
As I stated earlier I'm not too good with intrinsics so, that is the result I ended up (first tried to search other solutions of course...). If you know how to do this properly then why not open it up a bit ... or wink some example source codes ?

KVRist
425 posts since 21 Jun, 2013
juha_p wrote:
Tue May 04, 2021 1:37 pm
As I stated earlier I'm not too good with intrinsics so, that is the result I ended up (first tried to search other solutions of course...). If you know how to do this properly then why not open it up a bit ... or wink some example source codes ?
Using _mm_min_ and _mm_max_ intrinsics. It does what you want without all that funky code.

KVRAF

Topic Starter

6321 posts since 12 Feb, 2006 from Helsinki, Finland
2DaT wrote:
Tue May 04, 2021 10:46 pm
juha_p wrote:
Tue May 04, 2021 1:37 pm
As I stated earlier I'm not too good with intrinsics so, that is the result I ended up (first tried to search other solutions of course...). If you know how to do this properly then why not open it up a bit ... or wink some example source codes ?
Using _mm_min_ and _mm_max_ intrinsics. It does what you want without all that funky code.
While min/max is a no-brainer here, it got me wondering: is AVX the minimum to actually do generalized branches?

Basically with AVX one can do _mm_cmp_ps (or the wider equivalents) to get a mask for conditionally selecting values, then follow up with PTEST (_mm_test_all_zeroes, etc; SSE4.1) to get CF or ZF to only compute the branches where at least one value is actually needed, so Turing equivalent computation is possible without ever unpacking the vectors, but is there some trick to do this with just SSE2 with sensible amount of shuffling?
Preferred pronouns would be "it/it" because according to this country, I'm a piece of human trash.