Table Lookup for tanh() versus other solutions

DSP, Plugin and Host development discussion.

Post

I am now hoping in vain that Apple gets a relatively transparent vectorized Clang/LLVM solution in place before they switch their low end laptops to ARM.
Don't do it my way.

Post

Borogove wrote:I am now hoping in vain that Apple gets a relatively transparent vectorized Clang/LLVM solution in place before they switch their low end laptops to ARM.
Are you talking about some kind of auto-vectorization? I think if you used the Accelerate framework instead of explicit intrinsics, you shouldn't have to do anything more than recompile.
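For example, something like this sketch (vvtanhf is the vForce call in Accelerate; the function and buffer names here are just for illustration):

Code: Select all

#include <Accelerate/Accelerate.h>

// Vectorized tanh over a whole buffer via vForce; Accelerate supplies
// the SIMD implementation for whatever architecture you compile for.
void block_tanh(float *out, const float *in, int n)
{
    vvtanhf(out, in, &n);
}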

Post

kuniklo wrote:
Borogove wrote:I am now hoping in vain that Apple gets a relatively transparent vectorized Clang/LLVM solution in place before they switch their low end laptops to ARM.
Are you talking about some kind of auto-vectorization? I think if you used the Accelerate framework instead of explicit intrinsics, you shouldn't have to do anything more than recompile.
f**k recompiling. f**k it right in the ear. With a rusty harpoon.

It's 2011.
Don't do it my way.

Post

Borogove wrote:
kuniklo wrote:
Borogove wrote:I am now hoping in vain that Apple gets a relatively transparent vectorized Clang/LLVM solution in place before they switch their low end laptops to ARM.
Are you talking about some kind of auto-vectorization? I think if you used the Accelerate framework instead of explicit intrinsics, you shouldn't have to do anything more than recompile.
f**k recompiling. f**k it right in the ear. With a rusty harpoon.
If Apple switches over to ARM, my guess is that recompiling will be the least of our worries. Having a laptop platform that none of the existing DAWs or plugins work on would allow Apple to institute all sorts of exciting new policies about what and how things can run on these computers.

But I digress. I'm currently working on my "higher quality" inverse square root sigmoid code, which uses a single iteration of the Newton-Raphson method. I'll post the code once it is working.

I'm not saying that I'm going to use the higher quality version in my new plugin. Seriously, 12 bits of accuracy might end up providing some very nice artifacts in my feedback loops. I was thinking about sticking a quantizer in there, but if I can get some artifacts for free, I'll take them.

I do wonder what the actual SNR of the ISRS code is without the Newton iteration, and whether this translates into noise that would be audible, or just some deviation from the "ideal" curve. If the SSE reciprocal square root instruction uses any form of interpolation for its table lookup, then the curve would still have a relatively smooth slope, even if the result were only piecewise linear. Of course, the table could be very small even with interpolation. I just don't have a feel for how the accuracy of the reciprocal square root function translates into output noise.
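One quick way I could put a number on it would be something like this (just a sketch, not code I've actually run): sweep the raw estimate against a double-precision reference and compare signal power to error power.

Code: Select all

#include <math.h>
#include <stdio.h>
#include <xmmintrin.h>

// Rough SNR of the raw rsqrt-based sigmoid against the
// double-precision reference x / sqrt(x*x + 1), swept over [-10, 10].
int main(void)
{
    double sigPow = 0.0, errPow = 0.0;
    for (int i = -100000; i <= 100000; i++) {
        float x = i * 1e-4f;
        float approx = x * _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x * x + 1.f)));
        double exact = x / sqrt((double)x * x + 1.0);
        double err = approx - exact;
        sigPow += exact * exact;
        errPow += err * err;
    }
    printf("SNR: %.1f dB\n", 10.0 * log10(sigPow / errPow));
    return 0;
}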

Sean Costello

Post

valhallasound wrote:If Apple switches over to ARM, my guess is that recompiling will be the least of our worries. Having a laptop platform that none of the existing DAWs or plugins work on would allow Apple to institute all sorts of exciting new policies about what and how things can run on these computers.
There's no way Apple puts out an ARM laptop without a transparent emulation layer to run existing apps. Performance may suffer, but DAWs will run. They've done it beautifully in the past, and they get better at it each time. The only question is whether most apps would be translated from LLVM bitcode to ARM or emulated from x86 to ARM.
Don't do it my way.

Post

Borogove wrote:If you support a high-quality mode in your algorithm, you should probably implement a version that does one iteration of Newton-Raphson refinement, which gets you better than 20-bit precision for about 3 times the cost (13-19 cycles instead of 5-6*, depending on architecture, which is still hell of cheap).
Please check out the following. Small text size necessary to preserve comments.

Code: Select all

// Calculate a block of the input signal, using the Inverse Square Root Sigmoid
// (http://www.tinygod.com/sigmoid at the bottom of the page)
// implements x / √( x² + 1 )
// accuracy improved via one iteration of the Newton-Raphson method
inline void ProcessISRSBlockPrecise (float *input, float *output, int blockSize)
{
    const float *localin = input;
    float *localout = output;
    const __m128 theOnes = _mm_set_ps1(1.f);
    const __m128 pointFiveVector = _mm_set_ps1(0.5f);
    const __m128 threeVector = _mm_set_ps1(3.f);

    for(int i = 0; i < blockSize; i += 4)
    {
        __m128 vin = _mm_load_ps(localin + i);    // load 4 input samples into an SSE register
        __m128 vtmp = _mm_mul_ps(vin, vin);       // in*in
        __m128 vtmp2 = _mm_add_ps(vtmp, theOnes); // c = in*in + 1.f

        // reciprocal square root, refined with 1 iteration of the
        // Newton-Raphson method; implements the following C code:
        // x = rsqrt_approx(c);
        // xprime = 0.5 * x * (3 - x*x*c);
        __m128 vrecip = _mm_rsqrt_ps(vtmp2);        // x = estimate of 1/sqrt(c)
        vtmp = _mm_mul_ps(vrecip, vtmp2);           // x*c
        vtmp2 = _mm_mul_ps(vtmp, vrecip);           // x*x*c
        vtmp = _mm_sub_ps(threeVector, vtmp2);      // 3 - x*x*c
        vtmp2 = _mm_mul_ps(vtmp, vrecip);           // x * (3 - x*x*c)
        vtmp = _mm_mul_ps(vtmp2, pointFiveVector);  // 0.5 * x * (3 - x*x*c) = refined 1/sqrt(c)
        __m128 vresult = _mm_mul_ps(vtmp, vin);     // in * (1/sqrt(in*in + 1))
        _mm_store_ps(localout + i, vresult);        // store 4 output samples
    }
}

This seems to work for me. I'm not hearing any audible improvements, but I need to do some more stringent listening. Please let me know if you see any obvious bugs in the code.
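For reference, here is a scalar transcription of one lane of that loop (a sketch; isrs_scalar is just an illustrative name):

Code: Select all

#include <xmmintrin.h>

// Scalar reference for one lane of the SSE loop above:
// y = x / sqrt(x*x + 1), with the rsqrt estimate refined by one
// Newton-Raphson step, exactly as in the vector version.
static inline float isrs_scalar(float x)
{
    float c = x * x + 1.f;
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(c))); // estimate
    r = 0.5f * r * (3.f - r * r * c);  // one Newton-Raphson iteration
    return x * r;
}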

Sean Costello

Post

I simplified it a bit:

Code: Select all

template <typename FTYPE>
void process(FTYPE* __restrict pout, FTYPE const * __restrict pin, int n)
{
	for (int i = 0; i < n; i++)
	{
		FTYPE input = pin[i];
		// rsqrt_1nri = reciprocal square root estimate plus one
		// Newton-Raphson iteration (helper defined elsewhere)
		pout[i] = input * rsqrt_1nri(input * input + 1.0f);
	}
}
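rsqrt_1nri isn't shown above; reading the name as "rsqrt with 1 Newton-Raphson iteration", a plain __m128 version would look something like this sketch:

Code: Select all

// Hypothetical definition of rsqrt_1nri for __m128: the hardware
// estimate refined by one Newton-Raphson step, x' = 0.5*x*(3 - x*x*c).
static inline __m128 rsqrt_1nri(__m128 c)
{
    const __m128 half  = _mm_set_ps1(0.5f);
    const __m128 three = _mm_set_ps1(3.f);
    __m128 x   = _mm_rsqrt_ps(c);                   // estimate
    __m128 xxc = _mm_mul_ps(_mm_mul_ps(x, x), c);   // x*x*c
    return _mm_mul_ps(_mm_mul_ps(half, x), _mm_sub_ps(three, xxc));
}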

Code: Select all

float4:

$LL3@process:
  00024 0f 28 0c 01      movaps  xmm1, XMMWORD PTR [ecx+eax]
  00028 0f 28 c1         movaps  xmm0, xmm1
  0002b 0f 59 c1         mulps   xmm0, xmm1
  0002e 0f 58 c3         addps   xmm0, xmm3
  00031 0f 52 d0         rsqrtps xmm2, xmm0
  00034 0f 28 f2         movaps  xmm6, xmm2
  00037 0f 59 f2         mulps   xmm6, xmm2
  0003a 0f 59 f0         mulps   xmm6, xmm0
  0003d 0f 28 c4         movaps  xmm0, xmm4
  00040 0f 5c c6         subps   xmm0, xmm6
  00043 0f 28 f5         movaps  xmm6, xmm5
  00046 0f 59 f2         mulps   xmm6, xmm2
  00049 0f 59 c6         mulps   xmm0, xmm6
  0004c 0f 59 c1         mulps   xmm0, xmm1
  0004f 0f 29 00         movaps  XMMWORD PTR [eax], xmm0
  00052 83 c0 10         add     eax, 16                        ; 00000010H
  00055 4a               dec     edx
  00056 75 cc            jne     SHORT $LL3@process


float4 VEX:

$LL3@process:
  00030 c5 f8 28 04 01   vmovaps xmm0, XMMWORD PTR [ecx+eax]
  00035 c5 f8 59 c8      vmulps  xmm1, xmm0, xmm0
  00039 c5 f0 58 d3      vaddps  xmm2, xmm1, xmm3
  0003d c5 f8 52 ca      vrsqrtps xmm1, xmm2
  00041 c5 f0 59 f1      vmulps  xmm6, xmm1, xmm1
  00045 c5 c8 59 d2      vmulps  xmm2, xmm6, xmm2
  00049 c5 d8 5c d2      vsubps  xmm2, xmm4, xmm2
  0004d c5 d0 59 c9      vmulps  xmm1, xmm5, xmm1
  00051 c5 e8 59 c9      vmulps  xmm1, xmm2, xmm1
  00055 c5 f0 59 c0      vmulps  xmm0, xmm1, xmm0
  00059 c5 f8 29 00      vmovaps XMMWORD PTR [eax], xmm0
  0005d 83 c0 10         add     eax, 16                        ; 00000010H
  00060 4a               dec     edx
  00061 75 cd            jne     SHORT $LL3@process


float8 AVX:

$LL3@process:
  00030 c5 fc 28 04 01   vmovaps ymm0, YMMWORD PTR [ecx+eax]
  00035 c5 fc 59 c8      vmulps  ymm1, ymm0, ymm0
  00039 c5 f4 58 d3      vaddps  ymm2, ymm1, ymm3
  0003d c5 fc 52 ca      vrsqrtps ymm1, ymm2
  00041 c5 f4 59 f1      vmulps  ymm6, ymm1, ymm1
  00045 c5 cc 59 d2      vmulps  ymm2, ymm6, ymm2
  00049 c5 dc 5c d2      vsubps  ymm2, ymm4, ymm2
  0004d c5 d4 59 c9      vmulps  ymm1, ymm5, ymm1
  00051 c5 ec 59 c9      vmulps  ymm1, ymm2, ymm1
  00055 c5 f4 59 c0      vmulps  ymm0, ymm1, ymm0
  00059 c5 fc 29 00      vmovaps YMMWORD PTR [eax], ymm0
  0005d 83 c0 20         add     eax, 32                        ; 00000020H
  00060 4a               dec     edx
  00061 75 cd            jne     SHORT $LL3@process

Post

valhallasound wrote: Please check out the following. Small text size necessary to preserve comments.

Code: Select all

// Calculate a block of the input signal, using the Inverse Square Root Sigmoid
// (http://www.tinygod.com/sigmoid at the bottom of the page)
// implements x / √( x² + 1 )
// accuracy improved via one iteration of the Newton-Raphson method
inline void ProcessISRSBlockPrecise (float *input, float *output, int blockSize)
{
    const float *localin = input;
    float *localout = output;
    const __m128 theOnes = _mm_set_ps1(1.f);
    const __m128 pointFiveVector = _mm_set_ps1(0.5f);
    const __m128 threeVector = _mm_set_ps1(3.f);

    for(int i = 0; i < blockSize; i += 4)
    {
        __m128 vin = _mm_load_ps(localin + i);    // load 4 input samples into an SSE register
        __m128 vtmp = _mm_mul_ps(vin, vin);       // in*in
        __m128 vtmp2 = _mm_add_ps(vtmp, theOnes); // c = in*in + 1.f

        // reciprocal square root, refined with 1 iteration of the
        // Newton-Raphson method; implements the following C code:
        // x = rsqrt_approx(c);
        // xprime = 0.5 * x * (3 - x*x*c);
        __m128 vrecip = _mm_rsqrt_ps(vtmp2);        // x = estimate of 1/sqrt(c)
        vtmp = _mm_mul_ps(vrecip, vtmp2);           // x*c
        vtmp2 = _mm_mul_ps(vtmp, vrecip);           // x*x*c
        vtmp = _mm_sub_ps(threeVector, vtmp2);      // 3 - x*x*c
        vtmp2 = _mm_mul_ps(vtmp, vrecip);           // x * (3 - x*x*c)
        vtmp = _mm_mul_ps(vtmp2, pointFiveVector);  // 0.5 * x * (3 - x*x*c) = refined 1/sqrt(c)
        __m128 vresult = _mm_mul_ps(vtmp, vin);     // in * (1/sqrt(in*in + 1))
        _mm_store_ps(localout + i, vresult);        // store 4 output samples
    }
}

This seems to work for me. I'm not hearing any audible improvements, but I need to do some more stringent listening. Please let me know if you see any obvious bugs in the code.

Sean Costello
Sorry to resurrect a thread that is getting a little old, but I am learning SSE instructions, and this is one of the prime examples I was able to find on this board.

I'm not sure if this is an actual issue, but shouldn't you use _mm_loadu_ps instead of _mm_load_ps for the input load? I believe that _mm_load_ps assumes the pointer is 16-byte aligned, and your float pointer localin is not aligned unless input is aligned. I think in order to align it you need to do something like

Code: Select all

float *localIn = (float*) _aligned_malloc(blockSize * sizeof(float), 16);
and then copy input's values manually into the newly allocated aligned array. Does that make sense? So you either need to use loadu instead of load, or you have to align and copy manually. I'm guessing that if it works as-is, then either input and output are already aligned, or maybe you are lucky and it works on your system, or maybe the compiler is smart and did it for you... if I had to guess, I would say the compiler is handling it for you, which I think may be a big performance hit. Not sure.
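One defensive option I can think of (just a sketch) is to assert the alignment in debug builds, so a misaligned buffer fails loudly instead of crashing inside the load:

Code: Select all

#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

// Debug-build guard: _mm_load_ps requires a 16-byte-aligned address,
// so make the precondition explicit instead of crashing mysteriously.
static inline __m128 load_aligned4(const float *p)
{
    assert(((uintptr_t)p & 15u) == 0 && "buffer not 16-byte aligned");
    return _mm_load_ps(p);
}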

I am also unsure as to why you're using localin and localout at all instead of using input and output directly - does this have to do with alignment, or something else?

Thanks for any insight,

-Colin

Post

noisetteaudio wrote: I'm guessing that if it works as-is then either input and output are already aligned
They are. Things would crash horribly in the XP versions of my code if they weren't aligned. I'm not sure exactly why they crash, but I have run into this in the past.
I am also unsure as to why you're using localin and localout at all instead of using input and output directly - does this have to do with alignment, or something else?
This is more out of habit than anything else. A lot of my code uses the same pointer for the input and output, and sometimes bad things happen if the low-level vector functions overwrite the input vector. I could probably change things in this function without worrying about overwriting anything - I'll take a second look at it.
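For what it's worth, in-place use looks safe in this particular loop, since each group of four samples is loaded before the corresponding store writes it back (the __restrict template version posted above would rule out that aliasing). Something like this, with hypothetical buffer names:

Code: Select all

// In-place call: fine for ProcessISRSBlockPrecise as written, because
// every _mm_load_ps happens before the _mm_store_ps that overwrites
// the same four samples.
ProcessISRSBlockPrecise(buffer, buffer, blockSize);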

Sean Costello

Post

So the pointers you're passing to the function are allocated elsewhere in a way that ensures they are aligned? Have you found a good cross-platform way to allocate aligned arrays? E.g., _aligned_malloc, but GCC-compatible.

FWIW, I have found that on my new MacBook Pro with 10.7, trying to use load instead of loadu (for example) on a non-aligned array will also cause a crash. I bet mileage varies on different architectures/OSes.

-Colin

Post

noisetteaudio wrote:So the pointers you're passing to the function are allocated somewhere elsewhere in a way that ensures they are aligned? Have you found a good cross-platform way to allocate aligned arrays? Eg, _malloc_aligned but GCC compatible.
In my OSX vector code, I declare the following:

Code: Select all

#define _aligned_malloc(N, A) malloc(N)
#define _aligned_free(P) free(P)
This was based on a suggestion from this subforum, where someone pointed out that all allocations from malloc in the GCC used in Xcode are automatically aligned to 16-byte boundaries. I found lots of confirmation of this in online searches. Otherwise, you could use some cross-platform alignment code in the above #define statements.
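For a genuinely cross-platform allocator, a sketch along these lines should work (the aligned_alloc16 name is made up): _aligned_malloc on Windows, posix_memalign elsewhere.

Code: Select all

#include <stdlib.h>

// Cross-platform 16-byte-aligned allocation sketch.
#ifdef _WIN32
#include <malloc.h>
#define aligned_alloc16(N) _aligned_malloc((N), 16)
#define aligned_free16(P)  _aligned_free(P)
#else
static inline void *aligned_alloc16(size_t n)
{
    void *p = NULL;
    return (posix_memalign(&p, 16, n) == 0) ? p : NULL;
}
#define aligned_free16(P)  free(P)
#endif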
FWIW, I have found that on my new MacBook Pro with 10.7, trying to use load instead of loadu (for example) on a non-aligned array will also cause a crash. I bet on different architectures/OSs mileage varies.
The crashes can be kind of helpful, in that they keep bad coding from creeping into other OSes/DAWs that are less tolerant.

An example of this: I have written plenty of AUs that work great in Ableton Live but crash in Logic. I tend to swear at Logic in these cases, but invariably the crash is due to some code error on my part.

Sean Costello

Post

Interesting. I will investigate thoroughly where this crash is coming from, as it shows up quite reliably when I switch from loadu to load without doing any alignment - but I believe you that allocations on the Mac are supposed to be 16-byte aligned (although, to be fair, I am using LLVM-GCC right now).

Post

My obsessive nature turned its attention to tanh() recently, and I figured this thread would be as good a place as any to post an update on what I found. I tried a whole bunch of different things and have come up with what I consider a gold-plated SIMD tanh that is only moderately more expensive than the "good enough for rock'n'roll" one that Sean posted earlier.

First, Urs wondered how much error the recip sqrt estimate would add, and Sean wondered whether it would create artifacts or just a different shape to the curve. The answer is that the error is about 70 dB down. It seems to be piecewise linear (a continuous function of the input, in other words), so not too bad with most inputs. But put a 20 Hz sine into it and what you get sounds aliasy and digital. So it is something you could probably get away with, but I would not want it for the absolute highest quality work. I would do the Newton-Raphson step, which seems to put the non-harmonic error down well below 120 dB.

Second, on the difference between the sqrt-based sigmoid and "real" tanh. The wonderful thing about tanh is that the higher harmonics in the spectrum drop off rapidly, so even a small amount of oversampling will make the aliased harmonics all but inaudible (4x would seem to be good enough for just about all practical purposes). The harmonics of the sqrt-based sigmoid also fall off quickly (it is very good), but not quite as fast as with real tanh. For example, with a sine wave of amplitude 1, the seventh harmonic is about 10 dB higher in the sqrt sigmoid than in tanh. This is audible, for example, when putting a 6200 Hz sine in, where the seventh harmonic (43.4 kHz) aliases down to 700 Hz at a 44.1 kHz sampling rate. (Note that this is also a 12.4 kHz sine input with 2x oversampling.) So, again, sqrt is good, but tanh is a little better.

Fortunately, it's not too bad to compute a moderately high accuracy approximation:

Code: Select all

a = x + 0.16489087 * x**3 + 0.00985468 * x**5
return a / sqrt(1 + a * a)
This function has the same shape as tanh and an almost identical harmonic structure. In particular, it doesn't need any clamping for large inputs, as it has the correct asymptotic approach to +/-1. It's within 2e-4 of the real tanh, which is not super impressive from a strictly numerical point of view, but it gives some idea of how closely it fits. By comparison, x/sqrt(1 + x^2) has a max error of 7.4e-2 as an approximation to tanh (i.e., almost 400 times as much error).

More importantly, the approximation is made out of smooth functions, so whatever small error is left comes out as low harmonics, where it is totally masked by the legitimate harmonic distortion of the input signal. By contrast, I found that approximations with abs() or clamping tended to push the error into high harmonics. In all my poking and prodding with sine waves, I was not able to find a case where you could audibly tell the difference between this approximation and calling tanh() directly.

A straightforward SIMD implementation takes about 1.04 ns per tanh on my MacBook Pro, which is about 1.7x the time of the sqrt one with the Newton-Raphson step (and still about 8x faster than calling tanhf). So there are likely few applications in which the extra CPU cost of the higher quality approximation would be an issue.
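Spelled out as SSE, the whole thing looks something like this (a sketch of the description above, not the exact code I benchmarked):

Code: Select all

#include <xmmintrin.h>

// Sketch of the polynomial-preshaped sigmoid described above:
//   a = x + 0.16489087*x^3 + 0.00985468*x^5
//   tanh(x) ~= a / sqrt(1 + a*a)
// with _mm_rsqrt_ps refined by one Newton-Raphson step.
static inline __m128 tanh4_approx(__m128 x)
{
    const __m128 c3    = _mm_set_ps1(0.16489087f);
    const __m128 c5    = _mm_set_ps1(0.00985468f);
    const __m128 one   = _mm_set_ps1(1.f);
    const __m128 half  = _mm_set_ps1(0.5f);
    const __m128 three = _mm_set_ps1(3.f);

    // a = x * (1 + x^2*(c3 + c5*x^2)), Horner form of the polynomial
    __m128 x2   = _mm_mul_ps(x, x);
    __m128 poly = _mm_add_ps(c3, _mm_mul_ps(c5, x2));
    __m128 a    = _mm_mul_ps(x, _mm_add_ps(one, _mm_mul_ps(x2, poly)));

    // a / sqrt(1 + a*a) via the refined reciprocal square root
    __m128 c   = _mm_add_ps(_mm_mul_ps(a, a), one);
    __m128 r   = _mm_rsqrt_ps(c);                    // estimate
    __m128 rrc = _mm_mul_ps(_mm_mul_ps(r, r), c);    // r*r*c
    r = _mm_mul_ps(_mm_mul_ps(half, r), _mm_sub_ps(three, rrc));
    return _mm_mul_ps(a, r);
}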

So I'd say this is what you should use when you want the absolute highest quality, though other approximations may well be good enough for most applications.

I could clean up and post my IPython notebook for this if people want to see it.

Post

raphx wrote:My obsessive nature turned its attention to tanh() recently
Cool!
First, Urs wondered how much error the recip sqrt estimate would add, and Sean wondered whether it would create artifacts or just a different shape to the curve. The answer is that the error is about 70 dB down. It seems to be piecewise linear (a continuous function of the input, in other words), so not too bad with most inputs. But put a 20 Hz sine into it and what you get sounds aliasy and digital. So it is something you could probably get away with, but I would not want it for the absolute highest quality work. I would do the Newton-Raphson step, which seems to put the non-harmonic error down well below 120 dB.
I use the Newton-Raphson step in my release code (ValhallaÜberMod is where I made use of this). Now I'm interested in the aliasing noise you describe without the Newton-Raphson step. I might be able to use that in the future. :D
Second, on the difference between the sqrt-based sigmoid and "real" tanh. The wonderful thing about tanh is that the higher harmonics in the spectrum drop off rapidly, so even a small amount of oversampling will make the aliased harmonics all but inaudible (4x would seem to be good enough for just about all practical purposes). The harmonics of the sqrt-based sigmoid also fall off quickly (it is very good), but not quite as fast as with real tanh. For example, with a sine wave of amplitude 1, the seventh harmonic is about 10 dB higher in the sqrt sigmoid than in tanh. This is audible, for example, when putting a 6200 Hz sine in, where the seventh harmonic (43.4 kHz) aliases down to 700 Hz at a 44.1 kHz sampling rate. (Note that this is also a 12.4 kHz sine input with 2x oversampling.) So, again, sqrt is good, but tanh is a little better.
That is very interesting. I hadn't done any real studies of the harmonic generation of the sqrt sigmoid.
Fortunately, it's not too bad to compute a moderately high accuracy approximation:

Code: Select all

a = x + 0.16489087 * x**3 + 0.00985468 * x**5
return a / sqrt(1 + a * a)
This function has the same shape as tanh and an almost identical harmonic structure. In particular, it doesn't need any clamping for large inputs, as it has the correct asymptotic approach to +/-1. It's within 2e-4 of the real tanh, which is not super impressive from a strictly numerical point of view, but it gives some idea of how closely it fits. By comparison, x/sqrt(1 + x^2) has a max error of 7.4e-2 as an approximation to tanh (i.e., almost 400 times as much error).
What is interesting about this approximation is that it is essentially the sqrt sigmoid, with a polynomial used to pre-process the input signal. Is this correct?

If this IS correct, the next question: does this code also benefit from the Newton-Raphson method? It seems like the same reciprocal square root approximation issues would arise when using SSE2 or other SIMD variants that have an approximation to the reciprocal square root.

The follow-up to the above question: if the tanh() approximation benefits from the Newton-Raphson method, what is the performance of the tanh() approximation in SIMD, compared to other tanh() approximations?

Sean Costello

Post

Also, hehehe, if a Newton-Raphson step is not required, I wonder whether the reciprocal square root estimate is guaranteed to produce the same results on every CPU - I remember that the G5 has a far worse version than the G4.

It does sound like a great method though… I'll check it out shortly and compare it to our Padé-based version. I guess not needing boundary checks will already provide a speed bump :)
